.. image:: img/arboreto.png
    :alt: arboreto
    :scale: 100%
    :align: left

.. image:: https://travis-ci.com/aertslab/arboreto.svg?branch=master
    :alt: Build Status
    :target: https://travis-ci.com/aertslab/arboreto

.. image:: https://readthedocs.org/projects/arboreto/badge/?version=latest
    :alt: Documentation Status
    :target: http://arboreto.readthedocs.io/en/latest/?badge=latest

.. image:: https://anaconda.org/bioconda/arboreto/badges/version.svg
    :alt: Bioconda package
    :target: https://anaconda.org/bioconda/arboreto

.. image:: https://img.shields.io/pypi/v/arboreto
    :alt: PyPI package
    :target: https://pypi.org/project/arboreto/

.. epigraph::

    *The most satisfactory definition of man from the scientific point of view is probably Man the Tool-maker.*

.. _arboreto: https://arboreto.readthedocs.io
.. _`arboreto documentation`: https://arboreto.readthedocs.io
.. _notebooks: https://github.com/tmoerman/arboreto/tree/master/notebooks
.. _issue: https://github.com/tmoerman/arboreto/issues/new
.. _dask: https://dask.pydata.org/en/latest/
.. _`dask distributed`: https://distributed.readthedocs.io/en/latest/
.. _GENIE3: http://www.montefiore.ulg.ac.be/~huynh-thu/GENIE3.html
.. _`Random Forest`: https://en.wikipedia.org/wiki/Random_forest
.. _ExtraTrees: https://en.wikipedia.org/wiki/Random_forest#ExtraTrees
.. _`Stochastic Gradient Boosting Machine`: https://en.wikipedia.org/wiki/Gradient_boosting#Stochastic_gradient_boosting
.. _`early-stopping`: https://en.wikipedia.org/wiki/Early_stopping

Inferring a gene regulatory network (GRN) from gene expression data is a computationally expensive task, exacerbated by increasing data sizes due to advances in high-throughput gene profiling technology.
The arboreto software library addresses this issue by providing a computational strategy that allows executing the class of GRN inference algorithms exemplified by GENIE3 [1] on hardware ranging from a single computer to a multi-node compute cluster. This class of GRN inference algorithms is defined by a series of steps, one for each target gene in the dataset, where the most important candidates from a set of regulators are determined from a regression model to predict a target gene's expression profile.
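
The per-target loop described above can be sketched in plain Python. This is an illustration only, not arboreto's implementation: absolute Pearson correlation stands in for the importance scores that a tree-based regression model would provide, and the names ``pearson`` and ``infer_links`` as well as the toy expression values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def infer_links(expression, regulators, top_k=1):
    """Run one independent step per target gene and keep its top_k regulators."""
    links = []
    for target, target_profile in expression.items():
        # A gene is never considered as its own regulator.
        candidates = [r for r in regulators if r != target]
        # Stand-in importance score: absolute correlation with the target profile.
        scored = [(abs(pearson(expression[r], target_profile)), r) for r in candidates]
        scored.sort(reverse=True)
        links.extend((r, target, score) for score, r in scored[:top_k])
    return links


# Hypothetical toy data: gene name -> expression values across five samples.
expression = {
    "tf_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "tf_b": [5.0, 1.0, 4.0, 2.0, 3.0],
    "gene_x": [2.1, 3.9, 6.2, 8.0, 9.9],  # follows tf_a
    "gene_y": [4.8, 1.2, 4.1, 1.9, 3.2],  # follows tf_b
}
links = infer_links(expression, regulators=["tf_a", "tf_b"])
```

Because each iteration of the loop reads only the shared expression data and writes only its own result, the per-target steps can run in any order or at the same time.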

Members of the above class of GRN inference algorithms are attractive from a computational point of view because they are parallelizable by nature. In arboreto, we specify the parallelizable computation as a dask graph [2], a data structure that represents the task schedule of a computation. A dask scheduler assigns the tasks in a dask graph to the available computational resources. Arboreto uses the dask distributed scheduler to spread out the computational tasks over multiple processes running on one or multiple machines.

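
To make the idea of a task graph concrete, the sketch below represents a computation as a plain dictionary that maps each key to either a value or a tuple of a function and its arguments, which mirrors the general shape of a dask graph. The ``run_graph`` executor is a hypothetical, deliberately simple stand-in for a scheduler, built on the standard library's thread pool. It is not how dask or dask distributed schedule work, and the task names are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def score_target(name, n_regulators):
    """Placeholder for one per-target regression task."""
    return (name, n_regulators)


def merge(*results):
    """Combine the per-target results into one sorted list."""
    return sorted(results)


def is_task(value):
    """A task is a non-empty tuple whose first element is callable."""
    return isinstance(value, tuple) and len(value) > 0 and callable(value[0])


def run_graph(graph, max_workers=4):
    """Toy scheduler: repeatedly submit every task whose dependencies are computed."""
    def is_key(arg):
        return isinstance(arg, str) and arg in graph

    results = {k: v for k, v in graph.items() if not is_task(v)}
    pending = {k: v for k, v in graph.items() if is_task(v)}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            ready = [
                k for k, (_, *args) in pending.items()
                if all(a in results for a in args if is_key(a))
            ]
            if not ready:
                raise ValueError("graph has a cycle or a missing dependency")
            futures = {}
            for k in ready:
                func, *args = pending.pop(k)
                # Replace key arguments by their computed values.
                resolved = [results[a] if is_key(a) else a for a in args]
                futures[k] = pool.submit(func, *resolved)
            for k, future in futures.items():
                results[k] = future.result()
    return results


# key -> literal value, or (function, *arguments); string arguments that are
# keys of the graph are treated as dependencies.
graph = {
    "n_regulators": 3,
    "target-a": (score_target, "a", "n_regulators"),
    "target-b": (score_target, "b", "n_regulators"),
    "network": (merge, "target-a", "target-b"),
}
results = run_graph(graph)
```

Here the two ``target-*`` tasks do not depend on each other, so the toy scheduler submits them in the same round; only ``network`` has to wait for both.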

Arboreto currently supports two GRN inference algorithms:

1. Stochastic Gradient Boosting Machine (SGBM) [3] regression with early-stopping regularization.
2. Random Forest (RF) or ExtraTrees (ET) regression.

Get Started
***********

Arboreto was conceived with the working bioinformatician or data scientist in mind. We provide extensive documentation and examples to help you get up to speed with the library.

* `arboreto documentation <https://arboreto.readthedocs.io>`__

License
*******

BSD 3-Clause License

.. _pySCENIC: https://github.com/aertslab/pySCENIC
.. _SCENIC: https://aertslab.org/#scenic

Arboreto is a component of pySCENIC_, a lightning-fast Python implementation of the SCENIC_ pipeline [5] (Single-Cell rEgulatory Network Inference and Clustering), which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.

References
**********