aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0

How can I read the content of a feather file? #353

Closed chansigit closed 2 years ago

chansigit commented 2 years ago

I am curious about what is inside a feather file from the cistarget database. Can you please explain what it contains and how we can read it?

chansigit commented 2 years ago

Could you also elaborate on what the genome ranking of the motifs is? I read the SCENIC/pySCENIC paper and found few clues.

Thank you

SeppeDeWinter commented 2 years ago

The feather files are used for the motif enrichment analysis part (i.e. pruning step).

These files contain a matrix with one axis representing genes and the other axis representing motifs. The values in this matrix are rankings for each motif across the genes.

To generate this matrix, genomic regions surrounding each gene are first gathered: all non-coding regions located in the neighbourhood of a gene are assigned to that gene. These regions include the promoter regions upstream and downstream of the transcription start site (TSS). For human and mouse, the search space around each gene is set to 20 kb around the TSS.

Next, each region is scored for motifs using cluster-buster (https://github.com/weng-lab/cluster-buster). Because each gene can have multiple regions (and thus multiple CRM-scores), we take the maximum CRM-score over all regions linked to a gene and assign it as the motif score for that gene.

Finally, a ranking is generated for each motif across all genes based on these (max) CRM-scores.
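As a rough illustration of the last two steps (max over regions, then ranking across genes), here is a minimal pandas sketch. The motif names, gene names, region names and scores are made up, and the tie-breaking rule and 1-based ranks are choices made for this sketch only; this is not the code used to build the actual databases.

```python
import pandas as pd

# Hypothetical CRM-scores: one row per (motif, region) pair.
# Regions r1 and r2 are assigned to gene A, r3 to gene B, r4 to gene C.
scores = pd.DataFrame(
    {
        "motif": ["m1", "m1", "m1", "m1", "m2", "m2", "m2", "m2"],
        "region": ["r1", "r2", "r3", "r4", "r1", "r2", "r3", "r4"],
        "gene": ["A", "A", "B", "C", "A", "A", "B", "C"],
        "crm_score": [0.2, 0.9, 0.5, 0.1, 0.3, 0.1, 0.8, 0.6],
    }
)

# Step 1: per motif and gene, keep the max CRM-score over all linked regions.
gene_scores = (
    scores.groupby(["motif", "gene"])["crm_score"].max().unstack("gene")
)

# Step 2: rank the genes for each motif, best score first.
# Ranks start at 1 here; ties are broken by column order (sketch assumption).
rankings = gene_scores.rank(axis=1, ascending=False, method="first").astype(int)

print(rankings)
```

The result is a motifs-by-genes table in which each row is a permutation of the ranks, which is the kind of matrix described above.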

We have a separate GitHub repository which contains functions to generate such a database, see https://github.com/aertslab/create_cisTarget_databases

To read these files you need a Feather reader, for example pandas.read_feather. However, these files are quite big, so this can take a long time (and a lot of memory).
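A minimal sketch of how this could look with pandas (which needs pyarrow installed to read Feather). The `columns` argument of pandas.read_feather limits what is loaded, which helps with the memory issue. The helper names, the file path and the column names in the usage example are hypothetical, and the sketch assumes the file has exactly one non-numeric column holding the motif names; check `df.columns` on your own file first.

```python
import pandas as pd


def read_ranking_subset(path, columns=None):
    """Read a Feather file, optionally loading only the listed columns."""
    # columns=None loads the whole file, which can need a lot of memory.
    return pd.read_feather(path, columns=columns)


def set_label_index(df):
    """Use the single non-numeric column as the row index.

    Assumption of this sketch: the rankings are numeric and exactly one
    column holds text labels (e.g. the motif names).
    """
    label_cols = df.select_dtypes(exclude="number").columns
    if len(label_cols) != 1:
        raise ValueError(f"expected one label column, found {list(label_cols)}")
    return df.set_index(label_cols[0])


if __name__ == "__main__":
    # Hypothetical path and column names; replace with your own.
    df = read_ranking_subset(
        "my_database.rankings.feather",
        columns=["GENE_A", "GENE_B", "motifs"],
    )
    rankings = set_label_index(df)
    print(rankings.head())
```

Loading only the genes you care about keeps the DataFrame small, while the rows still cover every motif in the database.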

For more info you can also read:

Imrichová,H., Hulselmans,G., Kalender Atak,Z., Potier,D. and Aerts,S. (2015) i-cisTarget 2015 update: generalized cis-regulatory enrichment analysis in human, mouse and fly. Nucleic Acids Res. doi: 10.1093/nar/gkv395

Herrmann,C., Van de Sande,B., Potier,D. and Aerts,S. (2012) i-cisTarget: an integrative genomics method for the prediction of regulatory features and cis-regulatory modules. Nucleic Acids Res. doi: 10.1093/nar/gks543

Does this answer your questions?

SeppeDeWinter commented 2 years ago

Closing this issue due to inactivity. Feel free to reopen it if you have further questions.