aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0

[BUG] Less than 80% of the genes [...] warning with cisTarget step + Pruning gets rid of many regulons #534

Open VincentGardeux opened 5 months ago

VincentGardeux commented 5 months ago

Hi,

I'm setting up pySCENIC for generating regulons from two Fly brain scRNAseq datasets we generated. The goal is to reproduce to some extent what was done in the aging Drosophila brain atlas (Davie et al, 2018). I especially want to reproduce the separation in Figure 4 based on (dati, pros, Imp, scro) regulons.

I'm running pySCENIC v0.12.1 in JupyterHub through Docker, with the latest motif and feather files (v10): mc_v10_clust. Everything seems to run OK using this (slightly tuned) tutorial.

I'm writing this issue because I think there is unwanted behavior in the prune2df function, and also as a guide for future users, since I struggled quite a lot to make this work.

First, I would encourage people to use the Python version (pySCENIC) instead of the R version or the Nextflow pipeline, which are either no longer maintained or don't work with the latest databases.

When running the tutorial, I got many warnings from the prune2df function. The run still finished, and I ended up with 46 regulons, which did not seem like a lot, especially because the authors of the (Davie et al, 2018) paper appear to have 163 regulons (Table S4). The main issue is that I didn't get the important regulons (dati, pros, scro) that I was looking for.

I hope this can help other users. To the authors/maintainers: please give me your opinion on point 3, as I found this behavior weird and unexpected.

caochch commented 5 months ago

Any progress? What is the expected number of modules or regulons from the ctx step (i.e., the rows of the output file)?

Flu09 commented 2 months ago

any news?

rrydbirk commented 2 months ago

I can confirm that following solutions 2 and 3 increases the number of regulons significantly! It also completely eliminated the warnings for me.

colin893 commented 2 months ago

Hi. I have an issue while running pySCENIC on a zebrafish dataset. I downloaded the Stanaka motifs database and ran my pySCENIC analysis. Just to add, I have successfully run pySCENIC on both mouse and human datasets.

First I got an error due to the wrong version format of the file, which I converted to v2 without problems. Then I generated the adj.csv file, which completed without errors and is populated. The problem arises with the command pyscenic ctx: I got the usual warnings "Less than 80% of the genes in Regulon for XXXX could be mapped to v2_zf1.genes_vs_motifs.rankings. Skipping this module." but at the end, my reg.csv file is empty.

So I imported the genes from the Stanaka feather file and intersected them with the genes from my expression matrix, just to check that there is a significant overlap between gene names. With 23000 and 20000 genes in the two lists respectively, I got about 15000 genes in common. Given this, my understanding is that some regulons should survive the pruning step. So I am wondering whether something else is causing this?
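
A large global overlap does not by itself guarantee that each individual module clears the 80% mark quoted in the warning. Below is a minimal sketch of a per-module check, assuming the threshold is applied to each module's gene list separately (as the warning text suggests). The function names and the toy gene names are hypothetical and are not part of the pySCENIC API:

```python
def mapped_fraction(module_genes, db_genes):
    """Fraction of a module's genes that are present in the ranking database."""
    module_genes = set(module_genes)
    if not module_genes:
        return 0.0
    return len(module_genes & set(db_genes)) / len(module_genes)


def modules_below_threshold(modules, db_genes, threshold=0.8):
    """Return {module name: mapped fraction} for modules under the threshold.

    The 0.8 default mirrors the "Less than 80%" warning; whether pySCENIC
    applies it exactly this way is an assumption of this sketch.
    """
    db_genes = set(db_genes)
    low = {}
    for name, genes in modules.items():
        frac = mapped_fraction(genes, db_genes)
        if frac < threshold:
            low[name] = frac
    return low


# Toy data with hypothetical gene names.
db_genes = {"g1", "g2", "g3", "g4"}
modules = {
    "tfA": ["g1", "g2", "g3", "g4", "x"],  # 4 of 5 genes are in the database
    "tfB": ["g1", "x", "y", "z"],          # 1 of 4 genes is in the database
}
for name, frac in modules_below_threshold(modules, db_genes).items():
    print(f"{name}: only {frac:.0%} of genes mapped")
```

If the 20000-gene list is the expression matrix, 15000 genes in common is 75% overall, so individual modules could plausibly fall under 80% even though the overlap looks large.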

Thank you in advance for your time,

VincentGardeux commented 2 months ago

Hi @colin893,

I'm still following the updates on this issue, but so far I have not seen any answer from the devs/authors. :(

For your issue, did you try all the solutions I suggested in my post? In my case they substantially increased the number of regulons.

If you still lose all your regulons at the pruning stage, you can probably remove the pruning completely using filter_for_annotation=False. I have to admit that I tend to do that now, because my "preferred regulons" are always filtered out for reasons I don't understand.

Cheers

colin893 commented 2 months ago

Hi @VincentGardeux ,

When I posted, I had only verified that there is overlap between the gene names in my expression matrix and the gene names in the databases.

Thanks for your hints. I just launched the ctx command with --no_pruning and got 599 regulons. I guess there might be a way to keep the pruning but, since I successfully analyzed human/mouse data, my suspicion is that the default values used for those are too stringent for zebrafish analyses. I'll have to run tests and modify some threshold values to test that hypothesis.

I'll update my response depending on what I get, thanks for your help!