[BUG] Less than 80% of the genes [...] warning with cisTarget step + Pruning gets rid of many regulons

VincentGardeux commented 5 months ago

Hi,

I'm setting up pySCENIC for generating regulons from two Fly brain scRNAseq datasets we generated. The goal is to reproduce to some extent what was done in the aging Drosophila brain atlas (Davie et al, 2018). I especially want to reproduce the separation in Figure 4 based on (dati, pros, Imp, scro) regulons.

I'm running pySCENIC v0.12.1 in JupyterHub through Docker, with the latest motif and feather files (v10): mc_v10_clust. Everything seems to run OK using this (slightly tuned) tutorial.

I'm writing this issue because I think that there is an unwanted behavior in the prune2df function. And also as a guide for future users, since I struggled quite a lot to make this work.

First, I would encourage people to use the pySCENIC version (instead of the R, or the nextflow pipeline, whose maintenance is discontinued or doesn't work with the latest databases).

When running the tutorial, I got many warnings with the prune2df function. But the run still finished, and I ended up with 46 regulons... which did not seem like a lot. Especially because in the (Davie et al, 2018) paper the authors appear to have 163 regulons (Table S4). The main issue is that I didn't get the (dati, pros, scro) important regulons that I was looking for.

Problem 1: When running the cisTarget part (specifically, the prune2df function), I quickly get tons of warnings: Less than 80% of the genes in ... could be mapped to ... This was already reported (but never really solved) in other issues: #466 #515 #506 #325 #177 and pull requests: #387
Problem 2: Somewhat related to problem 1: the number of regulons I get is very small compared to what I would expect.
Solution 1: I found out that using the rho_mask_dropouts=True option in the step before (modules_from_adjacencies) increases the number of regulons to 81, but still without the regulons I was looking for (and still with all the warnings).
Solution 2: So I set out to get rid of the warnings in the prune2df function. I thought the problem came from the .feather files, whose genes were not overlapping with the genes in my expression matrix. So I extracted the genes used in the feather file, transformed them to Flybase/Ensembl IDs (btw, I think using gene symbols instead of IDs is not a good idea), and updated the gene names in my expression matrix. Now almost all of the feather genes are in my expression matrix. I then reran the pipeline, but was very surprised that the warnings did not go away, which did not make sense to me. It still helped to some extent, since I got 99 regulons, but the increase was marginal.
Solution 3: [Bug?] So I've thought that maybe the issue was coming from the fact that my expression matrix was containing extra genes that were NOT in the .feather files. So I've added this piece of code at the beginning of my script, to restrict my expression matrix to ONLY genes present in the .feather file:
```
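import pandas as pd

# ex_matrix is my expression matrix (genes on the row index in this snippet)
# Load the ranking database; its column names are the gene names
ranking_feather = pd.read_feather("dm6_v10_clust.genes_vs_motifs.rankings.feather")
# Keep only the genes that are also present in the ranking database
overlap_values = ex_matrix.index[ex_matrix.index.isin(ranking_feather.columns)].unique()
ex_matrix = ex_matrix.loc[overlap_values, :]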
```
This completely gets rid of the warnings in the pruning step, and I end up with 108 regulons. But this is not the expected behavior of the method, is it? I understand there can be issues if some genes of the feather database are missing from the expression matrix, but it should be OK to have extra genes in the expression matrix, shouldn't it? So why would this cause a warning, and limit the number of regulons generated?
Solution 4: Just to complete my post: even with all these updates, I could not get the regulons I was looking for. I was finally able to get them (156 regulons in total) by setting the auc_threshold=0.01 parameter in the prune2df function (instead of the default 0.05). I'm not sure what the real impact of this is, though, as I could not find a clear explanation of what this parameter does. Another way is to completely deactivate the pruning/filtering by using the filter_for_annotation=False argument in the prune2df function.
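For what it's worth, the pySCENIC documentation describes auc_threshold as the fraction of the ranked genome taken into account when calculating the area under the recovery curve. Below is a toy sketch of that idea in plain Python, under strong simplifying assumptions: `toy_recovery_auc` is a hypothetical helper written for this post, the ranks are made up, and the normalization is my own choice. It is not pySCENIC's implementation and only illustrates how changing the fraction changes which part of the ranking counts.

```python
def toy_recovery_auc(module_ranks, total_genes, auc_threshold):
    """Toy normalized area under a recovery curve.

    module_ranks : 0-based ranks of the module's genes for one motif (made-up data)
    total_genes  : number of genes in the ranking
    auc_threshold: fraction of the ranking that is taken into account
    """
    if not module_ranks:
        return 0.0
    # Only the top `auc_threshold` fraction of the ranking is considered.
    rank_cutoff = max(1, round(auc_threshold * total_genes))
    area = 0
    for rank in module_ranks:
        if rank < rank_cutoff:
            # A gene at rank r raises the curve by one at positions r .. rank_cutoff - 1.
            area += rank_cutoff - rank
    # Normalize by an upper bound on the area (cutoff width times module size).
    return area / (rank_cutoff * len(module_ranks))


# Same toy module, two different fractions of the ranking.
toy_ranks = [0, 5, 30, 400]
auc_default = toy_recovery_auc(toy_ranks, total_genes=1000, auc_threshold=0.05)
auc_strict = toy_recovery_auc(toy_ranks, total_genes=1000, auc_threshold=0.01)
```

With 0.05, the gene at rank 30 still contributes to the area; with 0.01, only the genes within the top 10 ranks do. So the parameter changes which genes can contribute to the enrichment score at all, rather than acting as a simple pass/fail cutoff on the regulons.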

Hope this can help other users. To the authors/maintainers: please give me your opinion on point 3, as this behavior really seems weird/unexpected to me.

caochch commented 5 months ago

Any progress? What's the expected number of modules or regulons in the ctx step (the number of rows in the output file)?

Flu09 commented 2 months ago

any news?

rrydbirk commented 2 months ago

Can confirm that following solutions 2 and 3 significantly increases the number of regulons! It also completely eliminated the warnings for me.

colin893 commented 2 months ago

Hi. I have an issue while running pyscenic on a zebrafish dataset. I downloaded the Stanaka motifs database and ran my pyscenic analysis. Just to add, I successfully ran pyscenic on both mouse and human datasets.

First I got an error caused by the version format of the database file, which I converted to v2 without problems. Then I generated the adj.csv file, which threw no error and is not empty. The problem arises with the command pyscenic ctx: I got the usual warnings "Less than 80% of the genes in Regulon for XXXX could be mapped to v2_zf1.genes_vs_motifs.rankings. Skipping this module." and at the end, my reg.csv file is empty.

So I imported the genes from the Stanaka feather file and intersected them with the genes from my expression matrix, just to make sure that there is a significant overlap between gene names. For lists of 23000 and 20000 genes respectively, I got roughly 15000 genes in common. Given this, I would expect some regulons to survive the pruning step. So I am wondering whether something else is causing this?
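One thing that might be worth checking: the warning text refers to the genes of one module at a time ("the genes in Regulon for XXXX"), so a large global overlap does not guarantee that every individual module passes. Here is a minimal sketch of a per-module check in plain Python. The helper `mapped_fraction`, the `modules` dictionary, and the gene names are all hypothetical; the 0.8 cutoff is taken from the warning message; and this is only a diagnostic idea, not pySCENIC's internal code.

```python
def mapped_fraction(module_genes, database_genes):
    """Fraction of a module's genes that are present in the ranking database."""
    module_genes = set(module_genes)
    if not module_genes:
        return 0.0
    return len(module_genes & set(database_genes)) / len(module_genes)


# Toy data: gene names found in the ranking database (hypothetical).
database_genes = {"geneA", "geneB", "geneC", "geneD", "geneE"}

# Toy modules: transcription factor -> target genes (hypothetical).
modules = {
    "TF1": ["geneA", "geneB", "geneC", "geneD", "geneE"],
    "TF2": ["geneA", "geneB", "geneX", "geneY", "geneZ"],
}

fractions = {
    name: mapped_fraction(genes, database_genes) for name, genes in modules.items()
}
# Modules below the 80% cutoff mentioned in the warning message.
below_cutoff = sorted(name for name, frac in fractions.items() if frac < 0.8)
```

In this toy example both modules share genes with the database, but only TF1 is fully covered; TF2 has just 2 of its 5 genes in the database and would fall below the cutoff.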

Thank you in advance for your time,

VincentGardeux commented 2 months ago

Hi @colin893,

I'm still following the updates on this issue, but so far did not see any answer from the devs/authors. :(

For your issue, did you try all the solutions I suggested in my post? In my case they substantially increased the number of regulons.

If you still lose all your regulons at the pruning stage, you can probably remove the pruning completely using filter_for_annotation=False. I have to admit that I tend to do that now, because my "preferred regulons" are always filtered out for reasons I don't understand.

Cheers

colin893 commented 2 months ago

Hi @VincentGardeux ,

When I posted, I had only verified that there is overlap between the gene names in my expression matrix and the gene names in the databases.

Thanks for your different hints. I just launched the ctx command with --no_pruning and I got 599 regulons. I guess there might be a way to keep the pruning, but since I successfully analyzed human/mouse data, my suspicion is that the default values used for these are too stringent for zebrafish analyses. I'll have to run tests and modify some threshold values to check that hypothesis.

I'll update my response depending on what I get, thanks for your help!

aertslab / pySCENIC

[BUG] Less than 80% of the genes [...] warning with cisTarget step + Pruning gets rid of many regulons #534