Empty file output from pyscenic ctx using original/modified databases

stanaka6 commented 3 years ago

Hi pySCENIC team,

Thank you for providing a great tool. I am running pyscenic using my original zebrafish data. However, I got an empty file output (reg.csv) from pyscenic ctx, which is a very similar situation in #177. After execution of the pyscenic ctx function, a lot of warnings were shown up like:

2021-08-30 10:12:59,909 - pyscenic.transform - WARNING - Less than 80% of the genes in tfdp1a could be mapped to zf1.genes_vs_motifs.rankings. Skipping this module.
[                                        ] | 0% Completed | 33.4s

or

2021-08-30 10:13:04,246 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for foxp2 could be mapped to zf1.genes_vs_motifs.rankings. Skipping this module.
[                                        ] | 0% Completed | 33.7s

There is no error message, and the empty output reg.csv means this:

Could you please provide me a solution to fix this issue?

More details are described below.

I ran this code:

!pyscenic ctx adj.tsv \
    {f_db_names} \
    --annotations_fname {f_motif_path} \
    --expression_mtx_fname {f_loom_path_scenic} \
    --output reg.csv \
    --mask_dropouts \
    --num_workers 20

zebrafish cisTarget databases were created by following this repo using zebrafish 5'UTR's up and downstream 10k bp genomic regions and motif information from JASPAR. The feather file I used for pyscenic ctx is zf1.genes_vs_motifs.rankings.feather.

  motifs   ABCA7 ABCD2   ABR ACADSB ACAP2 ACBD3 ACSF3 ACTC1 ACVR1C   AK6 AL627305.1 
1 MA0002.2  9891 20179 10742   2567  6236  6282 13648  5347  10663 16867        644
2 MA0003.4  7236 18069 10906  14978  5419 16347 16718  4296  12922  5931        233
3 MA0004.1 19676  6688 14874   8093 20146 18805 13125 18632   7730 19189      15629
4 MA0006.1 10872 17790  6219     50 11655 20120 14065  1775  17164 19464       6347
5 MA0007.3 10583 17451  3597  12343  9659 19054  5885  9608   5872 14260      14541

The head of the adjacencies matrix (adj.tsv):

        TF  target  importance
0  tfcp2l1     prl  253.933308
1  tfcp2l1   nppal  234.960120
2     ybx1    rpl8  158.685731
3     ybx1  rps15a  156.212955
4     ybx1   rpl11  156.103158

According to this comment, I have created zebrafish Motif2TF database by the following step: 1. find zebrafish orthologs by orthofinder of mouse genes; 2. substitute mouse genes to zebrafish ortholog genes in motifs-v9nr_clust-nr.mgi-m0.001-o0.0.tbl. The created db looks like this:

#motif_id         gene_name       motif_name  motif_description source_name source_version  motif_similarity_qvalue similar_motif_id  similar_motif_description orthologous_identity orthologous_gene_name orthologous_species  description
metacluster_89.24 ran               RAN         RAN               hdpi        2009            0                       None              None                        0.925926               ENSG00000132341       H. sapiens           gene is orthologous to ENSG00000132341 in H. sapiens (identity = 92%) which is directly annotated for motif
metacluster_14.17   si:ch73-238c9.1 C19orf25      C19orf25          hdpi          2009            0                       None              None                        0.669725               ENSG00000119559       H. sapiens           gene is orthologous to ENSG00000119559 in H. sapiens (identity = 66%) which is directly annotated for motif
hdpi__ACF           a1cf            ACF         ACF               hdpi        2009            0                       None              None                        0.892437               ENSG00000148584       H. sapiens           gene is orthologous to ENSG00000148584 in H. sapiens (identity = 89%) which is directly annotated for motif
metacluster_28.13 abcf2a          ABCF2       ABCF2             hdpi          2009            0                       None              None                        0.982484               ENSG00000033050       H. sapiens           gene is orthologous to ENSG00000033050 in H. sapiens (identity = 98%) which is directly annotated for motif

pyscenic version 0.11.2 running in conda environment

Any suggestions or comments would be really appreciated.

cflerin commented 2 years ago

I believe that using the updated pySCENIC >=0.11.3 will fix this.

stanaka6 commented 2 years ago

Hi @cflerin ,

I truly appreciate your answer. I may misunderstand but the latest version is 0.11.2, right? Did you mean that that issue is unable to resolve until the newer version is released? I have tried updating pyscenic but it's still 0.11.2.

Thank you for your help.

cflerin commented 2 years ago

Sorry @stanaka6! 0.11.2 is indeed the current version, I must have gotten mixed up with another issue.

Here it seems that all of your modules are dropped due to low overlap with the database. I notice that your adj.tsv has lower case gene names vs upper case in the database (at least the few that I can see). So you should check to make sure you're using the same gene symbols across both, otherwise there's no way that an overlap can be found.

stanaka6 commented 2 years ago

Hi @cflerin,

Thank you very much for your quick response.

Zebrafish gene names (symbols) are consist of both upper and lower cases. For example, ABR (ENSDARG00000059587; Chr5:62374092-62545913), abr (ENSDARG00000095180; Chr5:62611400-62642718) and MDFIC (ENSDARG00000074223; Chromosome 4: 6,339,572-6,416,414), etc. Therefore, my input data contains both upper and lower-case gene symbols. I am not sure if this causes the issues.

I manually checked overlaps and could see them (for example, TF "tbx3a" in both adj.tsv and Motif2TF db; tbx3a's target gene "dio2" in .feather cisTarget ranking database).

Also, I found there are only 71 warning messages when running pyscenic ctx function (described in my 1 st post), but I do not understand this is relevant to my problem.

Here is the link for my input files (except for our original expression matrix loom file): https://umbc.box.com/s/0dez16fxn4nwvvhojr3y9hz43dq5442n I would really appreciate it if you have a chance to look at them if possible.

One of my concerns is whether the zebrafish Motif2TF database is good. As described above, I identified zebrafish orthologs of mouse genes by Orthofinder, and substituted mouse genes to zebrafish ortholog genes in motifs-v9nr_clust-nr.mgi-m0.001-o0.0.tbl. A mouse gene often matches several zebrafish genes -- and vise versa(see below pic; this is Orthofinder's output).

Therefore, I made a list of mouse and zebrafish genes from the above data, and separate the rows by commas. Also, I removed the genes from the Motif2TF database if there are no matched zebrafish genes. I am not sure the way I performed was correct.

The below is the partial code that I used for creating the zebrafish Motif2TF database.

## import motifs-v9nr_clust-nr.mgi-m0.001-o0.0.tbl as csv ----
tbl <- read.csv("motifs-v9nr_clust-nr.mgi-m0.001-o0.0.csv", header = TRUE)

## Make zebrafish-mouse gene substitute list ----

# Extract gene symbol from "motifs-v9nr_clust-nr.mgi-m0.001-o0.0."
mm.genes <- tbl$gene_name

# Annotate ensemble ID
library(AnnotationDbi)
library(org.Mm.eg.db)

df <- data.frame(gene = mm.genes)
df$EnsembleID <- mapIds(org.Mm.eg.db, keys = df$gene, column = "ENSEMBL", keytype = "SYMBOL")

# Leave unique values 
unique.df <- df[!duplicated(df$gene), ]
colnames(unique.df) <- c("gene","mouse") # change colnames 

## Load Orthofinder output (above picture) ----
orthofinder <- read.csv("Mus_musculus.GRCm39.pep.all__v__Danio_rerio.GRCz11.pep.all.csv")

colnames(orthofinder) <- c("orthogroup", "mouse", "zebrafish") # change col names 

## Separate comma-separated values (mouse gene) 

separate.df <-  orthofinder %>%
  tidyr::separate_rows(zebrafish, sep = ",") %>%
  tidyr::separate_rows(mouse, sep = ",") %>%
  dplyr::group_by(mouse) %>%
  dplyr::summarise(zebrafish = paste0(sort(unique(na.omit(zebrafish))), collapse = ','))

# remove version numbers in mouse EnsembleID

separate.df$mouse <- sub("\\..*", "", separate.df$mouse) 

# Merge with the gene list from  "motifs-v9nr_clust-nr.mgi-m0.001-o0.0"

dfA <- unique.df %>%
  dplyr::left_join(separate.df)

## Add zebrafish ID to Motif2TF table ---

dfA <- dfA[, -which(names(dfA) %in% "mouse")] # Remove mouse ensemble ID
colnames(dfA) <- c("gene_name", "zebrafish") # Change col names

merge <- merge(dfA, tbl, by = "gene_name", all = TRUE)

## separate comma-separated rows in zebrafish ID ----

merge2 <- tidyr::separate_rows(merge, zebrafish, sep = ",")

# remove version numbers in zebrafish EnsembleID
merge2$zebrafish <- sub("\\..*", "", merge2$zebrafish) 

## Remove NA containing rows

completeFun <- function(data, desiredCols) {
  completeVec <- complete.cases(data[, desiredCols])
  return(data[completeVec, ])
}
merge2 <- completeFun(merge2, "zebrafish")

## Then, get the zebrafish gene name (symbols) and add them as a column...

Any suggestions would be truly appreciated.

Arinze-BioX commented 2 years ago

I am having a similar issue. I would appreciate it if a solution is given to this. Thank you!

frucelee commented 2 years ago

I am having a similar issue. I would appreciate it if a solution is given to this. Thank you!

Hi, I face the same problem. Do you have some solution for this? Really appreciate you if you have some comments. Thanks in advance,

Arinze-BioX commented 2 years ago

Hi Frucelee, I have solved my own issue. I found that the problem was that I had incorrectly formatted my initial loom file. In your loom file, you can check to ensure that the Var and Obs are correctly set. You can also check that you're using the right organism and database. Best wishes.

MubasherMohammed commented 2 years ago

Hi everyone, did someone manage to solve this issue? @Arinze-BioX it would be great to share on how to check on the right format of the loom file. thanks

Arinze-BioX commented 2 years ago

Hi @MubasherMohammed , I think what I did was basic wrangling to check that I have the right number of genes and cells, and that I haven't wrongly transformed my expression matrix. That is, if I remember correctly, the Var are the genes and the Obs are the Cells. Then I checked that I used the right specie, genome build (e.g. mm9 vs mm10) and databases based on the explanations here: http://htmlpreview.github.io/?https://github.com/aertslab/SCENIC/blob/master/inst/doc/SCENIC_Setup.html

You can choose the right database by browsing this list of databases: https://resources.aertslab.org/cistarget/

I have moved on from this project, so I vaguely remember the details, and I hope my brief explanation here helps.

If you keep having issues after trying this, please share details of your issue and I can dig deeper and try to help.

MubasherMohammed commented 2 years ago

Thank you @Arinze-BioX for your explanation. I also went through check for my files for pyscenic ctx regulons enrichment seems all good. I open this issue #389 hence and hope to get some insights on fixing it Thanks again

bio-visualisation commented 2 years ago

Hi @MubasherMohammed , I think what I did was basic wrangling to check that I have the right number of genes and cells, and that I haven't wrongly transformed my expression matrix. That is, if I remember correctly, the Var are the genes and the Obs are the Cells. Then I checked that I used the right specie, genome build (e.g. mm9 vs mm10) and databases based on the explanations here: http://htmlpreview.github.io/?https://github.com/aertslab/SCENIC/blob/master/inst/doc/SCENIC_Setup.html

You can choose the right database by browsing this list of databases: https://resources.aertslab.org/cistarget/

I have moved on from this project, so I vaguely remember the details, and I hope my brief explanation here helps.

If you keep having issues after trying this, please share details of your issue and I can dig deeper and try to help.

@Arinze-BioX @stanaka6 @MubasherMohammed Would you please explain how did you solve your problem regarding pyscenic ctx step empty output issue? I have raised one issue here: https://github.com/aertslab/pySCENIC/issues/407#issue-1301714888

aertslab / pySCENIC

Empty file output from pyscenic ctx using original/modified databases #325