aertslab / RcisTarget

RcisTarget: Transcription factor binding motif enrichment

RcisTarget::addSignificantGenes error #27

Open joel-tuberosa opened 2 years ago

joel-tuberosa commented 2 years ago

Hello,

I would like to perform an enrichment analysis with the following data:

target_genes - a vector of gene names corresponding to the tested set

motif_rankings - the loaded database mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings.feather downloaded from here

motifAnnotations_mgi - annotation data loaded from the package with data(motifAnnotations_mgi)

I am running the following commands:

motifs_AUC <- calcAUC(target_genes, motif_rankings)
motifEnrichmentTable <- addMotifAnnotation(motifs_AUC,  motifAnnot=motifAnnotations_mgi)
motifEnrichmentTable_wGenes <- addSignificantGenes(motifEnrichmentTable, 
                                                   geneSets=target_genes,
                                                   rankings=motif_rankings, 
                                                   nCores=1,
                                                   method="aprox")

And I got this error message from the last command:

Error in data.frame(row.names = motifNames, rankings[, geneSet]) : 
  duplicate row.names: 16388, 13294, 17330, 17112, 4188, 16530, 16844, 18101, 17737, 18186, 16084, 18886, 11338, 12655, 16219, 18026, 15061, 16371, 14701, 17214, 18246, 16884, 14225, 6681, 18323, 17761, 17628, 16022, 17015, 18869, 15726, 16565, 16104, 14604, 16384, 15421, 16625, 16326, 15902, 17124, 18335, 18696, 9916, 15847, 14092, 17177, 15993, 17593, 16026, 18152, 14512, 16552, 16644, 19879, 18012, 17748, 18443, 16515, 17100, 17378, 17796, 19198, 18076, 16489, 18470, 14162, 17199, 18253, 16231, 17396, 18081, 17258, 15458, 17295, 15894, 17249, 18312, 17144, 13580, 8484, 16764, 15581, 12946, 19774, 15787, 18527, 18199, 18438, 17575, 17425, 16641, 11742, 18372, 17682, 16088, 17187, 15967, 18070, 17644, 14814, 14675, 17816, 18090, 14718, 17172, 14284, 18289, 18512, 16494, 17723, 15823, 18852, 14540, 17799, 15400, 11594, 17008, 18074, 16253, 17293, 18373, 14628, 13187, 18236, 14654, 17097, 16927, 15662, 11932, 17926, 18632, 18596, 17650, 17991, 17725, 16096, 16249, 10919, 17093, 1

Do you have an idea how to fix this?

Thank you in advance.

Joël

ZYT-ZhangYunTao19941116 commented 1 year ago

on 17 Aug

I encountered the same problem. After reviewing the source code, I found that the cause was the rankings file loaded as motif_rankings, "mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings.feather". It has a lot of duplicate motif names in it. The solution is to use the old file, "mm9-tss-centered-10kb-10species.mc9nr.feather", from https://resources.aertslab.org/cistarget/databases/old/mus_musculus/mm9/refseq_r45/mc9nr/gene_based/

davidsanin commented 1 year ago

@joel-tuberosa, did you get anywhere with this other than changing to an old annotation file? I am having the exact same error.

ZYT-ZhangYunTao19941116 commented 1 year ago

I just changed to the old file and then everything went well

jdenavascues commented 1 year ago

I think I know what the problem is: the new and old versions of the databases store the column with the motif names in different positions. In the old databases it is the first column (column name 'features'), while in the new ones it is the last (column name 'motifs').

Unfortunately, the code for 03_addSignificantGenes.R assumes that the first column contains the motif names (my comments):

.getSignificantGenes <- function(geneSet,
                                 rankings,
                                 signifRankingNames=NULL,
                                 method="iCisTarget",
                                 maxRank=5000,
                                 plotCurve=FALSE,
                                 genesFormat=c("geneList", "incidMatrix"),
                                 nCores=1,
                                 digits=3,
                                 nMean=50)
{...
  # the motifRankings S4 object becomes a dataframe
  rankings <- getRanking(rankings)
  # the 'indices' are obtained from the FIRST column!!!
  indexCol <- colnames(rankings)[1]
  ...
  # this will give you now a series of ranking values, as character... and not necessarily unique
  motifNames <- as.character(unlist(rankings[,indexCol]))
  # now you get repeated row.names as you have a list of numbers instead of unique motif names:
  gSetRanks <- data.frame(row.names=motifNames, rankings[,geneSet])
  # and this is where the error originates
  ...
}
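The failure mode in the excerpt above can be reproduced without RcisTarget. This is a minimal sketch in base R, assuming a made-up table `toy_rankings` (hypothetical data, not the real database) laid out like the new files, with gene columns first and the motif names last:

```r
# Toy stand-in for a "new"-layout rankings table (hypothetical data):
# gene columns first, motif names in the last column.
toy_rankings <- data.frame(
  geneA  = c(10L, 10L, 3L),
  geneB  = c(1L, 2L, 3L),
  motifs = c("motif_1", "motif_2", "motif_3"),
  stringsAsFactors = FALSE
)

# Position-based lookup, as in the excerpt above: this picks "geneA".
indexCol <- colnames(toy_rankings)[1]
motifNames <- as.character(unlist(toy_rankings[, indexCol]))

# The values taken as names are ranks, and ranks repeat across motifs,
# so data.frame() rejects them as row names.
result <- tryCatch(
  data.frame(row.names = motifNames, toy_rankings[, "geneB"]),
  error = function(e) conditionMessage(e)
)
print(result)
```

The captured message should be of the same "duplicate row.names" kind as the one in the original report, which is consistent with the numbers listed there being ranking values rather than motif names.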

I think this was intended to be handled before, within importRankings, where it does:

indexCol <- intersect(allColumns, c('motifs', 'tracks', 'features'))#  [1]
if(verbose) message("Using the column '", indexCol, "' as feature index for the ranking database.")

So in principle the lookup is independent of position, but I think indexCol is not passed on to cisTarget. It is also clear from the comment that the motif names are expected to be in the first column of the data frame.
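To illustrate what a name-based lookup does with both layouts, here is a small sketch on toy data frames. `find_index_col`, `old_layout` and `new_layout` are hypothetical names made up for this example and are not part of RcisTarget:

```r
# Hypothetical helper (not part of RcisTarget): find the index column by
# name instead of by position.
find_index_col <- function(df, candidates = c("motifs", "tracks", "features")) {
  hit <- intersect(colnames(df), candidates)
  if (length(hit) != 1L) {
    stop("expected exactly one index column, found ", length(hit))
  }
  hit
}

# Toy tables mimicking the two layouts described above (made-up data).
old_layout <- data.frame(features = c("m1", "m2"), geneA = 1:2, geneB = 2:1)
new_layout <- data.frame(geneA = 1:2, geneB = 2:1, motifs = c("m1", "m2"))

old_idx <- find_index_col(old_layout)  # index column is first
new_idx <- find_index_col(new_layout)  # index column is last
print(c(old = old_idx, new = new_idx))
```

Failing loudly when no candidate column is found is a design choice for this sketch; it avoids silently falling back to whatever column happens to be first.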

However, importRankings does not print the message I would expect from this code. I have been using the Drosophila motifRankings, both "new" and "old". When I import the old one, I get the expected message:

> motifRankings_old <- importRankings("resources/motifdbs/old/dm6-5kb-upstream-full-tx-11species.mc8nr.feather")
Using the column 'features' as feature index for the ranking database.

But with the new, I get:

> motifRankings_new <- importRankings(".../.../dm6-5kb-upstream-full-tx-11species.mc8nr.genes_vs_motifs.rankings.feather")
Using the column '128up' as feature index for the ranking database.

'128up' is the name of the first Drosophila gene by alphanumeric ordering... but this cannot be the result of intersect(allColumns, c('motifs', 'tracks', 'features'))... I must be missing something ¯\_(ツ)_/¯

Anyway, the solution is to place the last column of the new database at the beginning before running cisTarget:

motifRankings_new@rankings <- dplyr::relocate(motifRankings_new@rankings, motifs)
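For readers who prefer not to depend on dplyr, the same reordering can be written in base R. This is a sketch on a toy data frame (`toy` is hypothetical data); for a real database the table would be `motifRankings_new@rankings` as in the line above, assuming it supports ordinary column indexing:

```r
# Toy table in the "new" layout (made-up data): motif names in the last column.
toy <- data.frame(
  geneA  = c(10L, 10L),
  geneB  = c(1L, 2L),
  motifs = c("m1", "m2"),
  stringsAsFactors = FALSE
)

# Move "motifs" to the front and keep the other columns in their original order.
reordered <- toy[, c("motifs", setdiff(colnames(toy), "motifs")), drop = FALSE]

# Sanity check: the first column now holds the motif names, with no repeats.
first_col <- reordered[[1]]
stopifnot(colnames(reordered)[1] == "motifs", !anyDuplicated(first_col))
```

Running the same sanity check on the real table after relocating the column is a cheap way to confirm the fix before calling cisTarget or addSignificantGenes.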

Hope this helps.

davidsanin commented 1 year ago

Anyway, the solution is to place the last column of the new database at the beginning before running cisTarget:

motifRankings_new@rankings <- dplyr::relocate(motifRankings_new@rankings, motifs)

This does it! Thanks for the advice!