bjmt / universalmotif

Motif manipulation functions for R.
GNU General Public License v3.0
25 stars 8 forks source link

Cannot get scan_sequences to report p-values #12

Closed OXB-DC closed 3 years ago

OXB-DC commented 3 years ago

Hi, Thanks for developing the package! But I'm having a problem with getting scan_sequences to calculate p-values.

I'm trying to run scan_sequences with the motifs argument as a list of PWMatrixList objects, and sequences is a DNAStringSet. Its runs ok with calc.pvals = FALSE but when I set it to TRUE I get the following error message:

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘convert_motifs’ for signature ‘"NULL"’

The code is: UM_scan_res_test <- scan_sequences(motifs=test_PWM,sequences = test_seq,threshold=0, threshold.type = "logodds", RC=TRUE,calc.pvals = TRUE,nthreads = 22)

I have tried using convert_motifs to convert the list of PWMatrixLists into a list of universalmotif objects and still get the same error.

I have tried to work around the issue by calculating p-values separately using motif_pvalue and can partially manage this with my data set. Unfortunately, there is a complicating secondary issue in that the results of scan_sequences (a DFrame) do not incorporate the motif ID (only '@listData$motif' which corresponds to '@listData$@name' of my PWMatrixList). It would be really helpful if it also outputs the ID slot (and this is output using as.data.table) which I think will always be unique. Where this all causes me headaches is the JASPAR database SPLICE collection where the motif names are not unique but the ID slot is. This means that I can output the results of the scan_sequences function (with scores), then use the scores as input to the motif_pvalue but I cannot supply the appropriate PWM in this particular case (it could be any of 6 because they are all named the same!).

I hope this makes sense. In the short term I will just not calculate p-values using univeralmotif for the JASPAR SPLICE collection.

As another aside I have deliberately set the threhold very low as I wanted to calculate a relative threshold in the same way TFBSTools does (and compare results). This uses a percentage (quantile) which is not percentage of the maximum score but a percentage of the range (from minimum to maximum)...this is digressing somewhat from the main issue though!

bjmt commented 3 years ago

Hi,

Thanks for the report. The calc.pvals is a recent addition so I appreciate the feedback.

I've found the source of the error message for scan_sequences() and corrected it here on github. It will take a few days for it to get to Bioconductor if that's where you're installing the package from.

As for the error you are getting when trying to convert the PWMatrixList: unfortunately I cannot replicate this one. If you wouldn't mind, could you share the object? (You can use the dput() function and paste the results.) Totally up to you, I understand if you want to keep it private.

As for the motif ID: first, the IDs from the PWMatrixList objects are carried over to universalmotif objects under the title Alternate name (the slot is called altname). As for it not being present in the DFrame: I predicted things like this could be a problem (i.e. using lists of motifs with duplicate names, or having the altname slot hold the unique identifier), so I added an extra integer index column (motif.i). You can then use this to associate the motif IDs with each row of the scan_sequences() output. For example, if you are working with a PWMatrixList:

IDs <- vapply(my_pwm_list, function(x) x@ID, character(1))
scan_output$ID <- IDs[scan_output$motif.i]

Finally, regarding threshold calculation: when using threshold.type = "logodds", the final threshold is calculated as the maximum possible score multiplied by the value of the threshold parameter. I've gone with this approach instead of getting the quantile of the range of possible scores because some motifs can have negative infinity minimum scores, and anyways generally people are only interested in positive scores (at least that's my impression -- doesn't a negative score basically mean it's more likely to be a background hit?). However, it is still possible to get the kind of threshold you want using the motif_score() function:

motifs <- convert_motifs(my_pwm_list)

# No normalization (requires that your motifs do not have negative infinity scores):
scores <- vapply(motifs, function(x) motif_score(x, allow.nonfinite = TRUE, threshold.type = "total"), numeric(1))
# With normalization:
scores2 <- vapply(motifs, function(x) motif_score(x, allow.nonfinite = FALSE, threshold.type = "total"), numeric(1))

# Now scan:
scan_sequences(my_pwm_list, my_sequences, threshold = scores, threshold.type = "logodds.abs", RC = TRUE, calc.pvals = TRUE)

Let me know if this solves your issues. If you don't want to pursue the second convert_motifs() error then I'll go ahead and close the issue.

Edit: typo

OXB-DC commented 3 years ago

Thanks for the really quick reply and update to the code with calc.pvals = TRUE. I have updated my package directly from github and it now appears to work with a simple test example.

There was no error when when trying to convert a PFMatrixList, I just tried converting my PFMatrixList object to a universalmotif object before running scan_sequences (and having now looked at the source, and read the help in more detail realised this wouldn't have made any difference anyway, and this is done automatically).

I couldn't get your exact code to work but it pointed me in the right direction. I managed to use the scores_motif function to calculate the min and max scores and from there calculate a different absolute threshold for reach element (in the same way that TFBSTools::searchSeq does when using a percentage threshold).

Maybe keep the issue open for a day or two in case I encounter any further issues. Thanks for the help anyway. It was much appreciated.

bjmt commented 3 years ago

Yes, I can see why the code wouldn't work --- I forgot to perform the quantile calculation, sorry about that.

Glad I was of help. I'll close the issue in a couple of days.

OXB-DC commented 3 years ago

So this is mostly now working but I am having problems with very specific input PWMs causing RStudio to abort (requires restart).

I have copied the source for scan_sequences (and all the necessary accessory functions), defined the input variables and gone through it line by line with an example of a problem PWM (this is the only 1 I have identified so far but there are others). Its been quite laborious. Anyway, I get to line 361, and this causes the abort:

    out$pvalue <- motif_pvalue(motifs[out$motif.i], out$score, use.freq = use.freq,
      nthreads = nthreads, allow.nonfinite = allow.nonfinite, k = motif_pvalue.k)

The dput() for the PWM is:

new("universalmotif", name = "LM40", altname = "CN0040.1", family = "", 
    organism = "Homo sapiens", motif = structure(c(-2.12395465449616, 
    -0.888956530335224, -2.05380206183545, 1.68898533119274, 
    -2.06039732558369, 1.53016323143546, -2.4646618883504, -0.521527543882626, 
    -1.50058150082382, 1.39482961992938, -4.96851513448883, 0.0381893039501086, 
    -1.63302050188377, 1.5197916427118, -3.61387808386758, -0.43016412966062, 
    1.62835639951109, -2.05879318255831, -2.54614282233213, -1.05798575660883, 
    -2.19044153063008, 0.189008780376107, -1.78294918891624, 
    1.33249264619988, -2.12395465449616, -2.05879318255831, -2.24465522706756, 
    1.8434398606671, 0.359930318200709, -2.76692627201666, 1.22643877996926, 
    -3.42379940917238, 1.73520175977255, -3.74649583346093, -1.15278060668211, 
    -3.61181747898124, -2.26014112173258, 1.6820616555792, -3.1645754647309, 
    -1.12806042801111, -0.271042966536174, -2.49045395065771, 
    -1.39546859527949, 1.4869826000752, -3.96370730181868, -1.93950687517175, 
    -3.61387808386758, 1.96201785350123, -3.96370730181868, 1.50934495201134, 
    -4.57733523106496, 0.131114698865368, 1.90800806072115, -4.2157840977931, 
    -3.80124574139517, -3.82806355006355, 1.78363325477044, -3.39297024929661, 
    -1.59779756016926, -3.82806355006355, 1.04147061270769, -4.91578062160256, 
    -5.50691437540247, 1.01249399843676, -1.72858934257226, -4.91578062160256, 
    1.75891374079746, -3.25748347951178, -0.75860991337402, -4.2157840977931, 
    1.61171365725559, -2.84969132280715, -0.307475314489587, 
    -1.88336938508686, 1.3941740309422, -2.73589128754153, 1.83467689942055, 
    -3.96214190038266, -2.38753788163716, -2.97324259102268, 
    0.134054393518394, -2.33177854740234, 1.23255940854213, -2.12146844533144, 
    -1.45900573423771, 0.75551474042414, -1.99541050743711, 0.848849376675759, 
    -1.34101748804916, -0.782441237407515, -2.82247225097757, 
    1.63711857127716), .Dim = c(4L, 23L), .Dimnames = list(c("A", 
    "C", "G", "T"), c("T", "C", "Y", "C", "A", "Y", "T", "R", 
    "A", "C", "T", "T", "Y", "A", "A", "W", "G", "G", "G", "A", 
    "R", "Y", "T"))), alphabet = "DNA", type = "PWM", icscore = 21.7784742398783, 
    nsites = numeric(0), pseudocount = 1, bkg = c(A = 0.253481102585962, 
    C = 0.253196930946292, G = 0.263285024154589, T = 0.230036942313157
    ), bkgsites = numeric(0), consensus = "TCYCAYTRACTTYAAWGGGARYT", 
    strand = "+", pval = numeric(0), qval = numeric(0), eval = numeric(0), 
    multifreq = list(), extrainfo = c(comment = "-", consensus = "TCCCATTGACTTCAAWGGGAGYT", 
    medline = "17442748", collection = "CNE", species.9606 = "Homo sapiens", 
    acc = ""), gapinfo = new("universalmotif_gapped", isgapped = FALSE, 
        gaploc = numeric(0), mingap = numeric(0), maxgap = numeric(0), 
        extrapvals = numeric(0)))

And it aborts RStudio regardless of whether I change the score or nthreads (FYI out$score was originally 30.094). All other arguments are defaults. Can you reproduce this on your side? And if so can you help to resolve?

Kind Regards, Danny

OXB-DC commented 3 years ago

Oh I also forgot to mention there also looked like there were some possible issues with missing braces and if statements. Not sure if this matters e.g. line 194-197 and in a few other places

OXB-DC commented 3 years ago

Went a bit further down the rabbit hole as far as

motif_pvalue_cpp <- function(motifs, bkg, scores, k = 6L, nthreads = 1L, allow_nonfinite = FALSE) {
    .Call('_universalmotif_motif_pvalue_cpp', PACKAGE = 'universalmotif', motifs, bkg, scores, k, nthreads, allow_nonfinite)
}

which is getting into C++ territory and presently beyond me!

bjmt commented 3 years ago

Thanks for the detailed reports, and I'm so sorry you're having all this trouble. I've been able to get crashing happening by playing around with motif_pvalue() and your motif. Something must be going wrong in the c++ function (as you correctly pointed out). I will investigate! But I won't get an answer for you for a few hours at least.

Again I really appreciate the feedback.

OXB-DC commented 3 years ago

So I tried changing k, and for k=4 through to k=7 it reported a value of 0 For k=8 it causes RStudio to abort For k=9 it reports a value of 0 For k=10, 3.033576e-12 For k=11, 2.234804e-11 For k=12, 2.693484e-11 For k=13, 2.693484e-11 And after that it gives a warning and reverts to k=13

I am presently rerunning all my scans with k=12...may take a while...will let you know

OXB-DC commented 3 years ago

It works fine with k=12

bjmt commented 3 years ago

Think I've found the bug. When k resulted in the motif being split into more than two (so for your case when k<12), there was a memory access bug which resulted in erratic behaviour and occasional crashes.

I've patched the bug now, and the Bioconductor release will get it in a day or two. I'll leave this issue open for another day or two in case you still have problems.

bjmt commented 3 years ago

Closing. As far as I know the bug is fixed in version 1.8.1 and above.