UWPR / Comet

An tandem mass spectrometry (MS/MS) sequence database search tool.
https://uwpr.github.io/Comet/
Apache License 2.0
42 stars 13 forks source link

Impact of num_results and num_output_lines on execution time #42

Closed di-hardt closed 8 months ago

di-hardt commented 1 year ago

Hey,

I currently using Comet in an unusual way, I guess by using a FASTA composed of several 100000 target/decoy-peptides for only one spectrum. I'm interested in a scoring for all of those peptides. According to the documentation num_results and num_output_lines are limited to 100. Because both are not getting validated I increased them step by step which has an interesting impact on the execution time.

execution-times

I think that is one of reason for the limitation in the first place?
Anyway, I created a small repository with a Python script and some data to reproduce this: https://github.com/di-hardt/comet-result-limit

Would it be possible to get a scoring for all peptides without too much work?

Best, Dirk

jke000 commented 1 year ago

Hi Dirk. num_results and num_output_lines are not limited to 100 as it looks like you've discovered with your analysis. However in the code, there is an index variable that was defined as a "short" which does effectively limit the max value of num_results to SHRT_MAX or 32767 under linux otherwise the result is a segfault. Changing that variable to an "int" is a quick fix to address this which I just tested. Does your chart imply that you were able to complete searches with num_results set to 100000 and larger?

Presumably you would also want to add minimum_xcorr = -1000 to have Comet report all scored peptides, including those that score less than 0 which are normally filtered out.

I'll incorporate this short to int change in the next future release. In the meantime, here's a linux binary of release 2023.01.2 that incorporates this which should allow you to run searches with the parameter values set to values larger than 32K. Is this what you're looking for in the short term? I tested this with your data/database and can get out 100000 output lines with both parameter set to 100000. I am still curious how you were able to generate your chart run times without such a fix.

As for impact on run times, I'd have to spend time profiling the code to understand why run times slow down significantly when these parameter values are set to very large numbers. There's has to be some inefficient loop that I haven't been able to identify just browsing the code these past few minutes. I'll leave this issue open to remind me to investigate this in the future.

di-hardt commented 1 year ago

Hey, no no the documentation says it is limited. At least here: https://uwpr.github.io/Comet/parameters/parameters_202301/num_output_lines.html

... “num_results” which itself is limited to a maximum value of 100

I just wanted to show it is not limited but a (much) higher value has a huge impact on the runtime.

The num_results was really a short? Comet didn't complain about larger numbers in my benchmark above. In my actual analyse pipeline I limit num_results to 10000 otherwise the runtime it too high. In the meantime I checked the result files of the benchmark for 100000 and 336860. They are empty, I guess that is due to num_result was a short.

With your provided binary and minimum_xcorr = -1000 the files contain results know! But it seams the run times for 100000 and 336860 are worse now. I need to check it on a different machine.

di-hardt commented 1 year ago

Okay I tested on the same machine as before and the execution time are indeed higher for the large numbers. execution-times

jke000 commented 8 months ago

I'm going to close this issue as this isn't a use case that I plan on optimizing for. If this is a search you are going to be performing often and would like to optimize, I would start at CometSearch::StorePeptide(). Look at the end of that function where it specifies "Get new lowest score". Currently the code uses a simple for loop through all stored peptide matches to find the lowest scoring stored entry which is replaced if the next peptide scores higher. For extremely large number of output lines, you might want to evaluate replacing the for-loop with something much more efficient (maybe apply a quicksort) when the number of stored results become huge (aka 100,000 or 300,000 entries).