OpenMS / OpenMS

The codebase of the OpenMS project
https://www.openms.de
Other
478 stars 318 forks source link

Searchengine parameter names not agreeing #7032

Open jpfeuffer opened 1 year ago

jpfeuffer commented 1 year ago

Sageadapter uses missed_cleavages, MSGF max_missed_cleavages etc.

See also #3255

jpfeuffer commented 1 year ago

Same for tolerances etc., even though the tolerances can be asymmetric one could at least use the same prefix

jpfeuffer commented 1 year ago

Ok, even more. Let's just say everything in SageAdapter should be renamed.

timosachsenberg commented 1 year ago

I think I mainly looked at the CometAdapter names. https://github.com/OpenMS/OpenMS/blob/develop/src/topp/CometAdapter.cpp#L200 But yes, we are not very consistent in OpenMS (I think there is a separate issue for that). Happy to fix names but "everything in SageAdapter should be renamed" is not really helpful.

jpfeuffer commented 1 year ago

But if I look at CometAdapter, then why not "minimum_peaks" or "spetrum_batch_size" or "max_variable_mods_in_peptide", or "min_peptide_length", or "max_peptide_length"? Together with parameters shared with MSGF, I think it is indeed the majority of param names that should be changed.

timosachsenberg commented 1 year ago

spectrum_batch_size is not batch_size in sage (which is number of files in parallel) and squeezing different parameters just to have the same name is imho worse (even if that means it is more difficult to replace one search engine for another).

The other (main) reason is that I just don't think the names in comet and msgf plus are very well-chosen. E.g. https://github.com/OpenMS/OpenMS/blob/develop/src/openms/source/ANALYSIS/ID/SimpleSearchEngineAlgorithm.cpp#L80-L149 is imho better. But if we now start changing comet or msgfplus parameters, we will also break a lot of stuff. So for now I just stayed with the names used in sage (which is then also easier map to the documentation of sage)

I am willing to make changes but I just don't find the best solution obvious.

jpfeuffer commented 1 year ago

But spectrum batch size is the same as bucket size (according to the current parameter description).

jpfeuffer commented 1 year ago

Yeah I don't know. I have a strong feeling about consistency and vote for as many agreeing names as possible.

lazear commented 1 year ago

But spectrum batch size is the same as bucket size.

This is incorrect. Sage has no concept of a spectrum batch size. Sage reads batch_size files in parallel, completely, and searches all spectra at once. It then loads the next batch_size files, continuing until all files have been loaded and searched.

bucket_size is a parameter that allows tuning the internal fragment index data structure. A high number places more fragments in one bucket (wider range of fragment masses in one bucket), and a smaller number places less fragments in one bucket (small range of fragment masses in one bucket).

There is inherently a trade-off between fragment_tol and bucket_size in search speed due to cache latency/binary search. Larger fragment bucket sizes = fewer total buckets (n_buckets * bucket_size = total number of fragments).

I recommend using 8192 for high-res MS/MS. This means that most fragment queries will only have to search 1-3 buckets. Using the same value for low-res MS/MS will be fine, but will search much slower, since many buckets have to be searched. I recommend using 32k-64k for bucket_size for low-res MS/MS.

My understanding is that MSFragger uses a different scheme for internal bucketing, and decides on the width and number of bins for you. Eventually I would like to auto-set this parameter and make it opaque to the user, but for now, it is made visible and can be used to tune performance. If you have logic determining instrument type/low-vs-high res, then you could set this for the user.