Open jpfeuffer opened 1 year ago
Same for tolerances etc., even though the tolerances can be asymmetric one could at least use the same prefix
Ok, even more. Let's just say everything in SageAdapter should be renamed.
I think I mainly looked at the CometAdapter names. https://github.com/OpenMS/OpenMS/blob/develop/src/topp/CometAdapter.cpp#L200 But yes, we are not very consistent in OpenMS (I think there is a separate issue for that). Happy to fix names but "everything in SageAdapter should be renamed" is not really helpful.
But if I look at CometAdapter, then why not "minimum_peaks" or "spetrum_batch_size" or "max_variable_mods_in_peptide", or "min_peptide_length", or "max_peptide_length"? Together with parameters shared with MSGF, I think it is indeed the majority of param names that should be changed.
spectrum_batch_size is not batch_size in sage (which is number of files in parallel) and squeezing different parameters just to have the same name is imho worse (even if that means it is more difficult to replace one search engine for another).
The other (main) reason is that I just don't think the names in comet and msgf plus are very well-chosen. E.g. https://github.com/OpenMS/OpenMS/blob/develop/src/openms/source/ANALYSIS/ID/SimpleSearchEngineAlgorithm.cpp#L80-L149 is imho better. But if we now start changing comet or msgfplus parameters, we will also break a lot of stuff. So for now I just stayed with the names used in sage (which is then also easier map to the documentation of sage)
I am willing to make changes but I just don't find the best solution obvious.
But spectrum batch size is the same as bucket size (according to the current parameter description).
Yeah I don't know. I have a strong feeling about consistency and vote for as many agreeing names as possible.
But spectrum batch size is the same as bucket size.
This is incorrect. Sage has no concept of a spectrum batch size. Sage reads batch_size
files in parallel, completely, and searches all spectra at once. It then loads the next batch_size
files, continuing until all files have been loaded and searched.
bucket_size
is a parameter that allows tuning the internal fragment index data structure. A high number places more fragments in one bucket (wider range of fragment masses in one bucket), and a smaller number places less fragments in one bucket (small range of fragment masses in one bucket).
There is inherently a trade-off between fragment_tol
and bucket_size
in search speed due to cache latency/binary search.
Larger fragment bucket sizes = fewer total buckets (n_buckets * bucket_size = total number of fragments).
I recommend using 8192 for high-res MS/MS. This means that most fragment queries will only have to search 1-3 buckets. Using the same value for low-res MS/MS will be fine, but will search much slower, since many buckets have to be searched. I recommend using 32k-64k for bucket_size
for low-res MS/MS.
My understanding is that MSFragger uses a different scheme for internal bucketing, and decides on the width and number of bins for you. Eventually I would like to auto-set this parameter and make it opaque to the user, but for now, it is made visible and can be used to tune performance. If you have logic determining instrument type/low-vs-high res, then you could set this for the user.
Sageadapter uses missed_cleavages, MSGF max_missed_cleavages etc.
See also #3255