bigbio / quantms

Quantitative mass spectrometry workflow. Currently supports proteomics experiments with complex experimental designs for DDA-LFQ, DDA-Isobaric and DIA-LFQ quantification.
https://quantms.org
MIT License

Defining more enzymes in OpenMS Comet adaptor #354

Open ypriverol opened 6 months ago

ypriverol commented 6 months ago

Description of feature

Currently, @timosachsenberg @jpfeuffer, the Comet adapter only supports 'Asp-N,Chymotrypsin,CNBr,no cleavage,unspecific cleavage,Trypsin,Arg-C,Lys-C,Lys-N,PepsinA,Trypsin/P,glutamyl endopeptidase'. However, Comet has a way to define more enzymes through its parameter file: https://uwpr.github.io/Comet/parameters/parameters_202301/search_enzyme_number.html. How can we use that to define, for example, Lys-C/P? Currently Lys-C does not work because the msgf+ processor changes it to Lys-C/P, and Comet does not support that.
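For context, the Comet parameter file ends with an enzyme table under `[COMET_ENZYME_INFO]`, and `search_enzyme_number` selects a row of that table by number. Below is a minimal sketch of what an extra row for Lys-C/P could look like, assuming the usual column layout (number, name, sense, cut residues, no-cut residues). The row number 12 and the helper `format_enzyme_row` are hypothetical, chosen for illustration only; they are not part of Comet or OpenMS.

```python
def format_enzyme_row(number, name, cut_after, cut_residues, no_cut_residues="-"):
    """Render one row of a Comet [COMET_ENZYME_INFO] table (assumed column layout)."""
    # Sense column: 1 = cleave C-terminal of the residue, 0 = cleave N-terminal.
    sense = 1 if cut_after else 0
    return f"{number}.  {name:<22} {sense}      {cut_residues:<11} {no_cut_residues}"


# Hypothetical extra entry: Lys-C/P cuts after K with no proline restriction ("-").
row = format_enzyme_row(12, "Lys_C/P", True, "K")
print(row)
```

With such a row appended to the table, setting `search_enzyme_number = 12` in the same file would select it. The adapter would still have to write that row and number, which is the part this issue asks about.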

timosachsenberg commented 6 months ago

quick fix: adding the comet id here https://github.com/OpenMS/OpenMS/blob/develop/share/OpenMS/CHEMISTRY/Enzymes.xml#L142-L163 and adding the output here https://github.com/OpenMS/OpenMS/blob/develop/src/topp/CometAdapter.cpp#L567-L579

timosachsenberg commented 6 months ago

@ypriverol this would actually be a good entry level task for a student that wants to get into OpenMS/C++

jpfeuffer commented 6 months ago

For both Lys-C and multiple enzymes you will need to give up consensus ID compatibility then. A fix for Lys-C is just simple if-case logic in the workflow.

Multi-enzyme support is a large change in both OpenMS and the workflow. OpenMS needs to support it in both the data structures and things like indexing. You need not only support for multiple enzymes but also logic for whether they were applied at the same time or one after the other. It will probably also not be compatible with your own or a workflow-generated decoy database unless you run multiple searches with different enzymes (and generate one decoy database for each enzyme). You will need to use Comet's decoy generation. Therefore it is probably easiest to run Comet without the adapter and convert to idXML later on.

timosachsenberg commented 6 months ago

I agree with Julianus that properly modelling multienzyme digestion is adding a lot of complexity. One note: you often see Lys-C/Trypsin combination because it improves cutting after K. From a search engine perspective, the combination can just be treated as Trypsin (or even Trypsin/P) because Lys-C basically cuts at a subset of Trypsin cutting sites. So maybe such complexity is not needed?
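A small sketch of why the Lys-C/Trypsin combination can be treated as plain Trypsin: every Lys-C site is also a Trypsin site, so adding Lys-C does not change the in-silico digest. The regexes below encode the usual "not before P" rules as an assumption. The example sequence is arbitrary, and `cleavage_sites` and `digest` are illustrative helpers, not OpenMS functions.

```python
import re

# Zero-width rules: match the position directly after the cut residue.
TRYPSIN = r"(?<=[KR])(?!P)"  # after K or R unless followed by P
LYS_C = r"(?<=K)(?!P)"       # after K unless followed by P


def cleavage_sites(sequence, rule):
    """Return the internal positions at which the rule cuts the sequence."""
    return {
        m.start()
        for m in re.finditer(rule, sequence)
        if 0 < m.start() < len(sequence)
    }


def digest(sequence, sites):
    """Split the sequence at the given positions (no missed cleavages)."""
    bounds = [0, *sorted(sites), len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


protein = "MKTAYIAKQRPQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example sequence
trypsin_sites = cleavage_sites(protein, TRYPSIN)
combined_sites = trypsin_sites | cleavage_sites(protein, LYS_C)

# The combined digest yields exactly the tryptic peptides.
print(digest(protein, combined_sites) == digest(protein, trypsin_sites))
```

This ignores missed cleavages and cleavage efficiency, which is where adding Lys-C helps in the wet lab rather than in the search.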

ypriverol commented 6 months ago

I'm trying to tackle the first use case here, which is quite common: the use of another enzyme, not multi-enzyme. Then it should be easy to extend OpenMS with additional enzymes and support them.

jpfeuffer commented 6 months ago

We could make this workaround for this special case on the workflow level by allowing multi enzymes on workflow level only. Then you would see trypsin/lys-c in the workflow reports and trypsin as far as OpenMS is concerned.

Or we start by adding this special case to OpenMS. (Introducing a new mix enzyme). This would be mainly for reporting reasons then.

ypriverol commented 6 months ago

I don't know why you want to do the mix enzyme. The problem is actually much simpler. We have Lys-C/P which in fact is supported by comet but the Adapter in OpenMS doesn't support it. I want to support it in OpenMS in order to be able to process the dataset that used only Lys-C/P with msgf+ and comet. No mix enzymes.

jpfeuffer commented 6 months ago

Ah ok I completely misread the issue then haha

timosachsenberg commented 6 months ago

Doing Arg-C and Lys-C before trypsin is not an issue.

<ITEM name="RegExDescription" value="Arg-C cuts after R residue unless the next residue is P." type="string" />
<ITEM name="RegExDescription" value="Lys-C cuts after K if not followed by P." type="string" />

But Glu-C, as listed on https://www.ebi.ac.uk/pride/archive/projects/PXD005200, is an issue. It cleaves mainly after E but also after D.
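The difference can be checked directly. This sketch lists the cut positions an enzyme introduces that a Trypsin search would not expect. The Arg-C and Lys-C rules follow the descriptions quoted above; the Trypsin rule and the Glu-C rule (after E or D, with no further restriction) are simplifying assumptions. The sequence and the helper names are made up for illustration.

```python
import re

RULES = {
    "Trypsin": r"(?<=[KR])(?!P)",  # assumed: after K or R unless followed by P
    "Arg-C": r"(?<=R)(?!P)",       # after R unless the next residue is P
    "Lys-C": r"(?<=K)(?!P)",       # after K if not followed by P
    "Glu-C": r"(?<=[DE])",         # simplified assumption: after E or D
}


def sites(sequence, enzyme):
    """Return the internal cut positions of the enzyme in the sequence."""
    return {
        m.start()
        for m in re.finditer(RULES[enzyme], sequence)
        if 0 < m.start() < len(sequence)
    }


def extra_sites(sequence, enzyme):
    """Cut positions of the enzyme that Trypsin would not produce."""
    return sites(sequence, enzyme) - sites(sequence, "Trypsin")


sequence = "MAEKDRPGKL"  # arbitrary example sequence
for enzyme in ("Arg-C", "Lys-C", "Glu-C"):
    print(enzyme, sorted(extra_sites(sequence, enzyme)))
```

Arg-C and Lys-C add nothing beyond the tryptic sites, while Glu-C adds cuts after E and D, so peptides from such a digest would look semi- or non-tryptic to a Trypsin search.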

timosachsenberg commented 3 weeks ago

> I don't know why you want to do the mix enzyme. The problem is actually much simpler. We have Lys-C/P which in fact is supported by comet but the Adapter in OpenMS doesn't support it. I want to support it in OpenMS in order to be able to process the dataset that used only Lys-C/P with msgf+ and comet. No mix enzymes.

Ok that should be easy. You mean similar to https://github.com/OpenMS/OpenMS/pull/7422/files