Open ypriverol opened 1 month ago
I think we have two options plus a combination of both:
I guess if we only do one, I would go for the second option as it is easier.
I am sure we can find a representation that can accommodate the output of all programs. We can implement some heuristics in case some tools do not output gene groups.
Related issue: https://github.com/vdemichev/DiaNN/issues/22
I'm also in favor of jpfeuffer's idea of directly exporting the reported protein group. It may also be necessary to indicate which tool reported the group, because the grouping algorithm may differ between tools.
As long as the protein groups, which in the vast majority of cases (>95%) map to isoforms from a single gene, denote that the peptide can be found in these proteins, I don't have a strong preference.
If you want a comparable format, the format should not depend on the software (option 1). I agree that option 2 is easier to implement, but then it should be clear for a feature file which program was used to generate the file (+ version).
Currently, DIA-NN, Spectronaut and other DIA tools report, at the feature level, only one protein group per feature/PSM, in the same way discussed here https://github.com/vdemichev/DiaNN/issues/22. Each peptide/feature is reported together with all the protein accessions that the feature maps to.
The problem comes with MQ, FragPipe and other tools that not only support the protein group but report two things:
I prefer the option of exporting, in the feature and PSM files, all proteins that the feature maps to, but I understand that this completely removes the tool's inference, which is also not desirable.
@jpfeuffer what is your take on Yasset's comment? I see the proteins schema file and wonder if inference information could be preserved there. Would that make sense?
How about just another file with the protein groups, keeping the "list all proteins a peptide maps to" approach for the other files? Isn't that basically what is done in mzTab?
Or yes, as Timo hinted, add it to the protein file: add another column in the protein table (indicating the group it belongs to), and if it is the master protein of that group or not. If you need information about the group, such as a name, you might want to have a separate file, though.
You will have to think about which queries you want to be fast. Looking up groups etc. can become tricky or slow. Or is it just about having complete information?
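A minimal sketch of the "extra column in the protein table" idea, to make the query concern concrete. The column names `group_id` and `is_master` and the accessions `P1` to `P3` are hypothetical and not part of any agreed schema. Without a prebuilt index, every group lookup is a scan over the whole protein table; with one, it is a dictionary access.

```python
from collections import defaultdict

# Hypothetical protein table: one row per protein, plus two extra columns.
# "group_id" and "is_master" are illustrative names, not an agreed schema.
protein_rows = [
    {"accession": "P1", "group_id": 1, "is_master": True},
    {"accession": "P2", "group_id": 1, "is_master": False},
    {"accession": "P3", "group_id": 2, "is_master": True},
]


def build_group_index(rows):
    """Map each group id to its member accessions, master protein first."""
    index = defaultdict(list)
    for row in rows:
        if row["is_master"]:
            index[row["group_id"]].insert(0, row["accession"])
        else:
            index[row["group_id"]].append(row["accession"])
    return dict(index)


# Built once, then each group lookup is a single dictionary access.
group_index = build_group_index(protein_rows)
```

If a group needs its own attributes (a name, a score), those would not fit in this table and would point towards the separate group file mentioned above.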
One thing that I recently noticed. In theory, if we do not recalculate things ourselves, we might need to add the relevant settings of the software (and version) that was used to the file as well. E.g. I think DIANN has multiple settings related to protein grouping.
Maybe not a problem if people use the whole bundle of outputs from quantms (incl. software/settings report). But maybe it is an issue if someone looks at the pqt files separately.
Back to the inference problem:
What is pending is three columns:
The other option is to keep in columns only the very basic information, such as all the protein IDs that the peptide maps to and the gene names, and then model the protein group in a struct. That struct will rarely be queried, and it can contain the start and end positions as well as the protein scores, the anchor protein, etc. The struct can be empty for datasets with no inference, or filled for datasets with protein inference.
We can have something like:
# Inference method for the peptide-protein mapping
"inference_method": STRING, # Method used for protein inference (e.g., Parsimony, Bayesian)
# Nested structure for protein evidence
"pg_evidence": ARRAY<STRUCT<
"protein_accession": STRING, # Protein accession for this peptide's match
"start_position": INT, # Start position of the peptide in the protein sequence
"end_position": INT, # End position of the peptide in the protein sequence
"evidence_score": DOUBLE, # Score assigned for this match (e.g., Andromeda score for MaxQuant)
"is_decoy": BOOLEAN, # Indicates if this protein is a decoy
"is_anchor": BOOLEAN # Highlights if this protein is the anchor protein (the main protein in the group)
>>
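To see how the struct behaves for datasets with and without inference, here is a rough Python sketch of the same shape. Only `inference_method` and the `pg_evidence` fields come from the proposal above. The `PeptideRecord` wrapper, its `peptide_sequence` and `protein_accessions` fields, the `anchor_protein` helper and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProteinEvidence:
    """One entry of the proposed pg_evidence array."""
    protein_accession: str
    start_position: int
    end_position: int
    evidence_score: Optional[float] = None
    is_decoy: bool = False
    is_anchor: bool = False


@dataclass
class PeptideRecord:
    """Hypothetical wrapper: basic columns plus the optional inference struct."""
    peptide_sequence: str
    protein_accessions: List[str]           # all proteins the peptide maps to
    inference_method: Optional[str] = None  # e.g. "Parsimony", "Bayesian"
    pg_evidence: List[ProteinEvidence] = field(default_factory=list)

    def anchor_protein(self) -> Optional[str]:
        """Return the anchor accession, or None if no inference was stored."""
        for evidence in self.pg_evidence:
            if evidence.is_anchor:
                return evidence.protein_accession
        return None


# Dataset without inference: the struct simply stays empty.
no_inference = PeptideRecord("PEPTIDEK", ["P1", "P2"])

# Dataset with inference: the struct is filled (values are invented).
with_inference = PeptideRecord(
    "PEPTIDEK",
    ["P1", "P2"],
    inference_method="Parsimony",
    pg_evidence=[
        ProteinEvidence("P1", 10, 17, 85.2, is_anchor=True),
        ProteinEvidence("P2", 42, 49, 85.2),
    ],
)
```

The basic `protein_accessions` column stays comparable across tools, while anything tool-specific lives in the rarely queried struct.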
I think it sounds fair to keep track of the inference result in case it is given. To create a feature file from MaxQuant, this means that you need to parse both the proteinGroups.txt file and the evidence.txt (precursor-level) file.
Gene Names: the gene group that a peptide maps to.
Proteins: the protein IDs that the peptide maps to in principle.
Protein group IDs: depending on your settings, MaxQuant can assign shared peptides uniquely to the majority protein / group, but that is not a strict setting (just the default?). The information can then be found in proteinGroups.txt.
Protein group IDs | The identifier of the protein-group this redundant peptide sequence is associated with, which can be used to look up the extended protein information in the file ‘proteinGroups.txt’. As a single peptide can be linked to multiple proteins (e.g. in the case of razor-proteins), multiple id’s can be stored here separated by a semicolon. As a protein can be identified by multiple peptides, the same id can be found in different rows. (source)
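As a sketch of what parsing both MaxQuant files could involve, the snippet below splits the semicolon-separated `Protein group IDs` field of each evidence row and looks the ids up in the protein groups table. The column names (`Sequence`, `Proteins`, `Protein group IDs`, `id`, `Protein IDs`, `Majority protein IDs`) are assumed from typical MaxQuant output and should be checked against the actual files. The in-memory rows and all function names are made up for illustration.

```python
import csv
import io

# Tiny invented stand-ins for the two tab-separated MaxQuant tables.
# Column names are assumed from typical MaxQuant output; verify them
# against real evidence.txt / proteinGroups.txt files.
EVIDENCE_TXT = (
    "Sequence\tProteins\tProtein group IDs\n"
    "PEPTIDEK\tP1;P2\t0\n"
    "SHAREDR\tP1;P3\t0;1\n"
)
PROTEIN_GROUPS_TXT = (
    "id\tProtein IDs\tMajority protein IDs\n"
    "0\tP1;P2\tP1\n"
    "1\tP3\tP3\n"
)


def split_ids(value):
    """Split a semicolon-separated field, dropping empty entries."""
    return [item for item in value.split(";") if item]


def read_protein_groups(handle):
    """Index protein groups by their id column."""
    groups = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[row["id"]] = {
            "proteins": split_ids(row["Protein IDs"]),
            "majority": split_ids(row["Majority protein IDs"]),
        }
    return groups


def attach_groups(evidence_handle, groups):
    """Keep all mapped proteins and add the tool's groups per evidence row."""
    records = []
    for row in csv.DictReader(evidence_handle, delimiter="\t"):
        group_ids = split_ids(row["Protein group IDs"])
        records.append({
            "sequence": row["Sequence"],
            "proteins": split_ids(row["Proteins"]),
            "protein_groups": [groups[g] for g in group_ids if g in groups],
        })
    return records


groups = read_protein_groups(io.StringIO(PROTEIN_GROUPS_TXT))
records = attach_groups(io.StringIO(EVIDENCE_TXT), groups)
```

With real files, the `io.StringIO` objects would be replaced by opened file handles. A shared peptide such as the second row ends up with two groups attached, which is the case the thread is trying to model.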
For a while, we have been avoiding Protein Group modelling in the PSM and feature files of the format. @jpfeuffer raised this issue a long time ago. Our main tools, the DIA-NN, OpenMS TMT, and OpenMS LFQ pipelines, handle protein groups in different ways. In addition, we are using the feature file as input to ibaqpy with MaxQuant. I'm now proposing to handle this as ProteinGroup and GeneGroup, each containing a list of all the proteins and genes that the peptide maps to: https://github.com/bigbio/quantms.io/blob/dev/docs/README.adoc#111-common-peptide-fields
It would be good to have your input here. I'm also trying to understand how DIA-NN handles protein inference. In the previous release of quantms, we were taking the Protein.Ids, but I would like to have this documented. Ideas @jpfeuffer @timosachsenberg @zprobot