How to store protein groups and gene groups in feature and PSM files.

ypriverol commented 1 month ago

For a while, we have been avoiding modelling protein groups in the PSM and feature files of the format. @jpfeuffer raised this issue a long time ago. Our main tools and pipelines (DIA-NN, OpenMS TMT, and OpenMS LFQ) handle protein groups in different ways.

In addition, we are using the feature file as input to ibaqpy with MaxQuant. I'm now proposing to handle this as ProteinGroup and GeneGroup fields, each containing a list of all the proteins and genes to which the peptide maps: https://github.com/bigbio/quantms.io/blob/dev/docs/README.adoc#111-common-peptide-fields

It would be good to have your input here. I'm also trying to understand how DIA-NN handles protein inference. In the previous release of quantms we were taking the Protein.Ids, but I would like to have this documented. Ideas @jpfeuffer @timosachsenberg @zprobot

jpfeuffer commented 1 month ago

I think we have two options, plus a combination of both:

1. Define a simple grouping ourselves and use that for all programs.
2. Output whatever the program reports as grouping.

I guess if we only do one, I would go for the second option as it is easier.

I am sure we can find a representation that can accommodate the output of all programs. We can implement some heuristics in case some tools do not output gene groups.
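One possible heuristic for tools that report protein groups but no gene groups is to derive the gene group from the protein accessions. This is only a minimal sketch: the `PROTEIN_TO_GENE` lookup and the function name `derive_gene_group` are hypothetical, and in practice the mapping would have to come from the FASTA headers or an annotation table.

```python
# Hypothetical accession -> gene name lookup (placeholder values for illustration).
PROTEIN_TO_GENE = {"P12345": "GENE1", "P12346": "GENE1", "Q99999": "GENE2"}


def derive_gene_group(pg_accessions, protein_to_gene):
    """Return the sorted, de-duplicated gene names for a protein group.

    Accessions without a known gene are skipped rather than guessed.
    """
    genes = {protein_to_gene[acc] for acc in pg_accessions if acc in protein_to_gene}
    return sorted(genes)


gene_group = derive_gene_group(["P12345", "P12346", "Q99999"], PROTEIN_TO_GENE)
```

Two isoforms of the same gene collapse into one gene name, so the gene group can be shorter than the protein group.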

ypriverol commented 1 month ago

daichengxin commented 1 month ago

I'm also in favor of jpfeuffer's idea of directly exporting the reported protein group. It may also be necessary to indicate which tool reported it, because the grouping algorithm may differ between tools.

enryH commented 1 month ago

As long as the protein groups, which in the vast majority of cases (>95%) map to isoforms of a single gene, denote that the peptide can be found in these proteins, I don't have a strong preference.

If you want a comparable format, the format should not depend on the software (option 1). I agree that option 2 is easier to implement, but then it should be clear for a feature file which program was used to generate the file (+ version).

ypriverol commented 1 month ago

Currently, DIA-NN, Spectronaut and other DIA tools report only one protein group per feature/PSM at the feature level, in the same way discussed here https://github.com/vdemichev/DiaNN/issues/22. Each peptide/feature is reported together with all the protein accessions to which the feature maps.

The problem comes with MQ, FragPipe and other tools that not only support the protein group but report two things:

Anchor protein with the given score etc.
Other proteins in the groups, no scores for them (in most cases).

I prefer the option of exporting, in the feature and PSM files, all proteins to which the feature maps, but I understand that this completely removes the tool's inference, which is also not desirable.
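The trade-off above can be sketched as keeping both pieces of information: the full list of mapped proteins and the tool's anchor. This is a rough sketch under one loud assumption, namely that the tool writes the group as a delimited string whose first accession is the anchor; that convention is illustrative and has to be checked per tool. The helper name `split_group` is hypothetical.

```python
def split_group(group_field, sep=";"):
    """Split a delimited protein-group string into anchor and members.

    ASSUMPTION (illustration only): the first accession is the anchor protein.
    """
    members = [acc.strip() for acc in group_field.split(sep) if acc.strip()]
    anchor = members[0] if members else None
    return {"pg_accessions": members, "pg_anchor": anchor}


record = split_group("P1; P2;P3")
```

Storing the anchor next to the full list keeps the "all proteins" view queryable without discarding what the tool inferred.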

timosachsenberg commented 1 month ago

@jpfeuffer what is your take on Yasset's comment? I see the proteins schema file and wonder if inference information could be preserved there. Would that make sense?

jpfeuffer commented 1 month ago

How about just another file with the protein groups, keeping the "list all proteins a peptide maps to" approach for the other files? Isn't that basically what is done in mzTab?

Or yes, as Timo hinted, add it to the protein file: add another column in the protein table (indicating the group it belongs to), and if it is the master protein of that group or not. If you need information about the group, such as a name, you might want to have a separate file, though.

jpfeuffer commented 1 month ago

You will have to think about which queries you want to be fast. Looking up groups etc. can become tricky or slow. Or decide whether it is just about having complete information.
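The protein-table variant (a group column plus a master flag) and the lookup cost mentioned above can be sketched as follows. The column names `accession`, `group_id` and `is_master` are hypothetical and not part of the current spec; the rows are placeholder values.

```python
from collections import defaultdict

# Hypothetical protein table rows (illustrative column names and values).
protein_rows = [
    {"accession": "P1", "group_id": 0, "is_master": True},
    {"accession": "P2", "group_id": 0, "is_master": False},
    {"accession": "P3", "group_id": 1, "is_master": True},
]


def build_group_index(rows):
    """Map group_id -> member accessions so a group lookup avoids a full scan."""
    index = defaultdict(list)
    for row in rows:
        index[row["group_id"]].append(row["accession"])
    return dict(index)


def master_of(rows, group_id):
    """Return the master accession of a group, or None if none is flagged."""
    for row in rows:
        if row["group_id"] == group_id and row["is_master"]:
            return row["accession"]
    return None


group_index = build_group_index(protein_rows)
```

Without such an index, every "which proteins are in this group" query scans the whole table, which is the kind of slow lookup to weigh against simply having complete information.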

jpfeuffer commented 1 month ago

One thing that I recently noticed: in theory, if we do not recalculate things ourselves, we might need to add the relevant settings of the software (and version) that was used to the file as well. E.g. I think DIA-NN has multiple settings related to protein grouping.

Maybe not a problem if people use the whole bundle of outputs from quantms (incl. software/settings report). But maybe it is an issue if someone looks at the pqt files separately.

ypriverol commented 1 month ago

General remarks

I don't want to make the file verbose with settings, etc. Users have the logs from the software and the original output files. That mistake was made in mzTab, where the metadata section has hundreds of lines, and nobody uses it. This format is a data science-oriented file.
Another anecdotal thing: nobody uses the inference part of mzIdentML. If you read all the papers that use mzIdentML, 90% are using the PSM section.
I did add the software that generates the data and its version, like MaxQuant and version or quantms and version.

What is clear

Back to the inference problem:

I do think it is important to capture as much of the important and useful information as possible and leave out the non-relevant info.
Protein descriptions are not needed, at least not in the feature and PSM files; for those the user can go to the FASTA file.

Current structure based on columns:

We have two structures now (protein groups and gene groups), stored in three columns:
- pg_accessions: protein group accessions
- gg_accessions: gene group accessions
- gg_names: gene group names

These capture, for a feature, which proteins and genes the peptide maps to, and the generator is free to add here all the proteins where it maps.

Three columns are still pending:

- pg_scores: where we store the score of each protein in the protein group.
- gg_scores: where we store the gene score.
- pg_anchor: anchor protein for the group.
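A feature row using the three existing and three pending columns might look like the sketch below. The column names come from the lists above; everything else is an assumption made for illustration: the list types, the positional alignment of scores with accessions, the use of `None` for missing scores, and all values (which are placeholders). The helper `check_row` is hypothetical.

```python
# Placeholder row; types and alignment rules are assumptions, not the spec.
feature_row = {
    "pg_accessions": ["P1", "P2"],
    "gg_accessions": ["G1"],
    "gg_names": ["GENE1"],
    "pg_scores": [0.99, None],   # assumed: one score per accession, None if the tool gives none
    "gg_scores": [0.98],         # assumed: one score per gene accession
    "pg_anchor": "P1",
}


def check_row(row):
    """Return a list of consistency problems for one feature row."""
    errors = []
    if len(row["pg_scores"]) != len(row["pg_accessions"]):
        errors.append("pg_scores is not aligned with pg_accessions")
    if len(row["gg_scores"]) != len(row["gg_accessions"]):
        errors.append("gg_scores is not aligned with gg_accessions")
    if row["pg_anchor"] is not None and row["pg_anchor"] not in row["pg_accessions"]:
        errors.append("pg_anchor is not a member of pg_accessions")
    return errors


problems = check_row(feature_row)
```

Parallel columns only stay meaningful if their order is kept in sync, which is one argument for the nested struct discussed next.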

Struct for inference

The other option is to keep only the very basic information in columns, such as all the protein IDs that the peptide maps to and the gene names, and then model the protein group in a struct. That struct will rarely be queried, and it can contain the start and end positions as well as the scores of the proteins, the anchor protein, etc. The struct can be empty for datasets with no inference, or filled for datasets with protein inference.

We can have something like:

# Inference method for the peptide-protein mapping
"inference_method": STRING,          # Method used for protein inference (e.g., Parsimony, Bayesian)

# Nested structure for protein evidence (empty array when no inference is available)
"pg_evidence": ARRAY<STRUCT<
    "protein_accession": STRING,     # Protein accession for this peptide's match
    "start_position": INT,           # Start position of the peptide in the protein sequence
    "end_position": INT,             # End position of the peptide in the protein sequence
    "evidence_score": DOUBLE,        # Score assigned for this match (e.g., Andromeda score for MaxQuant)
    "is_decoy": BOOLEAN,             # Indicates if this protein is a decoy
    "is_anchor": BOOLEAN             # Highlights if this protein is the anchor protein (the main protein in the group)
>>
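For readers who prefer to see the proposed struct as code, here is a minimal in-memory sketch. The field names follow the proposal above; the Python types, the optional score and the two helper functions are assumptions added for illustration and are not part of the format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProteinEvidence:
    protein_accession: str
    start_position: int
    end_position: int
    evidence_score: Optional[float]  # assumed optional: some tools score only the anchor
    is_decoy: bool
    is_anchor: bool


def anchor_accession(pg_evidence: List[ProteinEvidence]) -> Optional[str]:
    """Return the accession flagged as anchor, or None (e.g. empty struct, no inference)."""
    for evidence in pg_evidence:
        if evidence.is_anchor:
            return evidence.protein_accession
    return None


def mapped_accessions(pg_evidence: List[ProteinEvidence]) -> List[str]:
    """Return every accession the peptide maps to, in stored order."""
    return [evidence.protein_accession for evidence in pg_evidence]


# Placeholder values for illustration only.
pg_evidence = [
    ProteinEvidence("P1", 10, 20, 120.5, False, True),
    ProteinEvidence("P2", 33, 43, None, False, False),
]
```

An empty list represents a dataset without inference, in which case `anchor_accession` simply returns `None`.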

enryH commented 1 month ago

I think it sounds fair to keep track of the inference result when it is given. To create a feature file from MaxQuant, this means you need to parse both the proteinGroups.txt file and the evidence.txt (precursor-level) file.

evidence.txt

- Gene Names: gene group a peptide maps to
- Proteins: protein IDs that a peptide maps to, in principle
- Protein group IDs: depending on your settings, shared peptides can be assigned uniquely to the majority protein/group, but that is not a strict setting (just the default?). The information can then be found in proteinGroups.txt.

Protein group IDs | The identifier of the protein-group this redundant peptide sequence is associated with, which can be used to look up the extended protein information in the file ‘proteinGroups.txt’. As a single peptide can be linked to multiple proteins (e.g. in the case of razor-proteins), multiple id’s can be stored here separated by a semicolon. As a protein can be identified by multiple peptides, the same id can be found in different rows. (source)
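Joining the two files as described could look roughly like the sketch below. It uses tiny in-memory stand-ins with placeholder values instead of real MaxQuant output. The `Protein group IDs` column and the semicolon separator come from the quoted documentation; the other column names used here (`Sequence`, `Proteins`, `id`, `Majority protein IDs`) are assumptions that should be checked against the actual files.

```python
import csv
import io

# In-memory stand-ins for the two tab-separated files (placeholder values).
EVIDENCE_TXT = "Sequence\tProteins\tProtein group IDs\nPEPTIDEK\tP1;P2\t0;1\n"
PROTEIN_GROUPS_TXT = "id\tMajority protein IDs\n0\tP1\n1\tP2;P3\n"


def read_tsv(text):
    """Parse tab-separated text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def groups_for_evidence(evidence_rows, group_rows):
    """Map each peptide sequence to the member lists of its protein groups."""
    by_id = {row["id"]: row["Majority protein IDs"].split(";") for row in group_rows}
    result = {}
    for row in evidence_rows:
        group_ids = [gid for gid in row["Protein group IDs"].split(";") if gid]
        # Unknown group ids are skipped rather than raising an error.
        result[row["Sequence"]] = [by_id[gid] for gid in group_ids if gid in by_id]
    return result


peptide_groups = groups_for_evidence(read_tsv(EVIDENCE_TXT), read_tsv(PROTEIN_GROUPS_TXT))
```

A peptide linked to several groups yields one member list per group, which mirrors the razor-protein case described in the quoted documentation.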
