Proteobench / ProteoBench

ProteoBench is an open and collaborative platform for community-curated benchmarks for proteomics data analysis pipelines. Our goal is to allow a continuous, easy, and controlled comparison of proteomics data analysis workflows.
https://proteobench.readthedocs.io
Apache License 2.0

protein groups parsing for FragPipe #301

Open mlocardpaulet opened 1 month ago

mlocardpaulet commented 1 month ago

I don't think we parse the ions from FragPipe correctly. In the .toml, I see this:

[mapper]
"Peptide Sequence" = "Sequence"
Protein = "Proteins"
Charge = "Charge"

So do we completely ignore the Mapped Proteins field? In the combined.tsv table, there are two columns that we need to concatenate to get the protein groups: Protein (mapped to Proteins in the .toml) and Mapped Proteins. If we only consider the first one, we only keep one accession from the group. This is not what we do for the other pipelines. It also likely inflates the quantification error, because peptide sequences that match several species are then treated as belonging to a single species and are included in the error calculation.

Here is what I get from the FragPipe output file description (https://fragpipe.nesvilab.org/docs/tutorial_fragpipe_outputs.html#combined_iontsv):

Protein: protein sequence header corresponding to the identified peptide sequence; this will be the selected razor protein if the peptide maps to multiple proteins (in this case, other mapped proteins are listed in the ‘Mapped Proteins’ column)

So what we need to do is: take the value from the Protein column, concatenate it with the value of Mapped Proteins, and use the result as the protein group identifier. @brvpuyve correct me if I am wrong. For information, when there are several accessions in Mapped Proteins, they are separated by ",".
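To make the proposal concrete, here is a minimal sketch of the concatenation step in plain Python. It is not ProteoBench's actual parsing code. The helper name `build_protein_group`, the ";" used to join accessions in the output, and the example rows and accessions are all assumptions made for illustration. Only the column names Protein and Mapped Proteins and the "," separator come from the discussion above.

```python
import csv
import io


def build_protein_group(row, protein_col="Protein",
                        mapped_col="Mapped Proteins", out_sep=";"):
    """Join the razor protein and the other mapped proteins into one string.

    Hypothetical helper: out_sep=";" is an assumed output convention.
    """
    accessions = [(row.get(protein_col) or "").strip()]
    mapped = (row.get(mapped_col) or "").strip()
    if mapped:
        # Mapped Proteins holds several accessions separated by ","
        accessions.extend(part.strip() for part in mapped.split(","))
    # Drop empty entries and duplicates while keeping the original order
    unique = []
    for acc in accessions:
        if acc and acc not in unique:
            unique.append(acc)
    return out_sep.join(unique)


# Made-up two-row table standing in for the FragPipe output (not real data)
example_tsv = (
    "Peptide Sequence\tCharge\tProtein\tMapped Proteins\n"
    "PEPTIDEK\t2\tsp|P00001|A_HUMAN\tsp|P00002|A_YEAST, sp|P00003|A_ECOLI\n"
    "LVNELTEFAK\t2\tsp|P00004|B_HUMAN\t\n"
)

rows = list(csv.DictReader(io.StringIO(example_tsv), delimiter="\t"))
groups = [build_protein_group(r) for r in rows]
for row, group in zip(rows, groups):
    print(row["Peptide Sequence"], group)
```

With a group string like this, a peptide shared between species keeps all of its accessions, so downstream steps can recognise it as multi-species instead of assigning it to the razor protein's species only.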

RobbinBouwmeester commented 1 month ago

Can you share the file? That should help us debug.

brvpuyve commented 1 month ago

@mlocardpaulet you are completely correct! My apologies for not spotting this sooner.