Protein group - Githubissues

GoogleCodeExporter commented 9 years ago

We should have a way to represent Protein Ambiguity Groups in mzTab. My suggestion is that we add an optional column with the CV term MS:1001591, which is the anchor protein. That way, we will know which protein is the anchor for each group and to which group each protein belongs.
Best regards

Original issue reported on code.google.com by ypriverol on 17 Oct 2014 at 3:13

ypriverol commented 8 years ago

@julianu @timosachsenberg @jpfeuffer what do you think?

@timosachsenberg can you share here the @jpfeuffer proposal.

timosachsenberg commented 8 years ago

I think we already have the notion of an anchor protein, as there is one dedicated accession and the others are ambiguity members.

ypriverol commented 8 years ago

@timosachsenberg My recommendation is to make the Protein ID in the protein section not UNIQUE, so that every software would be able to: 1 - represent one protein as the anchor protein in multiple groups, if they want to; 2 - represent properties such as scores for the member proteins of a group, if they want to. If the software does not want to represent this information, then all the proteins would be unique.

julianu commented 8 years ago

Just out of curiosity: why should a software use the same anchor protein for multiple groups? Is there any software that does this? Actually I was never happy with anchor or representative proteins, I would rather make the accession a list.

ypriverol commented 8 years ago

@julianu we have some datasets where the inference algorithm reports a protein in a protein group and then the same protein in a sub-group, for example. How can you move that to a flat structure if the ID of the protein must be unique in mzTab?

ypriverol commented 8 years ago

Ideas?

mvaudel commented 8 years ago

@ypriverol I would recommend having the protein group unique, not the representative/anchor/leading protein.

@julianu If you have one peptide shared between protein A and B, and another shared between B and C, you have two protein groups, AB and BC. If B is most likely there, it will be the representative/anchor/leading protein of both groups. An accession list is definitely what you want to have as the identifier, but having representative/anchor/leading proteins is very helpful for readability :)

Hope this helps!
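
The A/B/C example above can be sketched as follows. This is only an illustration under stated assumptions: the peptide sequences, the per-protein scores, and the function name `build_groups` are all hypothetical and do not come from mzTab or from any particular inference tool.

```python
def build_groups(peptide_to_proteins, protein_score):
    """Return one group per distinct set of proteins sharing a peptide.

    Each group is identified by its sorted accession list; the
    representative is the member with the highest (hypothetical) score.
    """
    groups = {}
    for proteins in peptide_to_proteins.values():
        members = tuple(sorted(proteins))
        representative = max(members, key=lambda acc: protein_score[acc])
        groups[members] = representative
    return groups


# Hypothetical data: one peptide shared by A and B, another by B and C.
peptide_to_proteins = {"PEPTIDEK": {"A", "B"}, "SAMPLER": {"B", "C"}}
protein_score = {"A": 0.4, "B": 0.9, "C": 0.3}

groups = build_groups(peptide_to_proteins, protein_score)
for members, representative in sorted(groups.items()):
    print(",".join(members), "->", representative)
```

Here the accession list is the unique key of each group, while B can still act as the representative of both groups.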

andrewrobertjones commented 8 years ago

Certainly for reporting quant data, it is essential that you keep one row per protein group in mzTab, otherwise it ruins downstream statistical processing. If same-set or subset proteins are reported on different lines, the quant data will be repeated, leading to incorrect downstream processing and results.

Even for ident data, I think it is better to keep one row per protein group. It is then completely obvious - how many proteins have been identified? Count the rows. This was a mistake we made in mzid 1.1 of not making the distinction between protein accessions and protein groups sufficiently clear. This is an opportunity to get it right for mzTab, so we shouldn't bend the encoding to fit in with one particular software's preferred way of exporting their data.

If you really want to report extra detail about group members, I would recommend keeping a single row (for ident and quant), but then adding a complicated cell at the end containing key-value pairs for all the extra data.
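
As a rough sketch of what such a key-value cell could look like to a reader: the delimiters used here (`|` between members, `:` after the accession, `;` between pairs, `=` inside a pair) are purely an assumption for illustration and are not defined by mzTab; they would also break for accessions that contain those characters.

```python
def parse_member_details(cell):
    """Parse 'ACC:key=value;key=value|ACC:...' into a nested dict.

    The delimiters are an assumption for this sketch, not mzTab syntax.
    """
    details = {}
    if cell in ("", "null"):
        return details
    for entry in cell.split("|"):
        accession, _, pairs = entry.partition(":")
        details[accession] = dict(
            pair.split("=", 1) for pair in pairs.split(";") if pair
        )
    return details


# Hypothetical content of one optional cell on the group's single row.
cell = "Protein 2:rank=2;score=0.8|Protein 3:rank=3;score=0.5"
members = parse_member_details(cell)
print(members["Protein 3"]["rank"])
```

The group still occupies exactly one row, so row counts and quant values stay untouched for tools that ignore the extra cell.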

ypriverol commented 8 years ago

@andrewrobertjones @mvaudel

The current implementation of mzTab RECOMMENDS that protein groups be reported in this way:

    Protein Accession    ....    Ambiguity members
    Protein 1            ....    Protein 2, Protein 3

The Protein Accession field MUST be unique, and this is the only constraint we put on the format. A writer can then give us a file like this:

1 - Software writers also report Protein 2, Protein 3, ... as rows of their own, because the file format allows that, including all the nice information about those proteins (scores, sequences, ranks, etc.). The reader then has to figure out what is going on in the file. Example:

    Protein Accession    ....    Ambiguity members       opt_global_cv_MS:1001301_protein_rank
    Protein 1            ....    Protein 2, Protein 3    1
    Protein 2            ....    NULL                    2
    Protein 3            ....    NULL                    3

Then readers, and the community in general, need to know what the writer was proposing, and this also opens the field to representing the protein inference information.

If we keep the current specification, we need to restrict these cases, because otherwise it would be impossible to handle the files, unless we add CV terms to spell out all possible combinations.

andrewrobertjones commented 8 years ago

@ypriverol I agree that restricting the format would be a good thing, although I can't envisage how this could be enforced, since anyone can make up a bad encoding if they wish to (same as in mzIdentML). However, it is quite possible to write a clear guideline that the protein section is for protein groups only, and that only those entities with independent evidence (e.g. under rules of parsimony) should be reported on a new line.

ypriverol commented 8 years ago

@andrewrobertjones these are two different things. If it is a "bad encoding if they wish to (same as in mzIdentML)", then it is not mzTab compliant.

If it is possible according to the schema, then the reader software should read everything and figure out what type of experiment it is, which would sometimes be a "guess".

julianu commented 8 years ago

My obvious suggestion would be to get rid of the mandatory uniqueness for the "protein accession" column and always support protein groups. This means introducing another mandatory row with a unique "group id". At the same time, it would be possible to remove "protein accession" and leave only "Ambiguity members", unless the proteins in "Ambiguity members" are allowed to be some kind of "sub-proteins" having less evidence, and not only equal accessions.

On the other hand: I always considered mzTab a very simplified (though somewhat standardised) format for reports, which would be better and more thoroughly encoded in mzIdentML. Therefore, I thought mzTab would only give an overview, not the whole truth.

andrewrobertjones commented 8 years ago

It would probably make sense to revisit what mzTab is trying to do overall. Is it now a flattened encoding of everything possible in mzIdentML or mzQuantML, or is it accepted that this is an intentionally lossy encoding that is useful for visualisation (and stats?)?

Julian's suggestion of having a column (presumably not row) for group_ID is one way that this could work, going further down the road of mzTab being a full encoding of the information. However, this would not work well for quantitative data, unless null values were placed throughout for every group member other than the group leader/representative protein. To me, one of the most useful cases for mzTab is being able to download or exchange quant data in this format, and load it straight into R. For this to work, all the extra info about group members is largely irrelevant.

jgriss commented 8 years ago

Hi all, When we created mzTab we deliberately did not encode the complete information. mzTab was always only intended to encode final results.

Protein groups can be loosely recorded using the main (reporter) accession column in combination with the ambiguity_members column. As pointed out by @andrewrobertjones, in downstream analysis pipelines you don't really care about anything else.

I therefore strongly suggest to keep mzTab as simple as possible and use mzIdentML / mzQuantML for the complex cases. That's how it's always meant to be. Otherwise, mzTab will become even more complicated to process and will neither contain a correct modelling of all use cases nor will it be easy to parse.

ypriverol commented 8 years ago

@jgriss @andrewrobertjones I FULLY agree with this. However, how the RECOMMENDATIONS in the mzTab specification were followed should be encoded in the file, with at least CV terms, to make the content of the file clear to its users, readers and consumers. For example:

    Protein Accession    ....    Ambiguity members       opt_global_cv_MS:1001301_protein_rank
    Protein 1            ....    Protein 2, Protein 3    1
    Protein 2            ....    NULL                    2
    Protein 3            ....    NULL                    3

This example is schema compliant, but it is not the recommended way of reporting the results. If a reader comes along and, as @andrewrobertjones pointed out, takes the list of proteins as the number of identified proteins, then the results are wrong. My vote is to keep it as simple as possible for now, AS IT IS, but to include some CV terms in the header stating how the user implemented our RECOMMENDATIONS.

jgriss commented 8 years ago

@ypriverol First of all, I want to stress that even though your example is valid according to the schema, it is not the way the format should be used.

I personally do not think that it is a good idea to add a mechanism that essentially breaks the main concept of mzTab. Every parser would then have to evaluate these additional cvParams to be sure to know what the reporter protein stands for.

I therefore prefer to adapt the schema specification and explicitly rule these cases out (ie. proteins mentioned as ambiguity members MUST NOT be reported as individual entries - actually, I was under the impression that this was already part of the specification)
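
A reader or validator could check this rule along the following lines. This is a minimal sketch, assuming a tab-separated protein section with `PRH`/`PRT` line prefixes, `accession` and `ambiguity_members` columns, a comma-delimited member list, and `null` for empty cells; the function name is hypothetical and most mandatory columns are omitted from the example.

```python
def members_reported_as_rows(mztab_text):
    """Return accessions listed in ambiguity_members that also have their own PRT row."""
    header = None
    accessions = set()
    members = set()
    for line in mztab_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "PRH":
            header = fields
        elif fields[0] == "PRT" and header is not None:
            row = dict(zip(header, fields))
            accessions.add(row["accession"])
            cell = row.get("ambiguity_members", "null")
            if cell != "null":
                members.update(m.strip() for m in cell.split(","))
    return sorted(accessions & members)


# Simplified protein section (most mandatory columns omitted for brevity).
example = "\n".join([
    "PRH\taccession\tambiguity_members",
    "PRT\tProtein 1\tProtein 2, Protein 3",
    "PRT\tProtein 2\tnull",
    "PRT\tProtein 3\tnull",
])
print(members_reported_as_rows(example))
```

An empty result would mean that no ambiguity member is repeated as an individual entry.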

andrewrobertjones commented 8 years ago

I'm in agreement with @jgriss on this one. Would be good to rule out the group being reported on multiple lines. Difficult to enforce but at least the spec doc should be written very clearly. I think there is a way to encode extra info about group members in optional columns

ypriverol commented 8 years ago

@andrewrobertjones @jgriss I agree to remove the complexity of Protein Inference from the mztab. However, we should make that clear in the specification. Probably we MUST change the specification in this paragraph:

Instead of SHOULD we can use MUST. The problem, @jgriss @andrewrobertjones, is that we already have some examples where the writers try to model the protein inference using CV terms, annotation of the proteins inside the protein groups, etc. That is the reason why I'm making this clear here.

@julianu @timosachsenberg @mvaudel are PIA, OpenMS and PeptideShaker in conflict with this?

javizca commented 8 years ago

As @jgriss said, when we developed the format we wanted to simplify the reporting of the protein inference. The current encoding was designed from the very beginning and actually did not change during the process. I am in favour of keeping the concept behind mzTab as it is now. The idea was never to replace mzIdentML (apart from the protein inference, mzTab is looking more and more like a flattened version of mzIdentML) or mzQuantML (in this case, mzTab is not that comprehensive by far).

However, as usually happens, life is more complex for readers than for writers. I agree that some guidelines need to be provided, although this will not prevent people from producing "wrong" files. In PRIDE, we need to be able to interpret the information correctly.

So, basically, in the context of protein groups we have two options:

1 - Keep things as they are now. There is one mechanism to "avoid" the fact that the protein accession is unique: adding [1], [2], ... after the accession number, if this is needed. This was added mainly for quantification purposes (the case explained by @mvaudel, but also if different proteoforms were reported), but it can only be applied to identification. Make clear in the guidelines that only one anchor protein and the corresponding ambiguity members need to be reported per row, and avoid the rest of the complexity. The format is lossy in that respect. There is no need to change the specification, but maybe create a version 1.0.1, amending the paragraph highlighted by Yasset and adding a new section to clarify this in detail. Of course, there is no way to enforce this in practice, but as not too many people are writing the files yet, I think we could probably get most people to write it the right way.
2 - If after some time we see that this is not enough, and there is a need to support Protein Groups, as Andy mentioned before, a new section just for Protein Groups could be added. That extra section would properly solve the problems related to the modelling of protein inference, but the changes would need to be agreed, it would take some time, etc.
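
For readers, the [1], [2], ... suffix mentioned in the first option can be separated from the underlying accession roughly like this. It is a sketch only: the accession `P12345` is a made-up example, the helper name is hypothetical, and it assumes the suffix is a plain integer in square brackets at the very end of the accession.

```python
import re

# Accession followed by a trailing "[n]" suffix, e.g. "P12345[2]".
_SUFFIX = re.compile(r"^(?P<base>.+?)\[(?P<index>\d+)\]$")


def split_accession(accession):
    """Return (base_accession, index); index is None when there is no suffix."""
    match = _SUFFIX.match(accession)
    if match is None:
        return accession, None
    return match.group("base"), int(match.group("index"))


print(split_accession("P12345[2]"))
print(split_accession("P12345"))
```

Grouping rows by the base accession then shows which rows refer to the same database entry.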

jgriss commented 8 years ago

I second @javizca's suggestion. Keep it as simple as it is and add a proper section once it's needed.

ypriverol commented 8 years ago

@jgriss @javizca We should leave it as it is now. The only problem is that the schema should respond to this, or at least reflect it. I agree that we MUST report only the interesting proteins in mzTab and leave the rest in mzIdentML. However, I think the part I highlighted in the specification MUST be changed to reflect that. For example, I have this Mascot mzTab file prototype, which is completely mzTab compliant:

https://github.com/PRIDE-Toolsuite/inspector-example-files/tree/master/mztab

They produce a valid file, but not the RECOMMENDED one. It is difficult to implement a parser or reader that can detect this difference. So my suggestion is that we change the specification as follows (it looks like a simple change, but it will make the file format more consistent):

Current version: Page 14

It is RECOMMENDED that “subset proteins” that are unlikely to have been identified SHOULD NOT be reported here.

Change to:

The “subset proteins” that are unlikely to have been identified MUST NOT be reported here as individual protein rows.

javizca commented 8 years ago

We can then change the specification document to version 1.0.1 (I guess this is needed) to change the phrasing of the paragraph that you mention. The change should also be highlighted somewhere else in the specification document, in a section called "Changes from version 1.0" or similar.

andrewrobertjones commented 7 years ago

I made the following change to the 1.1-draft doc: “Subset proteins” that are unlikely to have been identified MUST NOT be reported as additional rows.

andrewrobertjones commented 7 years ago

In fact, that change doesn't capture the bigger problem of same-set proteins being reported across multiple rows. I have made the following change to the draft specs:

“Subset proteins” that are unlikely to have been identified SHOULD NOT be reported in ambiguity_members. More generally, it is important the count of rows in the Protein table matches the number of proteins claimed to have been identified / quantified, and thus multiple accessions that do not have independent evidence MUST NOT be reported on separate rows.

I think this captures the major problems encountered.
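
The effect this wording guards against can be illustrated with a toy calculation. All accessions and abundance values below are invented; the point is only that repeating a group's quant value on one row per member inflates both the protein count and any downstream sum.

```python
def summarise(rows):
    """Return (number of rows, total abundance) for (accession, abundance) rows."""
    return len(rows), sum(abundance for _, abundance in rows)


# One row per protein group: "Protein 1" stands for the group {1, 2, 3}.
one_row_per_group = [("Protein 1", 120.0), ("Protein 4", 80.0)]

# Same data, but group members repeated on separate rows with the same value.
one_row_per_member = [
    ("Protein 1", 120.0),
    ("Protein 2", 120.0),
    ("Protein 3", 120.0),
    ("Protein 4", 80.0),
]

print(summarise(one_row_per_group))
print(summarise(one_row_per_member))
```

With one row per group, counting the rows gives the number of proteins claimed; with one row per member, the same evidence is counted three times.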

timosachsenberg commented 7 years ago

Sounds good. Does the last phrase allow for the same Phosphoprotein (different mod. isoform) reported in separate rows? If yes, I think we should explicitly state this there.

andrewrobertjones commented 7 years ago

Good point, I added this to the end of that section:

As detailed in the accession attribute of the Protein table (Section 6.3.1), separate rows can be used to encode different proteoforms (e.g. where differentially modified forms of a protein have been quantified by top down methods) from the same database accession.

HUPO-PSI / mzTab

Protein group #20