bigbio / proteomics-sample-metadata

The Proteomics sample metadata: Standard for experimental design annotation in proteomics datasets
GNU General Public License v2.0

Annotation questions (general & bacteria specific) #667

Closed Cajac102 closed 1 year ago

Cajac102 commented 1 year ago

Hey,

I'm currently annotating bacterial projects from PRIDE and ran into a few questions along the way. Since these are the first SDRFs I'm writing, and I'd like to stick to the protocol as closely as possible, I'd be glad if someone could help me out here :)

General:
- How to annotate multiple cleavage agents, e.g. in projects that use LysC + Trypsin? It's recommended to merge them into one column, but then there's no EFO term for the combination.
- How to annotate samples with >1 collision energy settings? In the annotation example of PXD021874 it's done with semicolons ("25%; 30%; 35% NCE"), but I couldn't find it in the documentation. Also, there are examples with and without the "%", which one is preferred for NCE values?
- For the precursor tolerance, what to annotate in case of first and main searches? Both values, or only the one for the first or main search?
- How to annotate samples that have been put through affinity purification before? Is purification == enrichment, and therefore it belongs in characteristics[enrichment process]?

More specific for bacteria:
- How to annotate samples taken at different points in time? Similar to age, such as "3H", "6H", "9H"? Is there a fixed term for the characteristics column?
- How to annotate genetically engineered strains (CRISPR/plasmids)? Is that something for the "strain" column, or should it be in a separate one (and if so, what should it be called)?

I understand that users are allowed to create their own columns, and I feel like in the case of bacteria, there are a few parameters that will occur regularly but could potentially be assigned to different columns/called differently (strain & substrain, growth stage vs. development stage, growth medium vs. medium vs. treatment etc). Would this already justify a separate template in addition to the "non vertebrate" one or would this overcomplicate it for info that is not crucial?

-In the example of PXD017035, there's a column called "material type" where all the values are "bacterial strain". It's not explained in the documentation, and I could not quite untangle the info I found in other issues. Is it a deprecated feature?

Cheers, Caro

StSchulze commented 1 year ago

Hi Caro,

I'm curious about the answers to your first questions as well, especially regarding the enzymes and precursor tolerance. (For the collision settings, you saw how I annotated it for PXD021874, but I don't know if that's really the preferred way.)

For the purification process, I would say affinity purification counts as enrichment, yes.

With regard to bacteria-specific questions: I would use the "characteristics[growth condition]" column for the time points. And the "strain" column is the best place for the genetic information, in my opinion, including CRISPR/plasmids.

Those are just my thoughts, though, so I'm curious to hear what others think. Having a separate template for prokaryotes wouldn't hurt, in my opinion.

Best, Stefan

ypriverol commented 1 year ago

Dear Caro: Sorry for my late reply. Here are my suggestions:

> General: -How to annotate multiple cleavage agents, e.g. in projects that use LysC + Trypsin? It's recommended to merge them into one column, but then there's no EFO term for the combination.

We recommend splitting them into two columns because, as you mentioned, no ontology term is available for the combination of the two enzymes. It is also easier to read and understand. If you can point me to the place in the specification where merging both enzymes is recommended, I can correct it.
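
As a rough illustration of the two-column layout, here is a minimal sketch that writes the same cleavage-agent column once per enzyme. The helper names `build_sdrf_rows` and `to_tsv` are hypothetical, the `NT=` values are illustrative with ontology accessions left out, and a real SDRF needs many more columns than shown here.

```python
import csv
import io

# Column name repeated once per enzyme (assumed SDRF column for cleavage agents).
CLEAVAGE_COLUMN = "comment[cleavage agent details]"


def build_sdrf_rows(samples, enzymes):
    """Return a header plus one row per sample, with one column per enzyme."""
    header = ["source name"] + [CLEAVAGE_COLUMN] * len(enzymes)
    rows = [header]
    for sample in samples:
        # Illustrative values only; accessions (AC=...) are deliberately omitted.
        rows.append([sample] + [f"NT={enzyme}" for enzyme in enzymes])
    return rows


def to_tsv(rows):
    """Serialise the rows as tab-separated text."""
    buffer = io.StringIO()
    # csv.writer is used instead of DictWriter because the header repeats a name.
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_tsv(build_sdrf_rows(["sample 1", "sample 2"], ["Lys-C", "Trypsin"])))
```

A list-based writer is used because a dictionary keyed by column name could not hold the repeated column.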

> -How to annotate samples with >1 collision energy settings? In the annotation example of PXD021874 it's done with semicolons ("25%; 30%; 35% NCE"), but I couldn't find it in the documentation. Also, there are examples with and without the "%", which one is preferred for NCE values?

As you pointed out, some of the recommendations are not checked by the Python validator, which is why you can find some examples with NCE and others without. We can include some of your suggestions in the validation. I think you should annotate them as the recommendation suggests: 25%; 30%; 35% NCE
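
To make the semicolon-separated form concrete, here is a minimal sketch of how such a value could be read back into numbers. `parse_nce` is a hypothetical helper, not part of the Python validator, and accepting values with or without "%" and "NCE" is an assumption made for this illustration.

```python
import re

# One entry: a number, optionally followed by "%" and/or "NCE".
_NCE_PART = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*%?\s*(?:NCE)?\s*$", re.IGNORECASE)


def parse_nce(value):
    """Split a value such as '25%; 30%; 35% NCE' into a list of floats."""
    energies = []
    for part in value.split(";"):
        match = _NCE_PART.match(part)
        if match is None:
            raise ValueError(f"unrecognised collision energy: {part!r}")
        energies.append(float(match.group(1)))
    return energies


if __name__ == "__main__":
    print(parse_nce("25%; 30%; 35% NCE"))
```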

> -For the precursor tolerance, what to annotate in case of first and main searches? Both values, or only the one for the first or main search?

With two searches, you first open up the search space and then narrow it in the second search. I suggest annotating the more stringent tolerances, i.e. the ones closer to the tolerances behind the final results.

> -How to annotate samples that have been put through affinity purification before? Is purification == enrichment, and therefore it belongs in characteristics[enrichment process]?

Yes, affinity purification is considered an enrichment because it makes certain proteins/peptides more detectable than others.

> More specific for bacteria: -How to annotate samples taken at different points in time? Similar to age, such as "3H", "6H", "9H"? Is there a fixed term for the characteristics column? -How to annotate genetically engineered strains (CRISPR/Plasmids)? Is that something for the "strain" column, or should it be in a separate one (how to call it)?

Time points can be captured using the time term available in OLS (https://www.ebi.ac.uk/ols/ontologies/efo/terms?iri=http%3A%2F%2Fwww.ebi.ac.uk%2Fefo%2FEFO_0000721). For example: characteristics[time] -> 2 hour
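
For sample labels like "3H", "6H", "9H", a small conversion into the "2 hour" style could look like the sketch below. `to_time_value` and its unit table are hypothetical, and the supported abbreviations are an assumption for this example only.

```python
import re

# Assumed abbreviations; extend as needed for a given project.
_UNITS = {"min": "minute", "h": "hour", "d": "day"}
_LABEL = re.compile(r"\s*(\d+)\s*(min|h|d)\s*", re.IGNORECASE)


def to_time_value(label):
    """Convert a short label such as '3H' into a value like '3 hour'."""
    match = _LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"unrecognised time label: {label!r}")
    amount = int(match.group(1))
    unit = _UNITS[match.group(2).lower()]
    return f"{amount} {unit}"


if __name__ == "__main__":
    for label in ["3H", "6H", "9H"]:
        print("characteristics[time] ->", to_time_value(label))
```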

> I understand that users are allowed to create their own columns, and I feel like in the case of bacteria, there are a few parameters that will occur regularly but could potentially be assigned to different columns/called differently (strain & substrain, growth stage vs. development stage, growth medium vs. medium vs. treatment etc). Would this already justify a separate template in addition to the "non vertebrate" one or would this overcomplicate it for info that is not crucial?

You are free to define and use any columns you want; the more metadata you provide, the better. Remember that each column needs to be a valid ontology term from one of the supported ontologies. We can add more templates in the future. However, the purpose of the templates is to capture the minimum metadata needed to understand the class of experiment they represent. We haven't added more templates because their number can explode quickly and make the specification more complex and harder to understand.

I think we should do more to motivate the community to create and contribute templates. If that is OK with you, I will include your name in the repo, and we can start the discussion about extending the templates.

> -In the example of PXD017035, there's a column called "material type" where all the values are "bacterial strain". It's not explained in the documentation, and I could not quite untangle the info I found in other issues. Is it a deprecated feature?

Actually, Material Type is a column kept for compatibility with RNA-Seq and other omics experiments (https://tab2mage.sourceforge.net/docs/sdrf.html#material_type). We never standardized what to put in that column, and it is now used for free-text annotation of the sample's material type.

Thanks a lot for your questions, and please feel free to reply and continue this discussion.

Cajac102 commented 1 year ago

Dear Stefan, dear Yasset, Thanks for the extensive answers!

> And the "strain" column is the best place for the genetic information, in my opinion, including CRISPR/plasmids.

That's also how I feel. The only problem I currently see is the case where you want to use the strain information to select the target database. But perhaps an additional column with the UniProt proteome ID would be a better idea anyway.

> Please, if you can point me to the place in the specification when the recommendation was done to merge both enzymes, I can make a correction on that.

This is how I interpreted the statement that "it is RECOMMENDED not to use the same column in the same file". But when I searched for it in the documentation just now, I realised that it relates to the characteristics columns, not the comment columns, so that was a mistake on my part.

> However, the purpose of the templates is the minimum metadata needed to understand the class of the experiment they represent. We haven't added more templates because it can explode quickly and make the specification more complex and difficult to understand.

I can totally understand that. Better have minimal annotation than none at all!

> I think we should motivate more the community to incorporate and create more templates. If you are OK, I will include your name in the repo, and we can start triggering the discussion about extending templates.

Of course! I'd be happy to help. I think a template for prokaryotes would be helpful to make SDRFs more comparable even when they are annotated by different people.

Cheers, Caro