bigbio / proteomics-sample-metadata

The Proteomics Experimental Design file format: Standard for experimental design annotation
GNU General Public License v2.0

New Example file from three PX projects #20

Closed ypriverol closed 4 years ago

ypriverol commented 4 years ago

Hi all:

I have just annotated three different projects from PRIDE (1000 raw files), covering more than 59 cell lines. Here is the example: https://github.com/bigbio/multiomics-configs/blob/master/datasets/NCI60/sdrf.tsv . We should be able to start the reanalysis from here.

Feedback is more than welcome @lgatto @mvaudel @mwalzer @trishorts @StSchulze @hbarsnes

trishorts commented 4 years ago

Need to understand the sheet more. Why is there only one protein accession? Also would be good to know precursor tolerance. What about sample treatments? What about enrichment? Are these whole cell lysates or phosphopeptide enrichment? Do you have a way to specify top-down/bottom-up? Other MS data like orbitrap v. iontrap v. tof. Ion mobility is getting good. Do we know that? What about DIA and DDA? Maybe this is all in another file. How do I know?

ypriverol commented 4 years ago

Hi @trishorts:

Here are some ideas about the tab-delimited file. These annotations correspond to projects PXD005940, PXD005942 and PXD005946, manuscript http://europepmc.org/abstract/MED/23933261 from the @kuster lab and @mwilhelm42. Replies to some of your questions:

Why is there only one protein accession?

The file does not contain protein accessions, only experimental design metadata.

Also would be good to know precursor tolerance.

This is a really good point. However, the authors of the paper used the MaxQuant precursor tolerance algorithm: "Mass accuracy of the precursor ions was determined by the time-dependent recalibration algorithm of Maxquant". We need to decide what to put here: a default precursor tolerance for the instrument, or leave it empty?

What about sample treatments?

What kinds of sample treatments does your tool need to know about in order to process the RAW files?

What about enrichment? Are these whole-cell lysates or phosphopeptide enrichment?

If the sample is not enriched, then we will not put any term about that. Or do you think we need to add the term whole-cell lysates explicitly?

do you have a way to specify top-down/bottom-up?

Should we add this term as well?

Other MS data like orbitrap v. iontrap v. tof. Ion mobility is getting good. do we know that? What about DIA and DDA.

Do we need to define the minimum set of terms needed to reprocess/reanalyze the data?

Maybe this is all in another file. How do I know?

I don't want to add another file alongside this one, because it would be complex for the tool to have to read more than one file. I like the idea of having one big tab-delimited file with "as much as possible" metadata, enough to represent the experimental design.
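As a minimal sketch of what the single-file approach means for a tool, the snippet below reads one tab-delimited design file and groups raw files by cell line. The column names (`source name`, `cell line`, `data file`) and the two sample rows are hypothetical placeholders for illustration; they are not taken from the NCI60 file linked above.

```python
import csv
import io

# Hypothetical two-row excerpt; column names and values are illustrative only.
SAMPLE_TSV = (
    "source name\tcell line\tdata file\n"
    "sample 1\tcell line A\trun_01.raw\n"
    "sample 2\tcell line B\trun_02.raw\n"
)


def read_design(handle):
    """Return one dict per row of a tab-delimited design file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [dict(row) for row in reader]


def files_by_cell_line(rows):
    """Group raw file names by the (assumed) 'cell line' column."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["cell line"], []).append(row["data file"])
    return grouped


rows = read_design(io.StringIO(SAMPLE_TSV))
groups = files_by_cell_line(rows)
```

With everything in one file, a tool needs a single parsing step like `read_design` before it can decide how to process each raw file. Note that `csv.DictReader` keeps only the last value when a header is repeated, so a file with repeated column names would need `csv.reader` instead.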

ypriverol commented 4 years ago

As discussed by @jgriss, I have included the PRIDE URI for each raw file: https://github.com/bigbio/multiomics-configs/blob/master/datasets/NCI60/sdrf.tsv

StSchulze commented 4 years ago

I'm not exactly sure if the aim is to reflect the 1) sample preparation + MS measurement in order to have all information needed for reanalysis or 2) search parameters of the original analysis in order to reproduce the search results (or reanalyze/modify/...)

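In case of 1) I'm missing the following:

  • MS1 and MS2 resolution. I think this would be more useful than precursor/fragment mass tolerance, since different users might set them quite differently for the same resolution. It would also get around the MaxQuant problem.
  • scan range
  • rejected/accepted charge states
  • defining the chemical treatment rather than giving a searched modification might be more suitable

In case of 2) I think there might be even more, but some fairly important ones:

  • used database
  • charges
  • peptide length
  • searched ions (x/y/z, a/b/c, etc)
  • used search engine(s)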

magnuspalmblad commented 4 years ago

I agree resolving power (a.k.a. "resolution") is great to have, but it is m/z-dependent (especially in FTMS) and different vendors use different definitions. One well-known manufacturer even changes theirs depending on what their marketing people want (sometimes m/z 400, sometimes 200...). The definition must be specified for the number to be meaningful.
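To illustrate why the reference m/z matters, here is a small sketch that converts a resolving power quoted at one m/z to another. It assumes the usual idealized scaling for FTMS analyzers: proportional to 1/sqrt(m/z) for Orbitrap-type instruments and to 1/(m/z) for FT-ICR. The function name and the value 60000 are illustrative and do not come from the annotated datasets.

```python
import math

# Idealized scaling exponents: R is proportional to (m/z) ** -exponent.
SCALING_EXPONENT = {"orbitrap": 0.5, "fticr": 1.0}


def rescale_resolving_power(r_ref, mz_ref, mz_target, analyzer="orbitrap"):
    """Convert a resolving power quoted at mz_ref to its value at mz_target."""
    exponent = SCALING_EXPONENT[analyzer]
    return r_ref * (mz_ref / mz_target) ** exponent


# Illustrative number only: 60000 quoted at m/z 400, re-expressed at m/z 200.
r_orbitrap_200 = rescale_resolving_power(60000, 400, 200, analyzer="orbitrap")
r_fticr_200 = rescale_resolving_power(60000, 400, 200, analyzer="fticr")
```

Under these assumptions the same instrument setting reads about 41% higher at m/z 200 than at m/z 400 on an Orbitrap, and twice as high on an FT-ICR, so a bare number without its reference m/z is ambiguous.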

Mass measurement uncertainty is limited by resolving power, but also depends on calibration and other factors. I think the most useful simple metric is the global mass measurement uncertainty for MS1 and MS2, as these inform the mass measurement error tolerance set in the peptide identification. As with resolving power, it is not always easy for the data submitters to calculate these, but we know how to.
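One possible way to compute such a global figure is sketched below: take the relative error in ppm for each identified ion, then summarize the errors with a median and a standard deviation. The choice of summary statistics, the function names, and the example m/z pairs are assumptions for illustration, not a method prescribed in this thread.

```python
import statistics


def ppm_error(observed_mz, theoretical_mz):
    """Relative mass measurement error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def summarize_ppm(pairs):
    """Return (median, sample standard deviation) of the ppm errors.

    pairs is an iterable of (observed m/z, theoretical m/z) tuples and
    must contain at least two entries for the standard deviation.
    """
    errors = [ppm_error(observed, theoretical) for observed, theoretical in pairs]
    return statistics.median(errors), statistics.stdev(errors)


# Made-up (observed, theoretical) m/z pairs, for illustration only.
example_pairs = [(500.0005, 500.0), (1000.002, 1000.0), (750.0, 750.0)]
median_ppm, spread_ppm = summarize_ppm(example_pairs)
```

The median reflects any systematic calibration offset, and the spread is the kind of number that would inform a search tolerance.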

Resolving power and mass measurement uncertainty are properties of (metadata if you will) the MS data, the mass measurement error tolerance is metadata for the peptide IDs, right?

I agree on the other points as long as we are clear what is metadata for the MS data (or inherent properties thereof) and what is search/ID metadata.

ypriverol commented 4 years ago

Hi @StSchulze:

Some ideas..

I'm not exactly sure if the aim is to reflect the 1) sample preparation + MS measurement in order to have all information needed for reanalysis or 2) search parameters of the original analysis in order to reproduce the search results (or reanalyze/modify/...)

The aim is to recreate the experimental design as fully as possible, so that the experiment can be understood and reanalyzed, not to reproduce the original analysis. Reproducing the original analysis is a more complex problem because you need software-specific parameters, etc. For example, in this submission the authors used the precursor tolerance estimation from MaxQuant, which is really specific to that software. However, adding the "common" precursor tolerance for the instrument can enable other software to reanalyze the data.

In case of 1) I'm missing the following:

  • MS1 and MS2 resolution. I think this would be more useful than precursor/fragment mass tolerance, since different users might set them quite differently for the same resolution. It would also get around the MaxQuant problem.

I like the idea of adding MS1 and MS2 resolution, but is this parameter used by most search engines? (I'm asking because you know the search engines better than I do.) Most search engines use precursor and fragment tolerances.

  • scan range
  • rejected/accepted charge states

Agree about the charge states. I will try to include them as "Recommended". Is the scan range used?

  • defining the chemical treatment rather than giving a searched modification might be more suitable

What is a chemical treatment? How do you think it can be represented using CV terms from OLS and the current key-value pair structure?

In case of 2) I think there might be even more, but some fairly important ones:

  • used database
  • charges
  • peptide length
  • searched ions (x/y/z, a/b/c, etc)
  • used search engine(s)

I will add all these features or properties as "Recommended", but from my point of view (and you know more than me), most search engines have default options. Can we get good results with them?

BTW, thanks @StSchulze for your feedback. My aim is that we can pull this file with your tool, reanalyze a dataset, and push the results back to PRIDE in a semi-automatic way.

ypriverol commented 4 years ago

Hi @magnuspalmblad, first of all thanks for the feedback.

I agree resolving power (a.k.a. "resolution") is great to have, but it is m/z-dependent (especially in FTMS) and different vendors use different definitions. One well-known manufacturer even changes theirs depending on what their marketing people want (sometimes m/z 400, sometimes 200...). The definition must be specified for the number to be meaningful.

I agree with you, and that is the main reason why we should standardize the way we add resolutions, taking into account how search engines represent them. I want to provide a parameter there that is useful for search engines and reanalysis tools.

Mass measurement uncertainty is limited by resolving power, but also depends on calibration and other factors. I think the most useful simple metric is the global mass measurement uncertainty for MS1 and MS2, as these inform the mass measurement error tolerance set in the peptide identification. As with resolving power, it is not always easy for the data submitters to calculate these, but we know how to.

Agree.

Resolving power and mass measurement uncertainty are properties of (metadata if you will) the MS data, the mass measurement error tolerance is metadata for the peptide IDs, right?

The current parameters reflect what the users have added to the Protocol section of their paper. I know the final parameter depends more on the ID results, but this is never reported and is difficult to get without going into the ID files.

I agree on the other points as long as we are clear what is metadata for the MS data (or inherent properties thereof) and what is search/ID metadata.

Agree

trishorts commented 4 years ago

Also would be good to know precursor tolerance.

This is a really good point. However, the authors of the paper used the MaxQuant precursor tolerance algorithm: "Mass accuracy of the precursor ions was determined by the time-dependent recalibration algorithm of Maxquant". We need to decide what to put here: a default precursor tolerance for the instrument, or leave it empty?

Happy to get either tolerance or resolution (as one other commenter stated). I'm not as worried about whether it's defined at m/z 200 or 400, but I want to know the ballpark. MS1 tolerance is important because we re-deconvolute and look for co-isolated species. MS2 tolerance matters because we search differently for high- and low-res data.

What about sample treatments?

What kinds of sample treatments does your tool need to know about in order to process the RAW files?

Labelling is the biggest (TMT, iTRAQ, DiLeu, etc.). But it also helps to know if the sample was reduced and alkylated. If alkylated, with what?

What about enrichment? Are these whole-cell lysates or phosphopeptide enrichment?

If the sample is not enriched, then we will not put any term about that. Or do you think we need to add the term whole-cell lysates explicitly?

Adding a term only if there was enrichment seems good.

do you have a way to specify top-down/bottom-up?

Should we add this term as well?

We should allow someone to specify top-down for sure. There could be no term when bottom-up is assumed.
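The two conventions suggested here (an enrichment term only when enrichment was done, and no term when bottom-up is assumed) could be handled by a reader roughly as follows. This is a sketch only: the keys `enrichment` and `approach` are hypothetical column names, not terms agreed in this thread.

```python
def interpret_row(row):
    """Apply 'absent term means default' rules to one row of the design file.

    The keys 'enrichment' and 'approach' are hypothetical column names.
    """
    enrichment = (row.get("enrichment") or "").strip()
    approach = (row.get("approach") or "").strip()
    return {
        # No enrichment term means the sample was not enriched.
        "enriched": bool(enrichment),
        "enrichment": enrichment or None,
        # No approach term means bottom-up is assumed.
        "approach": approach or "bottom-up",
    }


plain = interpret_row({"data file": "run_01.raw"})
phospho = interpret_row(
    {"data file": "run_02.raw", "enrichment": "phosphopeptide", "approach": "top-down"}
)
```

The trade-off of implicit defaults is that a missing value cannot be told apart from "not annotated", which is why the question of adding whole-cell lysate explicitly came up.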

Other MS data like orbitrap v. iontrap v. tof. Ion mobility is getting good. do we know that? What about DIA and DDA.

Do we need to define the minimum set of terms needed to reprocess/reanalyze the data?

Knowing if the data is DIA/DDA/ion-mobility certainly matters. You can skip the instrument info if tolerances or resolutions are supplied.

Maybe this is all in another file. How do I know?

I don't want to add another file alongside this one, because it would be complex for the tool to have to read more than one file. I like the idea of having one big tab-delimited file with "as much as possible" metadata, enough to represent the experimental design.

trishorts commented 4 years ago

  • used database

This is good, though it's very hard to capture. It would look something like "UniProt Homo sapiens Reviewed Canonical, downloaded 10/18/2019".