HUPO-PSI / mzTab

mzTab Reporting MS-based Proteomics and Metabolomics Results
https://hupo-psi.github.io/mzTab

Assays of fractionated data needs rework. #26

Open timosachsenberg opened 7 years ago

timosachsenberg commented 7 years ago

One assay is reported for all fractions. This makes it impossible to model a fractionated design with channel swaps between fractions (though I don't know how relevant this is). Example: consider Assay 1 (A1). It is bound to one channel (e.g., 114 of an iTRAQ experiment):

A1 iTRAQ reagent 114

At the same time, it is bound to 3 ms_runs (one for each fraction):

A1 F1,F2,F3

ypriverol commented 7 years ago

@timosachsenberg each fraction can be associated with one or more ms_runs. This means that the basic unit of information is the ms_run. Let's say you have one assay, A1, associated with 3 fractions, as you said. Then you should have three different ms_runs, and in the metadata you can associate assay A1 with the 3 ms_runs like:

MTD assay[1]-ms_run_ref ms_run[1]

We should probably review the cardinality of this relation to see whether more than one ms_run can be defined. I will check that in the specification.

timosachsenberg commented 7 years ago

Yes, this association is possible. We might need to revisit, e.g., the cardinality of the quantification_reagent of an assay. Maybe just adding one quantification_reagent for every fraction is all it needs.

MTD assay[1]-ms_run_ref ms_run[1],ms_run[2],ms_run[3]
MTD assay[1]-quantification_reagent [MS,MS:XXX, iTRAQ reagent 114, ]|[MS,MS:XXX, iTRAQ reagent 114, ]|[MS,MS:XXX, iTRAQ reagent 114, ]
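For illustration only, here is a minimal Python sketch of how a reader might split such list-valued assay fields. The function name `parse_assay_line` is hypothetical, and the separators (comma for ms_run_ref, pipe for quantification_reagent) are taken from the two example lines in this comment, not from a released specification.

```python
import re

def parse_assay_line(line):
    """Split one 'MTD assay[n]-key value' line into (assay index, key, list of values)."""
    match = re.match(r"MTD\s+assay\[(\d+)\]-(\w+)\s+(.*)", line)
    if match is None:
        raise ValueError(f"not an assay metadata line: {line!r}")
    index, key, value = int(match.group(1)), match.group(2), match.group(3)
    # Assumption: ms_run_ref is comma-separated, while params are pipe-separated
    # (commas already occur inside a [..] param, so they cannot separate params).
    separator = "," if key == "ms_run_ref" else "|"
    return index, key, [item.strip() for item in value.split(separator)]

runs = parse_assay_line("MTD assay[1]-ms_run_ref ms_run[1],ms_run[2],ms_run[3]")
reagents = parse_assay_line(
    "MTD assay[1]-quantification_reagent "
    "[MS,MS:XXX, iTRAQ reagent 114, ]|[MS,MS:XXX, iTRAQ reagent 114, ]|[MS,MS:XXX, iTRAQ reagent 114, ]"
)

# Pair each referenced ms_run with the reagent at the same list position.
pairs = list(zip(runs[2], reagents[2]))
```

Pairing by position is what would let a channel differ between fractions, should a protocol ever swap channels.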

timosachsenberg commented 7 years ago

Additionally, we may need to change how PEP rows are reported. Right now they have quantitative values for assays or study variables. The feature-type information they report needs to be associated with the fraction (or ms_run) of an assay. One possibility is to introduce an additional column that contains the fraction or ms_run identifier, for example:

PEH sequence ... cv_opt_global_fraction
PEP   VLPASLAANIPVK   sp|P32353|ERG3_YEAST    ...   10808500224     null    null     13553300480     null    null    1

timosachsenberg commented 7 years ago

Just brainstorming... so feel free to ignore... If we want to support different dimensions of fractionation it would probably be better to add a column referencing the ms_run(s). In this case it could look like:

PEH sequence ... cv_opt_global_ms_runs
PEP   VLPASLAANIPVK     ...    1,8,15,23

and as a peptide might be quantified in different fractions, it's probably easiest to duplicate rows.

PEH sequence ... cv_opt_global_ms_runs
PEP   VLPASLAANIPVK     ...    1,8,15,23
PEP   VLPASLAANIPVK     ...    2,9,16,24
...
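As a rough sketch of what a consumer of such duplicated rows could do, the following Python snippet groups the ms_run lists by peptide sequence. The rows are simplified to two fields (sequence and the proposed cv_opt_global_ms_runs value), and the helper name `runs_by_sequence` is made up for this example.

```python
from collections import defaultdict

# Simplified rows: (sequence, value of the proposed cv_opt_global_ms_runs column).
rows = [
    ("VLPASLAANIPVK", "1,8,15,23"),
    ("VLPASLAANIPVK", "2,9,16,24"),
]

def runs_by_sequence(rows):
    """Collect, per sequence, one list of ms_run indices for every duplicated row."""
    grouped = defaultdict(list)
    for sequence, ms_runs in rows:
        grouped[sequence].append([int(run) for run in ms_runs.split(",")])
    return dict(grouped)

features = runs_by_sequence(rows)
```

Each inner list then corresponds to one linked feature, i.e. one fraction observed across several runs.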

Sidenote: Currently, rows are (already) duplicated if a sequence maps to multiple protein accessions. As brought up by @andrewrobertjones, it might be better to use a single row. For instance, the same semantics as in the PRT section could be used, where we have:

PRH  accession  …   ambiguity_members  …

andrewrobertjones commented 7 years ago

Going to merge two issues; this is really the same issue as #28.

I've started some sketching in here: https://github.com/HUPO-PSI/mzTab/blob/master/specification_document/1_1_draft_specs/Version11_design_considerations.pptx.

Will add to it.

Closing #28 now

andrewrobertjones commented 7 years ago

PEH sequence ... cv_opt_global_ms_runs
PEP VLPASLAANIPVK ... 1,8,15,23
PEP VLPASLAANIPVK ... 2,9,16,24
...

This model mentioned above looks sensible to me. This would mean changing the meaning of ms_run (at present it can mean a group of runs) to a single run. Generally this is okay, and would then allow exporters to be explicit about which raw files a given peptide (or SMF for metabolomics) had been observed in, without relying on assay to make the link.

However, this would mean a breaking change for proteomics-ID files, since there is num_psms_ms_run[1-n] in the protein table, which I think is really meant to be per assay.

andrewrobertjones commented 7 years ago

One aside around duplication or not of peptides per row, do any workflows merge peptide quants across different fractions and report those (as assay_quant values). This would be done for proteins, but doubt use at the peptide-level?

timosachsenberg commented 7 years ago

"One aside around duplication or not of peptides per row, do any workflows merge peptide quants across different fractions and report those (as assay_quant values). This would be done for proteins, but doubt use at the peptide-level?"

I think it might be useful in some cases (e.g., if only peptides/ligands are quantified). The problem is, imho, that the PEP section would act as a representation of linked features, but also as a quantitative summary of peptides. I think separating these concepts might help. A sensible solution for peptide-only quantification would be to keep using the PEP section for the feature-like representation. The PRT section would still be used to report the final summarized quantities over assays (in this case, peptide quantities).

andrewrobertjones commented 7 years ago

Following on from Timo's suggestion, I wonder about the following:

PEH sequence ... opt_ms_run[1]_cv_MS:100XXXX_fraction_IDs opt_ms_run[2]_cv_MS:100XXXX_fraction_IDs

PEP VLPASLAANIPVK ... 1,8,15,23 1,7,16,24
PEP VLPASLAANIPVK ... 2,9,16,24 2,9,24

This could be supported via this route as a backwards compatible update as to how to encode fractions in practice without a schema change, perhaps with the following amendment, so that it is easy to locate the mapping between peptides and fractions

MTD ms_run[1-n]-fraction[1-n]-location LOCATIONOFRAWFILE.raw

The metadata addition is not backwards compatible, but extra metadata can be added without really breaking anything, since no readers or writers can handle fractionated data in mzTab 1.1.

timosachsenberg commented 7 years ago

I don't know if I understood the opt_ms_run[1]_cv_MS:100XXXX_fraction_IDs opt_ms_run[2]_cv_MS:100XXXX_fraction_IDs part. In my example, you have only one column for all corresponding fractions between runs (because these can be easily linked).

andrewrobertjones commented 7 years ago

Yes on balance my suggestion does seem over-complicated. I was thinking there could be cases where the data had not been aligned with missing features etc, but in that case it could still be linked via your mechanism. I would favour this kind of encoding for the fractions:

MTD ms_run[1-n]-fraction[1-n]-location LOCATIONOFRAWFILE.raw

Then using this kind of encoding at the peptide level:

PEH sequence ... cv_opt_global_ms_runs
PEP VLPASLAANIPVK ... 1.6; 2.6; 3.6
PEP VLPASLAANIPVK ... 1.7; 2.7; 3.7
...

So that fractions always stay obviously linked to ms_runs.
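A small Python sketch of how such a cell might be decoded, assuming each entry reads as ms_run.fraction and entries are separated by semicolons. This reading of the notation is an assumption based on the example values, and `parse_run_fraction_refs` is a hypothetical helper name.

```python
def parse_run_fraction_refs(cell):
    """Turn a cell like '1.6; 2.6; 3.6' into a list of (ms_run, fraction) tuples."""
    refs = []
    for token in cell.split(";"):
        token = token.strip()
        if not token:
            continue  # tolerate a trailing separator
        ms_run, fraction = token.split(".")
        refs.append((int(ms_run), int(fraction)))
    return refs

refs = parse_run_fraction_refs("1.6; 2.6; 3.6")
# refs == [(1, 6), (2, 6), (3, 6)]
```

Keeping the ms_run index inside every entry is what keeps the fraction tied to its run, at the cost of a slightly more verbose cell.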

timosachsenberg commented 7 years ago

For the meta-data section, we probably need to discuss if/how we adapt:

ms_run[1-n]-format
ms_run[1-n]-id_format
ms_run[1-n]-fragmentation_method
ms_run[1-n]-hash
ms_run[1-n]-hash_method

The hash value is inherently file-specific, so we would need ms_run[1-n]-fraction[1-m]-hash. For the other meta values we need to decide if we want to specify this on a fraction level.

We additionally need to consider if/how to deal with:

assay[1-n]-ms_run_ref
assay[1-n]-quantification_reagent

This works only if quantification reagents (e.g., iTRAQ/TMT/SILAC "channels") don't change between fractions. But as I don't know of protocols that perform channel swaps between fractions we probably can keep it as is.

timosachsenberg commented 7 years ago

Things we need to consider in the protein section:

Right now we report:

num_psms_ms_run[1-n]
num_peptides_distinct_ms_run[1-n]
num_peptides_unique_ms_run[1-n]
search_engine_score[1-n]_ms_run[1-n]

With, for example, 25 fractions, we would get 100 additional columns per ms_run (!), i.e., 4 column types × 25 fractions when counting a single search engine score, so I think it doesn't make sense to report these on the fraction level. In that line of thought, @andrewrobertjones also mentioned that these ("spectral count") columns might not be reported at all, and we might discuss whether they can be transformed into optional columns in future schema updates.
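To make the column count concrete, here is a short Python sketch that enumerates what per-fraction variants of these columns could look like. The `_fraction[...]` suffix is purely hypothetical naming for this illustration, and a single search engine score is assumed by default.

```python
def fraction_level_columns(ms_run, n_fractions, n_scores=1):
    """List hypothetical per-fraction variants of the per-ms_run protein columns."""
    columns = []
    for fraction in range(1, n_fractions + 1):
        # Hypothetical suffix; no such column naming exists in the specification.
        suffix = f"ms_run[{ms_run}]_fraction[{fraction}]"
        columns.append(f"num_psms_{suffix}")
        columns.append(f"num_peptides_distinct_{suffix}")
        columns.append(f"num_peptides_unique_{suffix}")
        for score in range(1, n_scores + 1):
            columns.append(f"search_engine_score[{score}]_{suffix}")
    return columns

columns = fraction_level_columns(ms_run=1, n_fractions=25)
print(len(columns))  # 25 fractions x 4 column types = 100
```

With more than one search engine score the count grows further, since every score adds one more column per fraction.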

IMHO it is sufficient to have:

best_search_engine_score[1-n]
MTD protein_search_engine_score[1] [MS,MS:XXX,Protein inference score,]

and to consider shortening these to score[1-n] and MTD protein_score if we decide on a breaking change in the future.


Andy just adding a note to this one, so thread doesn't become overly confused. For version 1.1, we should definitely think about what data to report about ms_runs etc. The usage is confused across ID versus Quant files, since really spectral counts are a type of quant data, which should/could be mapped to assays.

timosachsenberg commented 7 years ago

Just making notes here because it came up when I went through the spec. Make the columns

database
database_version

optional and add meta values:

protein_database[1-n]-name
protein_database[1-n]-version
protein_database[1-n]-URI
smallmolecule_database[1-n]-name
smallmolecule_database[1-n]-version
smallmolecule_database[1-n]-URI

and resolve which DB a row refers to by the accession.

I know this depends on whether we want to reduce the number of columns. I just wanted to mention it, as it was discussed briefly at the PSI meeting.

andrewrobertjones commented 7 years ago

This issue was only partially solved in the Aug 2017 workshop for metabolomics. In brief, plan is for assay to reference n ms_runs for fractionated data, rather than having a nesting scheme. The assay tells you that these all belong in a group together (Jul 13 proposal from Timo).

However, still needs a solution for showing which fraction a given molecule was observed in e.g. through use of an optional column. I will add a note to the spec doc on this point.

andrewrobertjones commented 7 years ago

I added a note to the adoc version of the mzTab-M specs indicating how I think this should be done. I guess the same mechanism could be adopted for proteomics, but it would be a breaking change due to the issue mentioned above about having num_psms_ms_run - which is meant to summarise across fractions rather than be per fraction.

timosachsenberg commented 6 years ago

Here is a brief summary of what I think was the current consensus and the changes needed in the metadata section. Please correct me if I am wrong, as I have been out of the loop for some time. I just want to get these things right for the PSI meeting and MzTab-1.1M.

We deviate from 1.0 and associate one ms_run with the measurement in the MS after fractionation (and which produces typically one raw/mzML file).

ms_run[1]-location file:///file1.mzML   
ms_run[1]-fraction 1       # additional meta value needed to properly group fractions
ms_run[2]-location file:///file2.mzML
ms_run[2]-fraction 2

This makes a lot of sense to me and I saw that you changed the part "MS run – An MS run is effectively one run (or set of runs on pre-fractionated samples) on an MS instrument, and is referenced from assay in different contexts. " in 1.1-M so we get a one-to-one correspondence between file and run.
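As a sketch of how the proposed ms_run[n]-fraction value could be used to group files, the following Python snippet reads the four example lines from this comment and maps each fraction to its file. The function name `parse_ms_runs` is hypothetical, and the optional MTD prefix is handled only because other examples in this thread include it.

```python
import re

metadata = """\
ms_run[1]-location file:///file1.mzML
ms_run[1]-fraction 1
ms_run[2]-location file:///file2.mzML
ms_run[2]-fraction 2
"""

def parse_ms_runs(text):
    """Collect the key/value pairs of every ms_run[n]-key line, indexed by n."""
    runs = {}
    for line in text.splitlines():
        match = re.match(r"(?:MTD\s+)?ms_run\[(\d+)\]-(\w+)\s+(\S+)", line)
        if match is None:
            continue  # skip anything that is not an ms_run metadata line
        index, key, value = int(match.group(1)), match.group(2), match.group(3)
        runs.setdefault(index, {})[key] = value
    return runs

runs = parse_ms_runs(metadata)
# With one file per run, each fraction maps to exactly one location.
fraction_to_file = {int(run["fraction"]): run["location"] for run in runs.values()}
```

If several runs shared the same fraction number (e.g., across replicates), the dictionary would need to hold a list per fraction instead of a single location.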

Extending ms_run_ref and quantification_reagent to accept lists of references / Params is necessary:

MTD assay[1]-ms_run_ref ms_run[1],ms_run[2],ms_run[3]
MTD assay[1]-quantification_reagent [MS,MS:XXX, iTRAQ reagent 114, ]|[MS,MS:XXX, iTRAQ reagent 114, ]|[MS,MS:XXX, iTRAQ reagent 114, ]

This now clearly specifies which individual channels (given by quantification_reagent) from which files (and, thus, fractions) have been grouped to yield a single value.