Sage-Bionetworks / sysbioDCCjsonschemas

SysBio DCC JSON schemas

Update Proteomics template #136

Open pitviper6 opened 2 years ago

pitviper6 commented 2 years ago

Catherine Kaczorowski is going to be uploading mass spec proteomics:

Level 0 ... Raw data
Level 1 ... Skyline documents
Level 2 ... CSV of raw peptide intensities from Skyline
Level 3A ... CSV of Normalized and batch corrected peptide intensities
Level 3B ... CSV of Normalized and batch corrected protein group intensities

Per Michael MacCoss, the proteomics lab PI:

This is data-independent acquisition (DIA). TMT has 10-16 samples in each run ... but they then have ~10 runs per "plex". So it makes the metadata linking challenging. Each TMT "plex" is a batch.

We have 1 sample = 1 run. The samples are prepared in batches of 16 samples with 14 actual samples and 2 controls. However each run has one sample.
Here are some details of how our data is collected. https://pubmed.ncbi.nlm.nih.gov/32312845/ ... it is probably too much detail but just in case you are interested. A major promise of DIA data is that someone could go back to the RAW data and find something novel after the fact. Here is an example of someone finding something very novel in AD from a human dataset of ours, something that we had never considered. https://pubmed.ncbi.nlm.nih.gov/34818016/

A strength of TMT is that many samples are run together. A major challenge of TMT is that the same peptides are rarely sampled between batches. https://www.mcponline.org/article/S1535-9476(20)31525-5/fulltext So it becomes a major challenge to report peptide level data using TMT.

So we do want to try to capture some of the batch information, as it would be useful for someone who wants to perform their own batch correction.

Some info about Keys:

And platform will be Orbitrap Fusion Lumos.

For the Control Type ... we have some internal and external controls. We add several peptides and a protein to each sample that we use as part of the QC process. So those will be part of the data matrix.

We also have 1-2 control samples in each batch. So that means the control samples are prepared and run many times. I think for sheet 1 we should add a column for the batch ID.
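A minimal sketch of what a batch ID column could look like on the per-sample sheet. The column names (`specimenID`, `batchID`, `isControl`) and the identifiers are hypothetical, not keys from the existing templates. It builds one preparation batch of 16 with 14 study samples and 2 controls, one run per sample, as described above.

```python
from collections import Counter

def build_batch_rows(batch_id, sample_ids, control_ids):
    """Return one metadata row per run; each run holds exactly one sample."""
    rows = []
    for sid in sample_ids:
        rows.append({"specimenID": sid, "batchID": batch_id, "isControl": False})
    for cid in control_ids:
        rows.append({"specimenID": cid, "batchID": batch_id, "isControl": True})
    return rows

# Hypothetical identifiers: 14 study samples plus 2 controls in one batch.
samples = [f"S{i:02d}" for i in range(1, 15)]
controls = ["CTRL_A", "CTRL_B"]
rows = build_batch_rows("batch01", samples, controls)

# Count study samples vs controls to confirm the batch layout.
counts = Counter(row["isControl"] for row in rows)
print(len(rows), counts[False], counts[True])
```

Because the controls are re-prepared in every batch, keeping `batchID` on each row lets a downstream user group runs by batch for their own correction.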

I'm also a bit confused by the FDR. We probably need a more specific definition of what the FDR threshold is testing. We use many different thresholds (generally done on the q-values) for peptide detection and can also do it for significance testing. We don't do a protein group level q-value ... happy to explain why.
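One way to make the FDR key less ambiguous would be to record what the threshold applies to alongside its value. The sketch below is only an illustration: the key names (`level`, `statistic`, `threshold`), the 0.01 cutoff, and the peptide records are all assumptions, not values from the lab or the templates.

```python
# Hypothetical record describing one threshold; the key names are assumptions,
# not keys from the existing templates.
fdr_spec = {"level": "peptide", "statistic": "q-value", "threshold": 0.01}

def passes_threshold(q_value, spec):
    """True when a q-value meets the threshold described by spec."""
    return q_value <= spec["threshold"]

# Made-up peptide detections with q-values, for illustration only.
peptides = [
    {"peptide": "AAAK", "q_value": 0.002},
    {"peptide": "CCCK", "q_value": 0.03},
]
detected = [p["peptide"] for p in peptides if passes_threshold(p["q_value"], fdr_spec)]
print(detected)
```

Recording the level and statistic with the number means a peptide-detection threshold cannot be mistaken for a significance-testing one.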

Other info

We just need to make sure that we have the identifiers for the proteomics data and the animals/biospecimens well linked. Any data we might have is something we got from Catherine. It is the best way to minimize errors.
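The linkage check described above could be sketched as a simple set comparison between the sample identifiers in the proteomics files and those in the biospecimen metadata. The function name and the example identifiers are hypothetical.

```python
def check_id_linkage(proteomics_ids, biospecimen_ids):
    """Report identifiers that appear on one side of the link but not the other."""
    prot = set(proteomics_ids)
    bio = set(biospecimen_ids)
    return {
        "missing_from_biospecimen": sorted(prot - bio),
        "without_proteomics": sorted(bio - prot),
    }

# Made-up identifiers for illustration.
report = check_id_linkage(["S01", "S02", "S03"], ["S01", "S02", "S04"])
print(report)
```

An empty list on both sides would mean every proteomics sample maps to a known biospecimen and vice versa.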

pitviper6 commented 2 years ago

Here is an example of a Level 3A file from a different project. Column headers will be the sample identifiers. Rows are the peptides and the protein(s) they are present in. image(4)
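A rough sketch of reading a file with that layout, using only the standard library. The attached image is not reproduced here, so the leading column names (`Peptide`, `Protein`), the `;` separator for multiple proteins, and the example values are assumptions.

```python
import csv
import io

# Made-up stand-in for a Level 3A file: two identifier columns, then one
# column per sample identifier.
EXAMPLE = """Peptide,Protein,S01,S02
AAAK,P1,10.5,11.0
CCCK,P1;P2,8.2,
"""

def read_level3a(handle, id_columns=("Peptide", "Protein")):
    """Return (sample_ids, rows) from a peptide-by-sample intensity table."""
    reader = csv.DictReader(handle)
    # Every column that is not an identifier column is treated as a sample.
    sample_ids = [c for c in reader.fieldnames if c not in id_columns]
    rows = []
    for rec in reader:
        # Empty cells are kept as None rather than coerced to zero.
        intensities = {s: float(rec[s]) if rec[s] else None for s in sample_ids}
        rows.append({
            "peptide": rec[id_columns[0]],
            "proteins": rec[id_columns[1]].split(";"),
            "intensities": intensities,
        })
    return sample_ids, rows

sample_ids, rows = read_level3a(io.StringIO(EXAMPLE))
print(sample_ids, rows[1]["proteins"])
```

The sample identifiers recovered from the header are the values that would need to match the biospecimen metadata.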

pitviper6 commented 2 years ago

Level 3B is similar but we also group peptides together into protein groups. image(5)

pitviper6 commented 2 years ago

We do this using a bipartite graph. Some groups assign each peptide to only one protein group using what is known as a razor peptide approach. We feel like the razor peptide method is an oversimplification. So you can see that peptides are often assigned to multiple protein groups. image(6)
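To make the idea concrete, here is a simplified sketch of grouping on a peptide-protein bipartite graph. It is not the lab's actual algorithm: it assumes proteins with identical peptide evidence collapse into one group, and it keeps a shared peptide in every group that contains it instead of picking a single "razor" owner. All names and data are made up.

```python
from collections import defaultdict

def protein_groups(peptide_to_proteins):
    """Collapse proteins with identical peptide sets into one group."""
    # Invert the bipartite edges: protein -> set of peptides.
    protein_to_peptides = defaultdict(set)
    for pep, prots in peptide_to_proteins.items():
        for prot in prots:
            protein_to_peptides[prot].add(pep)
    # Proteins with indistinguishable peptide evidence share a group.
    by_evidence = defaultdict(list)
    for prot, peps in protein_to_peptides.items():
        by_evidence[frozenset(peps)].append(prot)
    return {tuple(sorted(prots)): set(peps) for peps, prots in by_evidence.items()}

def peptide_assignments(groups):
    """Map each peptide to every protein group that contains it."""
    assigned = defaultdict(list)
    for group, peps in groups.items():
        for pep in peps:
            assigned[pep].append(group)
    return {pep: sorted(gs) for pep, gs in assigned.items()}

# Made-up edges: pepD is shared between two distinguishable groups.
edges = {
    "pepA": ["P1", "P2"],
    "pepB": ["P1", "P2"],
    "pepC": ["P3"],
    "pepD": ["P1", "P2", "P3"],
}
groups = protein_groups(edges)
assignments = peptide_assignments(groups)
print(assignments["pepD"])
```

A razor-style method would instead assign `pepD` to only one of the two groups, which is the simplification the comment above objects to.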

avanlinden commented 2 years ago

@pitviper6 My takeaway from the above is:

  1. We need a new assay value for DIA proteomics or something similar.
  2. We either need a new assay metadata template, OR we need to combine/rename the proteomics and TMT quantitation templates into one big "mass spec proteomics" template with potential keys for all the different flavors.
  3. They need a batchID and possibly more FDR/thresholding values -- they could potentially provide these ad hoc without us adding them to the dictionary.

Does that sound right? The methods description stuff they sent (including the bipartite graph figure) will be very helpful to include in the unstructured metadata.

pitviper6 commented 2 years ago

@avanlinden I think having one template is better - it lessens confusion and enforces consistency, although there's a line between 'better' and 'unwieldy'. When I have a bit of time I'll look at the two templates and see how many keys they have.

avanlinden commented 2 years ago

@pitviper6 I agree!

I added "DIA" as an assay value as part of this PR: https://github.com/Sage-Bionetworks/synapseAnnotations/pull/916

pitviper6 commented 2 years ago

Combined the proteomics and TMT quantitation templates to create a generalized proteomics template. Will create an RFC for this template and have Catherine's group comment.

GeneralProteomicsTemplate.xlsx