Open pitviper6 opened 2 years ago
Here is an example of a Level 3A file from a different project. Column headers will be the sample identifiers. Rows are the peptides and the protein(s) they are present in.
Level 3B is similar but we also group peptides together into protein groups.
We do this using a bipartite graph. Some groups assign each peptide to only one protein group using what is known as a razor peptide approach. We feel the razor peptide method is an oversimplification, so you can see that in our files peptides are often assigned to multiple protein groups.
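To make the difference from a razor-style assignment concrete, here is a minimal sketch of grouping on a peptide-protein bipartite graph. It is not the lab's actual algorithm: it assumes proteins supported by exactly the same peptides are merged into one group, and the function names, peptide sequences, and protein accessions are all hypothetical.

```python
from collections import defaultdict

def protein_groups(peptide_to_proteins):
    """Merge proteins that are supported by exactly the same peptides (an assumed rule)."""
    # Invert the bipartite edges: protein -> set of peptides.
    protein_to_peptides = defaultdict(set)
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides[protein].add(peptide)

    # Proteins with identical peptide sets cannot be told apart, so group them.
    by_peptide_set = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        by_peptide_set[frozenset(peptides)].append(protein)

    # Each group is a sorted tuple of proteins mapped to its peptides.
    return {tuple(sorted(proteins)): set(peptides)
            for peptides, proteins in by_peptide_set.items()}

def peptide_memberships(groups):
    """Map each peptide to every group it supports; no razor-style tie-breaking."""
    memberships = defaultdict(list)
    for group, peptides in groups.items():
        for peptide in peptides:
            memberships[peptide].append(group)
    return {peptide: sorted(found) for peptide, found in memberships.items()}

# Hypothetical input: peptide -> proteins it maps to.
example = {
    "PEPTIDEA": ["P1", "P2"],
    "PEPTIDEB": ["P1", "P2", "P3"],
    "PEPTIDEC": ["P3"],
}
groups = protein_groups(example)
memberships = peptide_memberships(groups)
```

In this toy input, P1 and P2 collapse into one group, and PEPTIDEB stays attached to both that group and the P3 group instead of being forced into one of them.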
@pitviper6 My takeaway from the above is:
DIA proteomics
or something similar.
batchID and possibly more FDR/thresholding values -- they could potentially provide these ad hoc without us adding them to the dictionary.
Does that sound right? The methods description stuff they sent (including the bipartite graph figure) will be very helpful to include in the unstructured metadata.
@avanlinden I think having one template is better - it lessens confusion and enforces consistency, although there's a line between 'better' and 'unwieldy'. When I have a bit of time I'll look at the two templates and see how many keys they have.
@pitviper6 I agree!
I added "DIA" as an assay value as part of this PR: https://github.com/Sage-Bionetworks/synapseAnnotations/pull/916
Combined the proteomics and TMT quantitation templates to create a generalized proteomics template. Will create an RFC for this template and have Catherine's group comment.
Catherine Kaczorowski is going to be uploading mass spec proteomics:
Per Michael MacCoss, the proteomics lab PI:
This is data independent acquisition (DIA). TMT has 10-16 samples in each run ... but they then have ~10 runs per "plex". So it makes the metadata linking challenging. Each TMT "plex" is a batch.
We have 1 sample = 1 run. The samples are prepared in batches of 16 samples with 14 actual samples and 2 controls. However, each run has one sample.
Here are some details of how our data is collected. https://pubmed.ncbi.nlm.nih.gov/32312845/ ... it is probably too much detail but just in case you are interested. A major promise of DIA data is that someone could go back to the RAW data and find something novel after the fact. Here is an example of someone finding something very novel in AD from a human dataset we had that we never considered. https://pubmed.ncbi.nlm.nih.gov/34818016/
A strength of TMT is that many samples are run together. A major challenge of TMT is that the same peptides are rarely sampled between batches. https://www.mcponline.org/article/S1535-9476(20)31525-5/fulltext So it becomes a major challenge to report peptide level data using TMT.
So we do want to try to capture some of the batch information, as it would be useful for someone who wants to perform their own batch correction.
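As an illustration of why the batch label matters downstream, here is a minimal sketch of one simple correction a data user might try: median-centering each batch. This is only an assumed example, not the method used by the lab, and it assumes the values are already log-transformed abundances for a single peptide. All names and numbers are hypothetical.

```python
from collections import defaultdict
from statistics import median

def median_center_by_batch(values, batch_of):
    """Subtract each batch's median from the values measured in that batch."""
    # Collect the values belonging to each batch.
    by_batch = defaultdict(list)
    for sample, value in values.items():
        by_batch[batch_of[sample]].append(value)

    # One median per batch.
    batch_median = {batch: median(vals) for batch, vals in by_batch.items()}

    # Center every sample on its own batch median.
    return {sample: value - batch_median[batch_of[sample]]
            for sample, value in values.items()}

# Hypothetical log-scale abundances for one peptide across four runs.
values = {"S1": 10.0, "S2": 12.0, "S3": 20.0, "S4": 22.0}
batch_of = {"S1": "B1", "S2": "B1", "S3": "B2", "S4": "B2"}
centered = median_center_by_batch(values, batch_of)
```

Without a batch identifier in the metadata, the `batch_of` mapping cannot be built, and no correction of this kind is possible.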
Some info about Keys:
And platform will be Orbitrap Fusion Lumos.
For the Control Type ... we have some internal and external controls. We add several peptides and a protein to each sample that we use as part of the QC process. So those will be part of the data matrix.
We also have 1-2 control samples in each batch. That means the control samples are prepared and run many times. I think for sheet 1 we should add a column for the batch ID.
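A batch ID column could look something like the sketch below, which also shows how the same control specimen ends up in several batches. The column names (`specimenID`, `batchID`, `isControl`) and the values are assumptions for illustration, not agreed template keys.

```python
from collections import Counter

def summarize_batches(rows):
    """Count total runs and control runs per batch."""
    runs = Counter()
    controls = Counter()
    for row in rows:
        runs[row["batchID"]] += 1
        if row["isControl"]:
            controls[row["batchID"]] += 1
    # Counter returns 0 for batches that have no controls.
    return {batch: {"runs": runs[batch], "controls": controls[batch]}
            for batch in runs}

# Hypothetical metadata rows: one row per run, because 1 sample = 1 run.
rows = [
    {"specimenID": "S01", "batchID": "B1", "isControl": False},
    {"specimenID": "S02", "batchID": "B1", "isControl": False},
    {"specimenID": "CTRL_A", "batchID": "B1", "isControl": True},
    {"specimenID": "S03", "batchID": "B2", "isControl": False},
    {"specimenID": "CTRL_A", "batchID": "B2", "isControl": True},
]
summary = summarize_batches(rows)
```

Because a control specimen appears once per batch, the specimen identifier alone will not be unique per run. The combination of specimen and batch is what distinguishes the rows here.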
I'm also a bit confused by the FDR. We probably need a more specific definition of what the FDR threshold is testing. We use many different thresholds (generally done on the q-values) for peptide detection and can also do it for significance testing. We don't do a protein group level q-value ... happy to explain why.
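One way to make the definition more specific would be to record what each threshold applies to alongside its value. The sketch below is only a suggestion: the field names and the 0.01 cutoff are hypothetical, not values we use or keys from the dictionary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FdrThreshold:
    level: str      # what is being tested, e.g. "peptide detection"
    statistic: str  # the score the cutoff applies to, e.g. "q-value"
    value: float    # the cutoff itself

def passing(q_values, threshold):
    """Return the sorted identifiers whose q-value is at or below the cutoff."""
    return sorted(name for name, q in q_values.items() if q <= threshold.value)

# Hypothetical threshold and q-values.
peptide_detection = FdrThreshold(level="peptide detection",
                                 statistic="q-value", value=0.01)
q_values = {"PEPTIDEA": 0.001, "PEPTIDEB": 0.01, "PEPTIDEC": 0.2}
detected = passing(q_values, peptide_detection)
```

A second record with a different `level` could describe a significance-testing cutoff without the two being confused under a single "FDR" key.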
Other info
We just need to make sure that we have the identifiers for the proteomics data and the animals/biospecimens well linked. Any data we might have is something we got from Catherine. It is the best way to minimize errors.
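A quick linkage check along these lines could catch mismatches before upload. This is a minimal sketch, and the function name and identifiers are hypothetical.

```python
def unlinked_ids(assay_ids, biospecimen_ids):
    """Report identifiers that appear in one metadata sheet but not the other."""
    assay, biospecimen = set(assay_ids), set(biospecimen_ids)
    return {
        "missing_from_biospecimen": sorted(assay - biospecimen),
        "missing_from_assay": sorted(biospecimen - assay),
    }

# Hypothetical identifiers from the proteomics sheet and the biospecimen sheet.
report = unlinked_ids(["S01", "S02", "S04"], ["S01", "S02", "S03"])
```

Two empty lists in the report would mean every proteomics run links to a biospecimen record and vice versa.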