bigbio / proteomics-sample-metadata

The Proteomics Experimental Design file format: Standard for experimental design annotation
GNU General Public License v2.0

How to indicate that a raw file contains mixed peptides of different species #364

Open daichengxin opened 4 years ago

daichengxin commented 4 years ago

The PXD014525 project uses the steps shown in the following figure. Each raw file contains a mixture of Yeast and HeLa proteins, and the dilution ratio differs between raw files. How should we properly represent such experiments?

image

ypriverol commented 4 years ago

Nothing stops you from putting two organisms in the same sample. You can also use a strategy like this:

Sample 1, Human, pooled 1, ms run 1,..
Sample 2, Yeast, pooled 1, ms run 1,...

Here, "pooled" is used to say that the two samples are in the same pool.

What do you think, @levitsky?

levitsky commented 4 years ago

Yes, in accordance with our discussion about pooled samples in #223 and the specification changes merged in #363, if individual samples can be annotated, then it is better to do so. The consensus seems to be that these annotations should be repeated for each pooled sample: two rows (Yeast and HeLa) for each of the five mix ratios (per the figure), for each experiment type (DDA and DIA). Since every sample goes in its own row, you cannot directly specify the mix ratio. Instead, I would try to use "sample amount", a term that we need anyway and which is currently being discussed in #328.

@ypriverol I'm not sure we need a "pool ID" like pool 1; at least it wasn't discussed before. I would say we don't need to state explicitly that the sample is pooled, since the assay name already ties the rows together:

| source name | characteristics[organism] | characteristics[sample amount]* | assay name |
| --- | --- | --- | --- |
| Sample 1 | Homo sapiens | 1 ug | ms run 1 |
| Sample 2 | Saccharomyces cerevisiae | 2 ug | ms run 1 |

*term not finalized yet

This is repeated for every mix with the same source names and other info but changing sample amounts.
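To make the layout above concrete, here is a minimal sketch that generates two rows per mix. The amounts and run names in `MIXES` are placeholders invented for illustration (they are not the ratios used in PXD014525), and `characteristics[sample amount]` is the not-yet-finalized term mentioned above.

```python
import csv
import io

# Hypothetical mixes: (assay name, HeLa amount, yeast amount).
# Placeholder values only; not the ratios used in PXD014525.
MIXES = [
    ("ms run 1", "1 ug", "2 ug"),
    ("ms run 2", "1 ug", "4 ug"),
]

HEADER = [
    "source name",
    "characteristics[organism]",
    "characteristics[sample amount]",  # term not finalized yet
    "assay name",
]


def build_rows(mixes):
    """Return two rows per mix: same source names, different amounts."""
    rows = []
    for assay, hela_amount, yeast_amount in mixes:
        rows.append(["Sample 1", "Homo sapiens", hela_amount, assay])
        rows.append(["Sample 2", "Saccharomyces cerevisiae", yeast_amount, assay])
    return rows


def to_tsv(rows):
    """Serialize the header and rows as tab-separated text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_tsv(build_rows(MIXES)), end="")
```

The source names stay fixed across mixes, so only the amount and assay name columns change from one pair of rows to the next.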

ypriverol commented 4 years ago

In this case, we have multiple samples without labeling (label-free). Therefore, I suggest using a column to specify that both samples are merged in the same file. For example, if I took a row from this file to merge it into another SDRF, I would like to know that the same ms run contains some other sample. A pool ID or pooled sample column would let you do that.

daichengxin commented 4 years ago

| Source Name | Characteristics[organism] | Characteristics[organism part] | Characteristics[developmental stage] | Characteristics[disease] | Characteristics[age] | Characteristics[sex] | Characteristics[ancestry category] | Characteristics[cell type] | Characteristics[cell line] | Characteristics[compound] | Characteristics[compound] | Characteristics[concentration of] | Characteristics[biological replicate] | Characteristics[enrichment process] | Characteristics[synthetic peptide] | Characteristics[sample amount] | Characteristics[assay name] | comment[data file] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sample 288 | Saccharomyces cerevisiae | not applicable | not available | not available | not available | not available | not applicable | not applicable | not applicable | Phosphatase | none | not applicable | not applicable | enrichment of phosphorylated Protein | not synthetic | 2 μg | run 352 | 20190122_QE5_nLC5_AH_LFQstoich_DIA_A_01.raw |
| Sample 289 | Saccharomyces cerevisiae | not applicable | not available | not available | not available | not available | not applicable | not applicable | not applicable | Mock | none | not applicable | not applicable | enrichment of phosphorylated Protein | not synthetic | 198 μg | run 352 | 20190122_QE5_nLC5_AH_LFQstoich_DIA_A_01.raw |
| Sample 290 | Homo sapiens | not applicable | adult | epithelial cervix carcinoma | 31Y | female | Black | epithelial cell | Hela | Phosphatase | none | not applicable | not applicable | enrichment of phosphorylated Protein | not synthetic | 50 μg | run 352 | 20190122_QE5_nLC5_AH_LFQstoich_DIA_A_01.raw |
| Sample 291 | Homo sapiens | not applicable | adult | epithelial cervix carcinoma | 31Y | female | Black | epithelial cell | Hela | Mock | none | not applicable | not applicable | enrichment of phosphorylated Protein | not synthetic | 50 μg | run 352 | 20190122_QE5_nLC5_AH_LFQstoich_DIA_A_01.raw |

Hi Yasset, this is part of my current annotation. Do you mean I need to add another column, such as a 'pool' column, to specify that both samples are merged in the same file?

levitsky commented 4 years ago

@ypriverol I'm not sure I understand what problem your suggestion is solving. If you include both HeLa and Yeast rows in some other SDRF and they have some explicit annotation as pooled, then how do you know from that annotation that no other sample was mixed in with them in the original SDRF? I think you just don't. Taking one row out of an SDRF table will not give anyone an adequate idea of the experiment, only the full table can do that.

Plus, the sample annotation in any given row should contain characteristics of that particular sample. A "Pool ID" or similar does not characterize the sample itself. It could be a comment but again, if it simply repeats the assay name, I don't see the benefit. Although probably no harm, either.

baimingze commented 3 years ago

Hi @ypriverol and @levitsky, after a discussion with @daichengxin, we found that a column storing the pool index would make things easier for a programmer or user who wants to merge two SDRF files.

Here is the reason: in some labeled projects, the "Assay name" can be the same when samples carry different labels, even though they are not pooled (if I understand "pool" correctly). This means that an identical "Assay name" cannot serve as the criterion for deciding whether samples are in the same pool. A user who wants to find the other samples in the same pool then needs a more complicated program.

For example, rows 2 and 16 in https://github.com/bigbio/proteomics-metadata-standard/blob/master/annotated-projects/PXD006877/sdrf.tsv#L16 have the same Assay Name (Run 1) but are labeled with SILAC light K:12C(6)14N(2) and SILAC heavy K:13C(6)15N(2), respectively. They are not pooled, yet they share an Assay Name. A user who wants to find other pooled samples has to exclude such cases, which makes the check more complicated.

levitsky commented 3 years ago

Hi @baimingze, indeed, without such a column the way to find out if the samples are "pooled" in a meaningful way would probably be to check for equality in both assay name and comment[label]. I am personally not sure that this would be too complicated to check or that it requires a separate column, but like I said above, I'm fine with it in the form of an optional (perhaps recommended) comment.
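For reference, a check on both assay name and comment[label] could look roughly like the sketch below. It is only an illustration: the toy table, the lowercase header spellings and the `find_pooled` helper are assumptions made for this example, not part of the specification or of any existing tool.

```python
import csv
import io
from collections import defaultdict

# Toy SDRF fragment; values are illustrative, not copied from a real project.
SDRF = (
    "source name\tassay name\tcomment[label]\n"
    "Sample 1\trun 1\tlabel free sample\n"
    "Sample 2\trun 1\tlabel free sample\n"
    "Sample 3\trun 2\tSILAC light\n"
    "Sample 4\trun 2\tSILAC heavy\n"
)


def find_pooled(sdrf_text):
    """Group source names by (assay name, label).

    Groups holding more than one distinct source name are treated as pooled.
    """
    groups = defaultdict(list)
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    for row in reader:
        key = (row["assay name"], row["comment[label]"])
        if row["source name"] not in groups[key]:
            groups[key].append(row["source name"])
    return {key: names for key, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    for (assay, label), names in find_pooled(SDRF).items():
        print(assay, label, names)
```

Grouping on assay name alone would also report Samples 3 and 4 as pooled; adding the label to the key keeps the SILAC light/heavy pair apart while still catching the label-free mix in run 1.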