bigbio / proteomics-sample-metadata

The Proteomics sample metadata: Standard for experimental design annotation in proteomics datasets
GNU General Public License v2.0

LFQBenchmark experiment - multiple organisms #568

Open brvpuyve opened 3 years ago

brvpuyve commented 3 years ago

Hi everyone,

I generated an updated LFQbenchmark dataset, similar to the one from Navarro et al. (https://pubmed.ncbi.nlm.nih.gov/27701404/). I was wondering how I could best annotate the mixtures (as pooled samples). Can I list more than one organism in the characteristics[organism] column? Additionally, would it be useful to add a comment column that defines the ratios of the three proteomes?

Looking forward to your suggestions!

Best,

Bart Van Puyvelde

mlocardpaulet commented 3 years ago

Hi Bart, did you get any help with this? I suspect you could use the field characteristics[pooled sample] and list in it all the samples that are pooled (SN=sample 1,sample 2, … sample 9, where "sample n" is the value of the corresponding sample in source name). For the relative quantities I am not sure. Others may have better ideas. Maybe you could use the key QY= to indicate relative quantity (as in characteristics[spiked compound]), but I am not sure how to make the sample names correspond to the quantities.

mlocardpaulet commented 3 years ago

Also, I don't know what to do if one of the pooled samples is not analysed on its own (so there is no .raw file associated with one of the sample names).

ypriverol commented 3 years ago

Hi @mlocardpaulet @brvpuyve :

First, my apologies for the late reply, I was off for a couple of weeks. Some weeks ago I was discussing with @anjaf how to represent multiplexed samples in an experiment.

We have two options here:

1- Represent each sample as an independent sample, add a characteristic called characteristics[concentration of] to each one, and link every sample to the same data file. The characteristics[organism] will be different for each sample. This is actually a clean representation because each sample has its own row and can be annotated with more characteristics. It differs from the current pooled approach mentioned by @mlocardpaulet, because in the pooled approach samples are used multiple times: in their corresponding channel and in the pool.

It will be something like:

| source name | characteristics[organism] | characteristics[organism part] | characteristics[biological replicate] | characteristics[concentration of] | assay name | comment[technical replicate] | comment[fraction identifier] | comment[label] | comment[data file] | characteristics[concentration of] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sample-1 | homo sapiens | heart | 1 | 70% | ms_run 1 | 1 | 1 | label free sample | file1.raw | 70% |
| Sample-2 | e coli | liver | 1 | 60% | ms_run 1 | 1 | 1 | label free sample | file1.raw | 60% |

As you can see, the assay name is the same for both rows, meaning that the file and the label conditions are the same.

2- @anjaf mentioned before the idea of having a characteristics[organism] value called mixed; we could then represent all the species in the sample in characteristics[pooled sample] as key-value pairs with concentrations.
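For reference, option 1 could be sketched in Python roughly as below. This is only an illustration, not part of any SDRF tooling: the helper names `mixture_rows` and `to_tsv` are hypothetical, the column order is copied from the example table in this comment, and the trailing repeated characteristics[concentration of] column is left out for brevity.

```python
import csv
import io

# Column order copied from the example table in this thread
# (the repeated trailing concentration column is omitted here).
COLUMNS = [
    "source name",
    "characteristics[organism]",
    "characteristics[organism part]",
    "characteristics[biological replicate]",
    "characteristics[concentration of]",
    "assay name",
    "comment[technical replicate]",
    "comment[fraction identifier]",
    "comment[label]",
    "comment[data file]",
]


def mixture_rows(data_file, assay_name, components):
    """Return one row per mixture component, all linked to the same data file.

    components: list of (source_name, organism, organism_part, concentration).
    """
    rows = []
    for source_name, organism, organism_part, concentration in components:
        rows.append({
            "source name": source_name,
            "characteristics[organism]": organism,
            "characteristics[organism part]": organism_part,
            "characteristics[biological replicate]": "1",
            "characteristics[concentration of]": concentration,
            "assay name": assay_name,  # identical for every component
            "comment[technical replicate]": "1",
            "comment[fraction identifier]": "1",
            "comment[label]": "label free sample",
            "comment[data file]": data_file,  # identical for every component
        })
    return rows


def to_tsv(rows):
    """Serialise the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = mixture_rows(
    "file1.raw",
    "ms_run 1",
    [
        ("Sample-1", "homo sapiens", "heart", "70%"),
        ("Sample-2", "e coli", "liver", "60%"),
    ],
)
print(to_tsv(rows))
```

The point of the sketch is that only the sample-level columns change between rows, while the assay name, label and data file are repeated.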

Would be great to have your opinion @anjaf @jgriss @mvaudel @mlocardpaulet @all @bigbio/collaborators

mlocardpaulet commented 3 years ago

Hi @ypriverol thanks a lot. I like option 1- very much. So to be clear: there will be duplicated file names?

brvpuyve commented 3 years ago

Option 1 is maybe the best approach although it will be some work for me to add the extra lines :-) Let me know what is decided and I will create the SDRF's.

Thanks for the comments!

ypriverol commented 3 years ago

> Hi @ypriverol thanks a lot. I like option 1- very much. So to be clear: there will be duplicated file names?

Yes. We have the same case when multiple samples are multiplexed in the same RAW file.

enryH commented 3 years ago

I guess option 1 is fine if the Python client can identify such a case?
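To make the question concrete, here is a minimal sketch of how such a case could be detected. It is not the API of the existing Python client: `shared_data_files` is a hypothetical helper, and grouping by data file plus label is my assumption, chosen so that ordinary multiplexed channels (same file, different labels) are not flagged.

```python
import csv
import io
from collections import defaultdict


def shared_data_files(sdrf_text):
    """Return {(data file, label): [source names]} for groups with more than one row.

    csv.reader is used instead of csv.DictReader because SDRF headers may
    repeat column names, which a dict per row would silently collapse.
    """
    reader = csv.reader(io.StringIO(sdrf_text), delimiter="\t")
    header = next(reader)
    source_idx = header.index("source name")
    label_idx = header.index("comment[label]")
    file_idx = header.index("comment[data file]")

    sources_by_key = defaultdict(list)
    for row in reader:
        if not row:  # skip blank lines
            continue
        key = (row[file_idx], row[label_idx])
        sources_by_key[key].append(row[source_idx])

    return {key: names for key, names in sources_by_key.items() if len(names) > 1}


example = "\n".join([
    "source name\tcomment[label]\tcomment[data file]",
    "Sample-1\tlabel free sample\tfile1.raw",
    "Sample-2\tlabel free sample\tfile1.raw",
    "Sample-3\tlabel free sample\tfile2.raw",
])
print(shared_data_files(example))
# {('file1.raw', 'label free sample'): ['Sample-1', 'Sample-2']}
```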

jgriss commented 3 years ago

Hi all,

We already have this case covered to some extent for isobarically labelled experiments (see PXD017799 as an example). Here, we also have mixtures of multiple, independent samples in one raw file.

I therefore strongly suggest staying consistent with the design approach that was chosen there, which essentially is what @ypriverol mentioned as option 1.

In the case of isobarically labelled experiments, this could even be extended to have multiple rows referencing the same channel in the raw file.

@enryH

mlocardpaulet commented 2 years ago

Hello again, sorry it took me so long to come back to this. I am looking at the headers that have been used in the SDRF files generated to date, and I see that characteristics[concentration of] is used to define the concentration of compounds defined in characteristics[compound]. So if we go with option 1 (if I understood correctly: one row per sample in the pool, with the respective quantities annotated in characteristics[concentration of]), can you distinguish the two usages of characteristics[concentration of]? Could this be an issue?

enryH commented 2 years ago

Hmm. If there is characteristics[organism] and characteristics[compound] then I guess it has to be ordered, but I am not 100% sure about this:

characteristics[organism] characteristics[concentration of] characteristics[compound] characteristics[concentration of]

Could you explain the type of experiment where this is an issue?

But I agree that this could be an issue if it leads to ambiguous interpretations.
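If the ordering idea holds, a reader of the file could resolve each characteristics[concentration of] column by position. The sketch below assumes exactly that rule (each concentration column qualifies the nearest preceding characteristics column), which I am not sure is what the specification intends; `pair_concentrations` is a hypothetical name.

```python
CONCENTRATION = "characteristics[concentration of]"


def pair_concentrations(header):
    """Pair each concentration column with the characteristic it is assumed to qualify.

    Returns a list of (qualified column name, concentration column index).
    Assumption: a concentration column refers to the nearest preceding
    characteristics column that is not itself a concentration column.
    """
    pairs = []
    previous = None
    for index, name in enumerate(header):
        if name == CONCENTRATION:
            if previous is None:
                raise ValueError(
                    "concentration column at index %d has nothing to qualify" % index
                )
            pairs.append((previous, index))
        elif name.startswith("characteristics["):
            previous = name
    return pairs


header = [
    "characteristics[organism]",
    "characteristics[concentration of]",
    "characteristics[compound]",
    "characteristics[concentration of]",
]
print(pair_concentrations(header))
# [('characteristics[organism]', 1), ('characteristics[compound]', 3)]
```

With a rule like this, the two usages stay distinguishable as long as the column order is kept; without it, the repeated column name alone is ambiguous.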

mlocardpaulet commented 2 years ago

Hello,

I guess you are right, I cannot see an example where it would be used.