bigbio / proteomics-sample-metadata

The Proteomics sample metadata: Standard for experimental design annotation in proteomics datasets
GNU General Public License v2.0

Mixed samples #223

lisavetasol closed 4 years ago

lisavetasol commented 4 years ago

How should a mixed sample be reflected in the annotation? For example, different individuals may be analyzed separately and also pooled together, or the sample may simply be commercial plasma mixed from many different people. Is it okay to use "not applicable" in comment[individual] and "mixed_0", "mixed_1" (or "commercial_plasma_0" in the second case) and so on in the source name column?

ypriverol commented 4 years ago

@anjaf can you help us with this? I had the same problem before with a TMT experiment where a channel is used for a pool of all samples (https://github.com/bigbio/proteomics-metadata-standard/blob/master/annotated-projects/PXD017710/sdrf-tmt.tsv#L209). This is really common because you merge samples into a single RAW file to boost the "signal" you are looking for.

This is a really important issue because we have a lot of experiments with pooled or mixed samples.

anjaf commented 4 years ago

I think there are a few different options, depending on what works for you. Based on the example here, I would try to avoid "not applicable" and go for "mixed x and y" to be most descriptive. Alternatively, you could create one row per sample that goes in the mix. Then you can annotate them individually in their separate row. And then each sample of the mix is linked with the same assay. Obviously this would bloat the SDRF somewhat (as assay rows need to be repeated). And this might make it a bit less clear that this was a mix of samples that was the input of the assay. Also it might require changes in the spec and the validator.

levitsky commented 4 years ago

The second solution makes sense given that a row in SDRF represents a relation between a sample and an assay. However, for a small dataset with 15 files where the sample is pooled from 20 subjects, we end up with 300 rows. It's good for automated processing, though, and duplication of data for the same "assay"/raw file would be on the same scale as with labelled experiments.
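To make the arithmetic concrete, here is a minimal Python sketch of the one-row-per-sample expansion. The file names, subject names and the `expand_pooled_rows` helper are made up for illustration; they are not part of the SDRF tooling.

```python
from itertools import product


def expand_pooled_rows(raw_files, subjects):
    """Return one (source name, raw file) row per subject per raw file."""
    return [(subject, raw_file) for raw_file, subject in product(raw_files, subjects)]


raw_files = [f"run_{i:02d}.raw" for i in range(1, 16)]   # 15 hypothetical raw files
subjects = [f"subject_{i:02d}" for i in range(1, 21)]    # 20 hypothetical pooled subjects

rows = expand_pooled_rows(raw_files, subjects)
print(len(rows))  # 300
```

Every raw file repeats the full list of subjects, which is where the 15 x 20 = 300 rows come from.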

levitsky commented 4 years ago

@ypriverol what do you think? Would be nice to have a consensus on this.

ypriverol commented 4 years ago

The current file format, @anjaf @levitsky @jgriss, is quite de-normalized if you consider the following:

As @levitsky mentioned, for 15 samples we can easily end up with over 300 rows. In addition, pooled samples are not relevant biologically, only technically.

I suggest the following approach, for all pooled samples:

Material -> NT=Pooled Sample;PS=Sample1,Sample2,Sample.... We use a key-value pair structure where NT is Pooled Sample (https://www.ebi.ac.uk/ols/ontologies/ncit/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FNCIT_C165587) and PS (Pooled Samples) holds the references to the identifiers of all the samples in the mix.
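As a rough illustration of how such a cell could be read by a parser, here is a small Python sketch. The `parse_pooled_cell` function is hypothetical, and the assumption that keys are separated by `;` and sample references by `,` is taken only from the example string above.

```python
def parse_pooled_cell(cell):
    """Parse 'NT=Pooled Sample;PS=a,b,c' into (name, [source names])."""
    fields = {}
    for part in cell.split(";"):
        key, _, value = part.partition("=")
        fields[key.strip()] = value.strip()
    # PS is assumed to be a comma-separated list of source names
    members = [s.strip() for s in fields.get("PS", "").split(",") if s.strip()]
    return fields.get("NT"), members


name, members = parse_pooled_cell("NT=Pooled Sample;PS=Sample1,Sample2,Sample3")
print(name, members)  # Pooled Sample ['Sample1', 'Sample2', 'Sample3']
```

A sample identifier that itself contains a comma or semicolon would break this simple split, so the format would need an escaping rule or a restriction on source names.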

Opinions @levitsky @anjaf @jgriss

levitsky commented 4 years ago

How would the other rows look? Those where the samples are actually annotated.

Each row captures the relation between one sample and one raw file. It seems like you imply that the sample is annotated in a row for an MS run that corresponds to the analysis of this individual sample, and then we "link" to this row from a pooled sample row. But a sample may only ever be present in a pool. In that case, the only raw file you have is the pooled sample's raw file, so it doesn't make sense to me to have a row for the pooled sample at all, or you would have 11 rows for a pool of 10 samples.

I hope I'm making sense to you :)

jgriss commented 4 years ago

Hi @ypriverol ,

I agree with @levitsky. Although I like the idea to explicitly annotate pooled samples, you'd still have to annotate the actual samples somewhere. And this breaks the concept of one raw file - one sample relationship in the format.

There are several cases where the files will explode quite a bit due to multiplexed or pooled samples (see #352). But there we were thinking of solving this issue with improved visualisation tools rather than by changing the format.

Adding such exceptions to the primary underlying rule (one sample - one raw file) can quickly and greatly increase the complexity of parsing the files.

ypriverol commented 4 years ago

My idea is the following:

Sample 1 ... Homo Sapiens... condition 1... msrun 1... TMT131... file.raw
Sample 2 ... Homo Sapiens... condition 2... msrun 1... TMT131C... file.raw

These are the samples as we know them in multiplex experiments: multiple samples can be in the same file.raw. This representation is what we have now. The problem is that sometimes in TMT (I guess SILAC is the same) we have an additional channel, called pooled, where all samples are merged.

Sample n ... Homo Sapiens... pooled... msrun 1... TMT128 ... file1.raw

This last sample is a mix of all samples. Because it is only used by the analytical/bioinformatics method, we don't really need to annotate all the sample metadata again. We just need to make clear that multiple samples are mixed, and which ones they are. See my comment: https://github.com/bigbio/proteomics-metadata-standard/issues/223#issuecomment-650731421

jgriss commented 4 years ago

@ypriverol Now I understand. I think that can be a useful addition to the format.

Essentially, allowing "pooled" samples to reference pre-existing source names

ypriverol commented 4 years ago

Yes that is the point.

levitsky commented 4 years ago

This does look useful but how do we limit the scope of this suggestion? Do we just say that we can always refer to any source names present in another row in the file? Let's say I have two pools in the dataset and for the first one I have to list all the samples, but for the second one five out of ten samples are already annotated as parts of the first pool. How do I annotate the second pool?
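For illustration, one possible reading is that a pooled row may reference any source name that is annotated individually elsewhere in the file, so overlapping pools need no extra rows. The Python sketch below checks that rule; the row layout, the `pooled` key and the `unresolved_references` helper are assumptions made for this example only.

```python
def unresolved_references(rows):
    """Return pool references that match no individually annotated source name.

    `rows` is a list of dicts with a 'source name' key and an optional
    'pooled' key holding the referenced source names (hypothetical layout).
    """
    annotated = {r["source name"] for r in rows if not r.get("pooled")}
    missing = {}
    for r in rows:
        unknown = [s for s in r.get("pooled", []) if s not in annotated]
        if unknown:
            missing[r["source name"]] = unknown
    return missing


rows = [
    {"source name": "Sample 1"},
    {"source name": "Sample 2"},
    {"source name": "Sample 3"},
    {"source name": "Pool A", "pooled": ["Sample 1", "Sample 2"]},
    {"source name": "Pool B", "pooled": ["Sample 2", "Sample 3", "Sample 4"]},
]
print(unresolved_references(rows))  # {'Pool B': ['Sample 4']}
```

Here "Sample 2" is shared by both pools without being annotated twice, while "Sample 4" is flagged because no row describes it.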

ypriverol commented 4 years ago

This is up to the submitter. What we are trying to guarantee is that if someone has a pooled or mixed sample, they have a way to annotate it without needing to annotate everything again. How they mix the samples is really up to the submitter and annotator.

levitsky commented 4 years ago

So for a pooled sample you can refer to any source name within the file, or you can repeat rows as before, in all cases?

Also, I can't help but wonder, what exactly is the problem with larger files and repeated data? SDRF functions as an associative table for a many-to-many relationship between samples and files and as such inherently contains repetitions. I feel like it's really by design. It's repetitive but easy to work with, programmatically.
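A minimal Python sketch of what "easy to work with" means here: with one row per sample-file pair, both directions of the many-to-many relationship fall out of a single pass. The rows and column order are illustrative only, not taken from a real SDRF file.

```python
from collections import defaultdict

# (source name, label, raw file) rows; values are illustrative only
rows = [
    ("Sample 1", "TMT126", "file_a.raw"),
    ("Sample 2", "TMT127", "file_a.raw"),
    ("Sample 1", "TMT126", "file_b.raw"),
]

samples_by_file = defaultdict(list)
files_by_sample = defaultdict(list)
for source_name, label, raw_file in rows:
    samples_by_file[raw_file].append(source_name)
    files_by_sample[source_name].append(raw_file)

print(dict(samples_by_file))
print(dict(files_by_sample))
```

No cell needs to be parsed further, which is the property a reference syntax inside cells would give up.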

Yes, we can cut down quite a lot on size, but why do we need to? We already have gzip compression for efficient storage. For human readability, #352 discusses visualization tools and this is a great idea. My strong opinion has always been this: any inconvenience to human readers or human annotators should be mitigated by improving reading/annotation software (an annotator tool or some kind of annotation viewer) and not by complicating the standard. Ultimately the utility of SDRF is that it enables automated data processing, and if we make it difficult to parse then its value is lost.

I'm not saying this specific proposal is necessarily harmful, but I'm asking if there is a clear benefit in it and if complicating the standard really is the best way to pursue that benefit.

P.S. Another comment I have: you are suggesting to use Material (do you mean Material Type?). I suggest using a dedicated column for this, like characteristics[pooled sample]. We can still use key-value style, but I'm not sure what other keys would be needed except for PS. For Material Type the previous decision was to treat it as a no-op.

ypriverol commented 4 years ago

Let's have the input of more people: @jgriss @anjaf @mlocardpaulet @veitveit @mvaudel

StSchulze commented 4 years ago

I basically fully agree with @levitsky , a clear and consistent annotation that can be automatically processed appears more important to me than some arguably minor inconvenience for the annotator. That's why we had decided to have multiple rows for files that are multiplexed (#327).

Generally I would also vote against the introduction of a new column, since this problem should be solved within existing condition and label columns (otherwise you end up with the problem of what to enter in those columns).

I'm not completely familiar with what different experimental setups exist, but in principle I can imagine the following scenarios:

  1. samples from different conditions are labeled before mixing them (i.e. classical multiplexing). This would apply to all SILAC experiments (if I'm not wrong) and most iTRAQ/TMT experiments. --> For a raw file from a pool of 10 samples, each with a unique label, you would end up with 10 rows. That's what was covered in #327.

  2. samples from different conditions are mixed before labeling (or not labeled at all). --> Adding a new row for each condition still works, the label and file name columns would stay the same, only the condition column changes. As others mentioned, this bloats up the file, sure, but that can be dealt with in different ways.

  3. In TMT/iTRAQ experiments, you can have one channel that was labeled after mixing all samples, and you can mix this pooled sample into other multiplexed samples (for normalization purposes). So taking @ypriverol's example, this would look like this (it's all the same raw file):

Sample 1 ... Homo Sapiens... condition 1... msrun 1... TMT131... file.raw
Sample 2 ... Homo Sapiens... condition 2... msrun 1... TMT131C... file.raw
Sample n ... Homo Sapiens... pooled... msrun 1... TMT128 ... file.raw

--> Adding new rows for each sample of the pool, for each file in which that pooled sample has been multiplexed with other samples, bloats the file considerably and would arguably be more than just a minor inconvenience for the annotator. So I understand the desire to use a "pooled" annotation here. However, one alternative could be to add multiple columns instead of multiple rows. In general it is allowed to have multiple columns with the same title, so you can just add multiple condition columns for the pooled sample that cover all the different conditions. The example would look like this:

Sample n ... Homo Sapiens... condition 1 ... condition 2 ... msrun 1... TMT128 ... file.raw

This would still bloat the file a bit, but since I would expect that only a few columns are duplicated (i.e. a few properties of the sample are changing), the overall duplication would be way less than adding rows.
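As a sketch of what repeated column titles mean for a parser, the Python snippet below collects every value under the same title into a list. The column titles and the inline TSV are hypothetical; note that a plain dict keyed by column title would keep only one of the duplicates.

```python
import csv
import io
from collections import defaultdict

# Hypothetical TSV with a repeated column title; column names are illustrative
tsv = (
    "source name\tfactor value[condition]\tfactor value[condition]\tcomment[label]\n"
    "Sample n\tcondition 1\tcondition 2\tTMT128\n"
)

reader = csv.reader(io.StringIO(tsv), delimiter="\t")
header = next(reader)
for record in reader:
    values = defaultdict(list)
    for title, value in zip(header, record):
        values[title].append(value)  # repeated titles accumulate instead of overwriting
    print(dict(values))
```

Rows that are not pooled would presumably need a placeholder in the extra columns, so every reader has to handle list-valued cells either way.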

levitsky commented 4 years ago

Thank you @StSchulze, to quickly follow up on your comment:

  1. By breaking down possible use cases, you are reinforcing my question about limiting the scope of this suggestion. If we only use it in case 3, it's one thing; using it freely for any repeated sample is different. I think in the specific case you and @ypriverol describe it could be applied without breaking other logic.

  2. I'm not sure I agree with your point about using "condition" columns and repeating them. I think any column can be a "factor" column that differentiates between samples; the key point of the suggestion is that we link the sample identities, not just some characteristics. As such, if we go with repeating columns in pools, we should be repeating source name.

    However, this doesn't seem very scalable, as the number of samples pooled is virtually unlimited, so listing them in one cell would perhaps be easier.

    Using a dedicated column seems logical to me because "being a pooled sample" is a separate characteristic and using any other column to convey this would be a stretch. As for what to put in other columns like "condition", I think we should simply put not applicable.
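A minimal Python sketch of what such a pooled row could look like when written out. The column set is shortened and partly hypothetical, the `pooled_row` helper is made up for this example, and listing the pooled source names comma-separated in one cell is only one possible encoding.

```python
import csv
import io

# Shortened, partly hypothetical column set for illustration
COLUMNS = ["source name", "condition", "characteristics[pooled sample]", "label", "data file"]


def pooled_row(source_name, members, label, data_file):
    """Build a row for a pooled sample that references existing source names."""
    return {
        "source name": source_name,
        "condition": "not applicable",  # no single condition applies to the pool
        "characteristics[pooled sample]": ",".join(members),
        "label": label,
        "data file": data_file,
    }


buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerow(pooled_row("Sample n", ["Sample 1", "Sample 2"], "TMT128", "file.raw"))
print(buffer.getvalue())
```

What non-pooled rows should carry in the dedicated column is left open here.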

StSchulze commented 4 years ago

True, I had overlooked the issue of associating specific samples/source names with the pooled sample. Taking that into account I would agree that the dedicated column as a separate characteristic sounds like the best idea. And I agree that it should be limited to cases like 3.