bigbio / proteomics-sample-metadata

The Proteomics sample metadata: Standard for experimental design annotation in proteomics datasets
GNU General Public License v2.0

Biological replicate and technical replicate mandatories #379

Closed ypriverol closed 4 years ago

ypriverol commented 4 years ago

Hi all:

We have touched on this in other issues (#267, #218), but never as the central point of the discussion. I presented the format to HUPO-PSI today, and one suggestion was to make biological replicate and technical replicate mandatory in the file format. The biological replicate would go in the characteristics and the technical replicate in the comment. Could you please give some feedback?

@levitsky @jgriss @timosachsenberg @foellmelanie @mvaudel @mlocardpaulet @david-bouyssie @veitveit

mlocardpaulet commented 4 years ago

If the biological replicate is provided, then it is possible to automatically find the technical replicates right? So it can be a bit redundant. My only concern is that technical replicate can be different things depending on the person/team/field. It can be replicate injection, replicate sample preparation from the same biological sample... How do we want to handle this?

jgriss commented 4 years ago

Hi all,

How do you imagine the biological replicate being used? Would this refer to something like the treatment group?

Also I agree with @mlocardpaulet that technical replicate is not a well defined term / concept. And again, I am not 100% sure what to expect in this field. In technical replicates I'd expect that the source name would be identical. Just measured in different runs. Or am I mistaken there?

ypriverol commented 4 years ago

First a little background:

Multiple packages now need the definition of a technical and a biological replicate, including Perseus and MSstats (ping @timosachsenberg and @MeenaChoi). For example, MSstats captures the following fields:

- Run : MS run ID. It should be the same as Raw.file info in raw.mq
- Channel : Labeling information (channel.0, ..., channel.9).
- Condition : Condition (ex. Healthy, Cancer, Time0)
- Mixture : Mixture of samples labeled with different TMT reagents, which can be analyzed in a single mass spectrometry experiment. If the channel doesn't have a sample, please add `Empty` under Condition.
- **TechRepMixture**: Technical replicate of one mixture. One mixture may have multiple technical replicates. For example, if `TechRepMixture` = 1, 2 are the two technical replicates of one mixture, then they should match with same `Mixture` value.
- Fraction : Fraction ID. One technical replicate of one mixture may be fractionated into multiple fractions to increase the analytical depth. Then one technical replicate of one mixture should correspond to multiple fractions. For example, if `Fraction` = 1, 2, 3 are three fractions of the first technical replicate of one TMT mixture of biological subjects, then they should have same `TechRepMixture` and `Mixture` value.
- **BioReplicate**: Unique ID for the biological subject. If the channel doesn't have a sample, please add `Empty` under BioReplicate.
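To make the relationship between these fields concrete, here is a minimal sketch of a consistency check over annotation rows. It is only an illustration: `check_annotation` is a hypothetical helper, not part of MSstats, and the rule that one run maps to a single (Mixture, TechRepMixture, Fraction) combination is my reading of the field descriptions above, not something the package documentation is quoted on here.

```python
# Column names taken from the field list above.
REQUIRED = ("Run", "Channel", "Condition", "Mixture",
            "TechRepMixture", "Fraction", "BioReplicate")


def check_annotation(rows):
    """Return a list of human-readable problems found in annotation rows.

    `rows` is a list of dicts, one per (run, channel) line of the annotation.
    """
    problems = []
    run_to_design = {}
    for i, row in enumerate(rows):
        missing = [c for c in REQUIRED if c not in row]
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
            continue
        # An empty channel is marked 'Empty' in both Condition and BioReplicate.
        if (row["Condition"] == "Empty") != (row["BioReplicate"] == "Empty"):
            problems.append(
                f"row {i}: 'Empty' should be set in both Condition and BioReplicate")
        # Assumption: a run is one fraction of one technical replicate of one mixture.
        design = (row["Mixture"], row["TechRepMixture"], row["Fraction"])
        seen = run_to_design.setdefault(row["Run"], design)
        if seen != design:
            problems.append(
                f"row {i}: run {row['Run']} maps to both {seen} and {design}")
    return problems


if __name__ == "__main__":
    example = [
        {"Run": "run1.raw", "Channel": "channel.0", "Condition": "Healthy",
         "Mixture": "1", "TechRepMixture": "1", "Fraction": "1",
         "BioReplicate": "S1"},
        {"Run": "run1.raw", "Channel": "channel.1", "Condition": "Empty",
         "Mixture": "1", "TechRepMixture": "1", "Fraction": "1",
         "BioReplicate": "Empty"},
    ]
    print(check_annotation(example))
```

The file names and sample IDs in the example are made up.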

What counts as a technical replicate and a biological replicate can be defined in the file format; the other fields will depend on the submitter annotating them properly. If biological and technical replicates are not available, we should add 1, as we do now for fractions.

@mvaudel recently published a manuscript on arXiv that can probably help to define technical and biological replicates well.

The terms in EFO are now:

levitsky commented 4 years ago

Defining the meaning of technical and biological replicates is non-trivial, but submitters will tend to use them in the same sense that they used in the study itself.

> If the biological replicate is provided, then it is possible to automatically find the technical replicates right? So it can be a bit redundant.

I don't think technical replicates should be guessed just from the fact that the samples are identical in other parameters.

> In technical replicates I'd expect that the source name would be identical. Just measured in different runs. Or am I mistaken there?

This is a great question: while we cannot decisively define the meanings of the terms used in the whole field, and those will reflect on the annotations, we need to foresee and define how these terms will play with other annotated fields. I think source name is definitely identical for technical replicates. For biological replicates they are apparently supposed to be different, since biological replicates are "biologically distinct". To try and state a general question: sometimes "biologically distinct" samples are annotated and their distinctions are already reflected in the annotation (e.g. different individual). In such cases, do we say they are biological replicates or not? This is something that feels more like a characteristic of the study than of the sample, kind of like "factor value": two samples can be treated as biological replicates or not, depending on the goal of the study. Is my understanding correct?

If this is the case, maybe it shouldn't be in characteristics. Can we define the usage of biological replicate so that it is a sample characteristic?

ypriverol commented 4 years ago

> Defining the meaning of technical and biological replicates is non-trivial, but submitters will tend to use them in the same sense that they used in the study itself.

Actually, I think technical and biological replicates are quite stable concepts, so we don't need to define them. Of course, it will be up to the user to apply the correct definition.

> > If the biological replicate is provided, then it is possible to automatically find the technical replicates right? So it can be a bit redundant.
>
> I don't think technical replicates should be guessed just from the fact that the samples are identical in other parameters.

Agreed!

> > In technical replicates I'd expect that the source name would be identical. Just measured in different runs. Or am I mistaken there?

Yes, technical replicates should have the same sample characteristics and the same source name.

> This is a great question: while we cannot decisively define the meanings of the terms used in the whole field, and those will reflect on the annotations, we need to foresee and define how these terms will play with other annotated fields. I think source name is definitely identical for technical replicates. For biological replicates they are apparently supposed to be different, since biological replicates are "biologically distinct". To try and state a general question: sometimes "biologically distinct" samples are annotated and their distinctions are already reflected in the annotation (e.g. different individual). In such cases, do we say they are biological replicates or not? This is something that feels more like a characteristic of the study than of the sample, kind of like "factor value": two samples can be treated as biological replicates or not, depending on the goal of the study. Is my understanding correct?
>
> If this is the case, maybe it shouldn't be in characteristics. Can we define the usage of biological replicate so that it is a sample characteristic?

I think we previously agreed that biological replicate is a characteristic; what we should decide is whether it is mandatory or optional. Even if some fields are redundant, remember that, for example, individual applies only to human samples. I think the property is needed because it will define best practices for annotation.

mvaudel commented 4 years ago

Hi,

I agree that information on "replicates" should be mandatory, but it should still be possible to publish a standalone technical data set that does not include any specific design, e.g. a single run on a sample of interest.

As you say, technical vs biological replicates are not necessarily clear and universal concepts, and they put the emphasis on the sample labeling (e.g. replicate 1 of 5), which is essentially useless and opens the door to (re)experimenter biases. I would therefore recommend staying away from this terminology and focusing instead on the variables used in the data analysis. If people did a drug vs placebo study controlling for sex and batch number, we need these variables for all samples. How these variables are named (e.g. biological/technical) differs between fields. In the preprint @ypriverol refers to (https://arxiv.org/abs/2007.06336), we call variables of interest "treatment" variables and variables you control for "control" variables.

Hope it helps,

Marc

david-bouyssie commented 4 years ago

Very interesting discussion and complicated topic. One big issue is that a given proteomics analysis performed on groups, samples, injections and so on may be analysed from different angles/perspectives. It means that for a single data matrix we may want to perform different statistical analyses and thus define different experimental designs. In this case, shall we write one SDRF file for each experimental design?

Regarding the replicates discussion, I agree with you guys that it will be hard to define them precisely. Marc's suggestion to describe variables/factors is a good way to go. I think we mainly need to think about the way this information will be read and reused. Usually in the field, the technical replicate information is ignored at the statistical analysis step, and most of the time technical replicates are averaged before the stats. Other variables/factors/conditions are used to compute the tests. Even if the nature of technical replicates can be very different in terms of sample/data acquisition, they are very often analysed in the same way. However, although it is a common practice, it may be dangerous to make it a universal rule.
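For readers less familiar with that practice, a minimal sketch of "averaging technical replicates before the stats" could look like the following. The tuple layout and the function name are hypothetical, and a plain arithmetic mean is only one possible choice; as noted above, this should not be read as a universal rule.

```python
from collections import defaultdict
from statistics import mean


def average_technical_replicates(measurements):
    """Collapse technical replicates into one value per biological replicate.

    `measurements` is an iterable of
    (condition, biological_replicate, technical_replicate, value) tuples.
    Returns {(condition, biological_replicate): mean value}.
    """
    groups = defaultdict(list)
    for condition, bio_rep, _tech_rep, value in measurements:
        # The technical replicate index is deliberately dropped here.
        groups[(condition, bio_rep)].append(value)
    return {key: mean(values) for key, values in groups.items()}


if __name__ == "__main__":
    data = [
        ("Drug", 1, 1, 10.0),
        ("Drug", 1, 2, 12.0),
        ("Placebo", 1, 1, 8.0),
    ]
    print(average_technical_replicates(data))
```

Only the remaining keys (condition, biological replicate) would then feed into the statistical tests, which is exactly why the annotation needs to say which rows are technical replicates of which sample.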

ypriverol commented 4 years ago

A couple of ideas:

First, thanks for the feedback @david-bouyssie @mvaudel. The current SDRF tries to capture as much as possible of the experimental design for a PX or public dataset submission. I know the same data can be organized in multiple ways, but the main purpose is to capture what has been deposited. If we want to combine or analyze the data in different ways, then yes, new SDRFs should be created that represent the corresponding analysis. In fact, PRIDE submission now allows adding multiple SDRF files.

The technical and biological replicate are two properties that help to understand how the submitters consider the samples in their analysis. My aim in proposing that the fields become mandatory is to encourage best practices in how data is deposited. About this comment from @mvaudel:

> on "replicates" should be mandatory, but it should still be possible to publish a standalone technical data set that does not include any specific design. Eg a single run on a sample of interest.

This works the way we handle fractions now: you can annotate an experiment with no fractionation and it will add fractions 1, 1, 1. I think we can use the same approach for replicates. I really think most experiments have technical or biological replicates; if we leave the fields optional they will not be annotated, and this is my concern.
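A rough sketch of that default-filling idea, for illustration only. The column names `characteristics[biological replicate]` and `comment[technical replicate]` follow the placement proposed at the top of this issue and are an assumption here, as is the helper `fill_replicate_defaults`; this is not the behaviour of any existing validator.

```python
# Assumed column names, following the characteristics/comment placement proposed above.
BIO = "characteristics[biological replicate]"
TECH = "comment[technical replicate]"


def fill_replicate_defaults(rows, default="1"):
    """Return copies of the rows with missing or blank replicate fields set to `default`.

    `rows` is a list of dicts, one per SDRF line. The input is not modified.
    """
    filled = []
    for row in rows:
        new_row = dict(row)
        for column in (BIO, TECH):
            if not str(new_row.get(column, "")).strip():
                new_row[column] = default
        filled.append(new_row)
    return filled


if __name__ == "__main__":
    single_shot = [{"source name": "sample 1"}]
    print(fill_replicate_defaults(single_shot))
```

A single-shot experiment then ends up with both replicate fields set to 1, in the same way an unfractionated experiment gets fraction 1.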

mvaudel commented 4 years ago

Yes, handling these variables like you do for the fractions would do the trick for single shots.

I would recommend making one treatment variable mandatory, while allowing more to be included. Conversely, control variables should be optional. Here, treatment refers to the variables used in the design so that their levels can be compared (e.g. drug vs placebo), and control refers to the ones that were included to be controlled for (e.g. sex or batch number) but are possibly confounded. Then, people reanalyzing the data will know what you attempted to make comparable and what they need to be careful about.

Note that the aim is not to provide metadata on what statistical model/test was used, but just on what variables were used in the design. So if the experiment was designed to have two treatment variables (e.g. drug and sex) but in the paper they were used in separate models, or one of them was used as control, that is irrelevant: if they can be used as treatment variables, they should both be reported as such.

In contradiction to my previous comment, I would also recommend annotating variables that have been (or should have been) randomized, and making the annotation of the sample processing order mandatory.

Example:

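| Rank | Treatment (Drug) | Control (Sex) | Control (Batch) |
|------|------------------|---------------|-----------------|
| 4 | Placebo | M | 1 |
| 11 | Placebo | M | 1 |
| 10 | Placebo | M | 1 |
| 5 | Placebo | F | 2 |
| 6 | Placebo | F | 2 |
| 7 | Placebo | F | 2 |
| 9 | Drug | M | 1 |
| 2 | Drug | M | 1 |
| 3 | Drug | M | 1 |
| 8 | Drug | F | 2 |
| 12 | Drug | F | 2 |
| 1 | Drug | F | 2 |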
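To show what someone reanalyzing the data could do with such an annotation, here is a small sketch that recovers the processing order from the rank and checks whether two annotated variables are fully confounded. The dictionary keys and both helper functions are hypothetical names chosen for this example; the data are the twelve rows of the table above.

```python
from collections import defaultdict

# The twelve samples from the example table: (rank, drug, sex, batch).
_ROWS = [
    (4, "Placebo", "M", 1), (11, "Placebo", "M", 1), (10, "Placebo", "M", 1),
    (5, "Placebo", "F", 2), (6, "Placebo", "F", 2), (7, "Placebo", "F", 2),
    (9, "Drug", "M", 1), (2, "Drug", "M", 1), (3, "Drug", "M", 1),
    (8, "Drug", "F", 2), (12, "Drug", "F", 2), (1, "Drug", "F", 2),
]
DESIGN = [dict(zip(("rank", "drug", "sex", "batch"), row)) for row in _ROWS]


def processing_order(rows):
    """Return the samples sorted by the order in which they were processed."""
    return sorted(rows, key=lambda row: row["rank"])


def is_confounded(rows, a, b):
    """True if each level of variable `a` occurs with exactly one level of `b`, and vice versa."""
    a_to_b, b_to_a = defaultdict(set), defaultdict(set)
    for row in rows:
        a_to_b[row[a]].add(row[b])
        b_to_a[row[b]].add(row[a])
    return (all(len(levels) == 1 for levels in a_to_b.values())
            and all(len(levels) == 1 for levels in b_to_a.values()))


if __name__ == "__main__":
    print(is_confounded(DESIGN, "sex", "batch"))  # True: all M in batch 1, all F in batch 2
    print(is_confounded(DESIGN, "drug", "sex"))   # False: both sexes occur in each arm
    print([row["drug"] for row in processing_order(DESIGN)])
```

In this example sex and batch cannot be separated, which is the kind of thing the annotation makes visible to people reusing the data.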

Finally, I would recommend getting in touch with a statistician holding some expertise in experimental design to make sure that we are not missing something / plain wrong.