bigbio / proteomics-sample-metadata

The Proteomics sample metadata: Standard for experimental design annotation in proteomics datasets
GNU General Public License v2.0

Use of [phenotype] and possible alternatives #225

Closed: levitsky closed this issue 4 years ago

levitsky commented 4 years ago

I see quite a range of uses for [phenotype] in the annotations that do not really seem to fit the definition. The definition is:

The observable form taken by some character (or group of characters) in an individual or an organism, excluding pathology and disease. The detectable outward manifestations of a specific genotype.

The column is used to characterize both the disease (stage, metastases, etc.) and treatments. The existing annotations almost never use the [phenotype] column to describe an actual phenotype.

It has been briefly discussed in #92, where @anjaf said:

"characteristics[phenotype]: sample treated with drug A"

As a curation side-note: "phenotype" is not a good term to describe that attribute. Better annotation is "compound: drug A" and "compound: none" (for the control). The second reason why this is better is that "drug A" can be mapped to the ontology term for drug A, while "sample treated with drug A" is not an ontology term.

After that, the experimental design page was updated, and the example looks somewhat like this:

| source name | characteristics[compound] | characteristics[phenotype] | factor value[phenotype] |
|---|---|---|---|
| sample_treat | necrotic tissue | compound: drug A | necrotic tissue |
| sample_control | normal | compound: none | normal |

My questions are these:

  1. Is this really what @anjaf meant to suggest? Compounds are still under [phenotype] here.
  2. Are the columns accidentally switched in this example, by any chance? Some annotated datasets follow it literally and have drugs in the phenotype column.
  3. How to accommodate other data? What would be the right terms for: compound; disease stage; response to treatment; tumor size; any other terms describing the pathology, or treatment, or their relation? The standard says that the column names SHOULD be terms from EFO, but EFO doesn't even have compound. Here is an example of metadata available for one of the projects on PRIDE:

| Age at surgery | Initial Tumor Primary/Recurrence | WHO Grade | Tumor Location | Post Surgery Progression | Time to Reccurence or Last Follow up | Max Tumor Size | History of Radiation |
|---|---|---|---|---|---|---|---|
| 56 | Primary | 2 | Convexity | Progression Free | 8.2 | 6.4 | No |

How do I fit all of this in SDRF?

ypriverol commented 4 years ago

> 1. Is this really what @anjaf meant to suggest? Compounds are still under [phenotype] here.
> 2. Are the columns accidentally switched in this example, by any chance? Some annotated datasets follow it literally and have drugs in the phenotype column.

I think in this case the columns were accidentally switched. I will fix it.
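
For reference, a minimal sketch of what switching the columns back could look like, using only the Python standard library. It takes the example quoted above as tab-separated text and swaps the values of the two characteristics columns. Dropping the "compound: " prefix so the value is just "drug A" or "none" is an assumption based on the suggestion from #92, not something the specification page confirms; the function name `swap_back` is made up for illustration.

```python
import csv
import io

# The example as quoted in this thread, written as tab-separated text.
ORIGINAL = (
    "source name\tcharacteristics[compound]\tcharacteristics[phenotype]\tfactor value[phenotype]\n"
    "sample_treat\tnecrotic tissue\tcompound: drug A\tnecrotic tissue\n"
    "sample_control\tnormal\tcompound: none\tnormal\n"
)

COMPOUND = "characteristics[compound]"
PHENOTYPE = "characteristics[phenotype]"


def swap_back(tsv_text):
    """Swap the compound and phenotype values in every row (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = reader.fieldnames
    rows = []
    for row in reader:
        row[COMPOUND], row[PHENOTYPE] = row[PHENOTYPE], row[COMPOUND]
        # Assumption: keep only the value after the "compound: " prefix.
        prefix = "compound: "
        if row[COMPOUND].startswith(prefix):
            row[COMPOUND] = row[COMPOUND][len(prefix):]
        rows.append(row)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


fixed = swap_back(ORIGINAL)
print(fixed)
```

The factor value column is left untouched here, since the thread does not say whether it should change.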

> 3. How to accommodate other data? What would be the right terms for: compound; disease stage; response to treatment; tumor size; any other terms describing the pathology, or treatment, or their relation? The standard says that the column names SHOULD be terms from EFO, but EFO doesn't even have compound. Here is an example of metadata available for one of the projects on PRIDE:
>
> | Age at surgery | Initial Tumor Primary/Recurrence | WHO Grade | Tumor Location | Post Surgery Progression | Time to Reccurence or Last Follow up | Max Tumor Size | History of Radiation |
> |---|---|---|---|---|---|---|---|
> | 56 | Primary | 2 | Convexity | Progression Free | 8.2 | 6.4 | No |
>
> How do I fit all of this in SDRF?

@anjaf, can you help us with this example?

anjaf commented 4 years ago

Regarding "compound" not being in EFO: we actually map it to the term "chemical entity" (CHEBI_24431). EFO has gone through a few rounds of changes in the past, mostly dropping terms and replacing them with terms imported from other ontologies. To stay consistent with previous curation, we usually kept referring to the category by its original term. Another example is "cell line", which is now in EFO under "cultured cell" (CL_0000010). As a result, some terms are hard to find in EFO, and others are not in EFO at all.
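
To make the mapping concrete, here is a small sketch of how a column attribute could be resolved to the terms mentioned above. The two entries are the ones named in this comment; the table name `LEGACY_ATTRIBUTE_TERMS` and the function `resolve_attribute` are hypothetical and not part of any SDRF tool.

```python
import re

# Legacy attribute names and the terms they map to, as stated in this thread.
LEGACY_ATTRIBUTE_TERMS = {
    "compound": ("chemical entity", "CHEBI_24431"),
    "cell line": ("cultured cell", "CL_0000010"),
}

# Matches column headers such as "characteristics[compound]".
_COLUMN_RE = re.compile(r"^(characteristics|factor value)\[(.+)\]$")


def resolve_attribute(column_name):
    """Return (label, accession) for a known legacy attribute, else None."""
    match = _COLUMN_RE.match(column_name.strip())
    if match is None:
        return None
    attribute = match.group(2).strip().lower()
    return LEGACY_ATTRIBUTE_TERMS.get(attribute)


print(resolve_attribute("characteristics[compound]"))
print(resolve_attribute("factor value[cell line]"))
```

Attributes that are not in the table, such as phenotype, return `None` and would still need a direct ontology lookup.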

For the medical terms, I think it is tricky to find an ontology term for each and every category because there are so many different types of measurements. I don't know of a good ontology that has terms for all of them. (NCIt probably comes closest for describing tumour samples, but we don't use NCIt terms in Expression Atlas.) There is a bit you can do with EFO but certainly not the same level of detail here: