ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Biosample tissue site #707

Open david4096 opened 8 years ago

david4096 commented 8 years ago

Biosamples normally come from a tissue sample that can be named. This should be represented as a named field of the Biosample message, e.g. tissue_type. I suppose one might use something like the Foundational Model of Anatomy to restrict the vocabulary. @mbaudis @nishill

mbaudis commented 8 years ago

I am for solving this through an OntologyTerm object, as part of a general BioFeatures collection object which contains information private to the BioSample. Others would be histology, anatomic location, associated disease (of the ->tissue<- as a somatic phenotypic variation...). We seem to address this in discussions happening at the moment in the MTT, and on documents such as BioFeature and use cases documents.

mcourtot commented 8 years ago

+1, adding suggestion to use Uberon for anatomical parts (it does map to FMA amongst others)

mdmiller53 commented 8 years ago

here's a snippet of how TCGA classifies tissues. the total number of combinations is 188:

mysql> select sampletype, tissue_anatomic_site, tissue_anatomic_site_description, count(*) ct 
from metadata_biospecimen 
where tissue_anatomic_site in ('Head/Neck', 'Eye', 'Colon') 
group by sampletype, tissue_anatomic_site, tissue_anatomic_site_description;
+---------------------+----------------------+----------------------------------+-----+
| sampletype          | tissue_anatomic_site | tissue_anatomic_site_description | ct  |
+---------------------+----------------------+----------------------------------+-----+
| Primary solid Tumor | Colon                | NULL                             |   2 |
| Primary solid Tumor | Colon                | Ascending Colon                  |  98 |
| Primary solid Tumor | Colon                | Cecum                            | 110 |
| Primary solid Tumor | Colon                | Descending Colon                 |  22 |
| Primary solid Tumor | Colon                | Hepatic Flexure                  |  24 |
| Primary solid Tumor | Colon                | Splenic Flexure                  |   7 |
| Primary solid Tumor | Colon                | Transverse Colon                 |  38 |
| Primary solid Tumor | Eye                  | Choroid                          |  64 |
| Primary solid Tumor | Eye                  | Ciliary body                     |  15 |
| Primary solid Tumor | Eye                  | Iris                             |   1 |
| Primary solid Tumor | Head/Neck            | NULL                             |   4 |
| Primary solid Tumor | Head/Neck            | Alveolar                         |  17 |
| Primary solid Tumor | Head/Neck            | Base of the Tongue               |  27 |
| Primary solid Tumor | Head/Neck            | Buccal Mucosa                    |  23 |
| Primary solid Tumor | Head/Neck            | Floor of Mouth                   |  62 |
| Primary solid Tumor | Head/Neck            | Hypopharynx                      |  10 |
| Primary solid Tumor | Head/Neck            | Larynx, NOS                      | 116 |
| Primary solid Tumor | Head/Neck            | Lip                              |   3 |
| Primary solid Tumor | Head/Neck            | Oral Cavity, NOS                 |  74 |
| Primary solid Tumor | Head/Neck            | Oral Tongue                      |  83 |
| Primary solid Tumor | Head/Neck            | Oropharynx                       |   9 |
| Primary solid Tumor | Head/Neck            | Palate, Hard                     |   7 |
| Primary solid Tumor | Head/Neck            | Tongue, NOS                      |  50 |
| Primary solid Tumor | Head/Neck            | Tonsil                           |  45 |
+---------------------+----------------------+----------------------------------+-----+
24 rows in set (0.02 sec)

david4096 commented 8 years ago

Thanks! This thread was started specifically to address how to model the TCGA data! Is there a minimum change to the biosample message that would allow this? Perhaps we might add tissue type as an ontology term from Uberon?

mbaudis commented 8 years ago

There seems to be consensus among the MTT to

This would then address tissue type etc, since this could just be an Uberon OT.

There is discussion of this in the BioFeature document - see esp. page 2. In the case of BioSample, we can go a less nested route than the abstraction discussed there since the time attributes & description are already there and we probably can live with a single set of terms. So: Implement the bullet points above = practical solution.

(We have drafted a large list of use cases to identify what else is needed esp. for BioSample / Individual; this will soon be addressed piecemeal...).

mbaudis commented 8 years ago

@david4096 Please have a look at https://github.com/ga4gh/schemas/pull/710 (sorry for the commit mess...).

david4096 commented 8 years ago

It works for me, I think if we go with a tag-bag approach a logical OR would satisfy most use cases.
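
To make the "tag-bag with logical OR" idea concrete, here is a minimal sketch in Python. It assumes a Biosample simply carries an unordered list of ontology terms and that a query matches when any one of the requested term IDs is present. The class names, the `characteristics` field, and `matches_any` are hypothetical and not taken from the schema; the term IDs are only illustrative.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class OntologyTerm:
    term_id: str      # CURIE-style identifier, e.g. "UBERON:0000310"
    term_label: str   # human-readable label


@dataclass
class Biosample:
    id: str
    # The "tag bag": an unordered collection of terms with no named roles.
    characteristics: list[OntologyTerm] = field(default_factory=list)


def matches_any(sample: Biosample, wanted_ids: set[str]) -> bool:
    """Logical OR: true if the sample carries at least one wanted term."""
    return any(term.term_id in wanted_ids for term in sample.characteristics)


samples = [
    Biosample("s1", [OntologyTerm("UBERON:0000310", "breast")]),
    Biosample("s2", [OntologyTerm("EFO:0000305", "breast carcinoma")]),
    Biosample("s3", []),
]
query = {"UBERON:0000310", "EFO:0000305"}
hits = [s.id for s in samples if matches_any(s, query)]
```

With OR semantics a sample tagged only by tissue and a sample tagged only by disease both match the same query, which is what makes a flat tag bag workable without named roles.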

mbaudis commented 8 years ago

Thanks @david4096. Other comments/votes, please; @mcourtot, @mdmiller53, @sarahhunt ...?

mdmiller53 commented 8 years ago

yes, the changes in #710 look fine to me. still minorly concerned whether OntologyTerm needs to be richer (i.e. allow qualifying OntologyTerms and OntologyValues) but that's a different discussion and will undoubtedly be driven by use cases

mbaudis commented 8 years ago

@mdmiller53 Thanks; and regarding the qualifiers, the point is certainly well taken. But the options are either in the OT or in the wrapper object. The structure there is not immediately obvious; the wrapper seems more sane (e.g. you qualify your diagnostic call, and the OTs are only abstractions of this), but this wouldn't fit very well here, where the wrapper is basically a first-level object and the characteristics are heterogeneous. Still, the best solution IMO would be something akin to the "Biofeatures" (any name allowed) list containing "Biofeature" wrappers etc.; see the document. But this will be a separate issue.

mbaudis commented 8 years ago

So pls. vote/comment on https://github.com/ga4gh/schemas/issues/711 now.

kozbo commented 7 years ago

#711 got closed in favor of #725 so pls. vote/comment on that one now :-)

mdmiller53 commented 7 years ago

so minorly confused here, are we considering this issue or #725 or both for voting?

kozbo commented 7 years ago

@mdmiller53 comment from #711 : "Following the discussions at Vancouver: Closing this in favour #725." #725 implements this issue.

mbaudis commented 7 years ago

@mdmiller53 I'll close this. https://github.com/ga4gh/schemas/pull/725 (which was merged into metadata-integration branch) defines this as being covered through better definition of OntologyTerms (termId + termLabel, and URI provided through a service), and these being represented through Biocharacteristic-type phenotypes and diseases.
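
As a rough illustration of "termId + termLabel, and URI provided through a service", the sketch below models an ontology term as an ID/label pair and resolves the URI separately. The `URI_PREFIXES` table and `resolve_uri` function are hypothetical stand-ins for whatever service would do the resolution; they are not part of PR #725. The prefixes follow the usual OBO PURL pattern.

```python
from dataclasses import dataclass

# Hypothetical prefix table standing in for the URI-resolving service.
URI_PREFIXES = {
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
    "CLO": "http://purl.obolibrary.org/obo/CLO_",
}


@dataclass(frozen=True)
class OntologyTerm:
    term_id: str     # CURIE such as "UBERON:0000310"
    term_label: str  # human-readable label such as "breast"


def resolve_uri(term: OntologyTerm) -> str:
    """Expand a term's CURIE into a full URI, or raise KeyError if unknown."""
    prefix, _, local_id = term.term_id.partition(":")
    if not local_id or prefix not in URI_PREFIXES:
        raise KeyError(f"no URI mapping known for {term.term_id!r}")
    return URI_PREFIXES[prefix] + local_id


breast = OntologyTerm("UBERON:0000310", "breast")
breast_uri = resolve_uri(breast)
```

Keeping the URI out of the message keeps each term small, at the cost of needing the resolver to be available wherever a full URI is required.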

kozbo commented 7 years ago

Reopening this as we have no way to track that this fix isn't merged to master yet without this issue. So will close once the metadata-integration branch is merged into master.

david4096 commented 7 years ago

@mbaudis with the characteristics we have improved the granularity of describing phenotypes, however, following on our conversations Monday, I believe we need to add a tissue_site that allows one to state using an ontology term where a sample was taken from.

When both a tumor and healthy sample have been derived from an individual, there should be a clear field that states where the sample was taken from in either case.

mbaudis commented 7 years ago

@david4096 My take here:

In principle, one could go with the way we discussed - you can have everything in a list of Biocharacteristics. However, it may be better to have a similar Provenance collection, which could contain multiple Characteristic objects describing different aspects of the sample's origin.

mcourtot commented 7 years ago

Maybe something like sample_source? I think having a collection would be helpful, as we could have cell lines derived from specific tissues for example, and we'd want to capture both in a characteristic object, e.g.

sample_source: [
    {
        description: "breast carcinoma cell line",
        repeated OntologyTerm ontologyTerms: [
            {
                term_id: "CLO:0009468",
                term_label: "UACC-893 cell",
            },
            {
                term_id: "EFO:0000305",
                term_label: "breast carcinoma",
            },
            {
                term_id: "UBERON:0000310",
                term_label: "breast",
            },
        ],
    }
]

I'm not sure if we'd want to name nested attributes, for example in this case "cell_line", "disease", "organism_part". Not naming them makes for an easier schema, but slightly less precise query.
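
The trade-off between unnamed and named nested attributes can be sketched as follows, using the terms from the example above. This is only an illustration in Python: the dictionary layouts, the `ontology_terms` key, and both lookup functions are assumptions, and the role names are the ones floated in the comment ("cell_line", "disease", "organism_part").

```python
# Unnamed: one flat list of terms, as in the example above.
unnamed_source = {
    "description": "breast carcinoma cell line",
    "ontology_terms": [
        {"term_id": "CLO:0009468", "term_label": "UACC-893 cell"},
        {"term_id": "EFO:0000305", "term_label": "breast carcinoma"},
        {"term_id": "UBERON:0000310", "term_label": "breast"},
    ],
}

# Named: each term is stored under the role it plays (hypothetical keys).
named_source = {
    "description": "breast carcinoma cell line",
    "cell_line": {"term_id": "CLO:0009468", "term_label": "UACC-893 cell"},
    "disease": {"term_id": "EFO:0000305", "term_label": "breast carcinoma"},
    "organism_part": {"term_id": "UBERON:0000310", "term_label": "breast"},
}


def organism_part_unnamed(source):
    """Infer the anatomical term from its ontology prefix; None if absent."""
    for term in source["ontology_terms"]:
        if term["term_id"].startswith("UBERON:"):
            return term
    return None


def organism_part_named(source):
    """Read the anatomical term directly from its named slot; None if absent."""
    return source.get("organism_part")


from_unnamed = organism_part_unnamed(unnamed_source)
from_named = organism_part_named(named_source)
```

Both lookups find the same term here, but the unnamed version has to guess the role from the ontology prefix, which becomes ambiguous as soon as two terms from the same ontology play different roles. That is the "slightly less precise query" cost of the simpler schema.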

mbaudis commented 7 years ago

@david4096 @mcourtot Yes, and in fact in arrayMap we use "SAMPLESOURCE" for similar purposes (cell line, metastasis::liver ...).

So if somebody wants to craft a PR for this ...

mcourtot commented 7 years ago

PR created - I think changes required are fairly minimal as we already have the Biocharacteristic objects, but please review!

mdmiller53 commented 7 years ago

for consistency, it might be good to have a companion best practice documentation for the different types. for instance 'for cell lines, these are the recommended ontology fields, for metagenomics ..., for human ..., etc.', perhaps based on minimum information standards where appropriate

mbaudis commented 7 years ago

@mdmiller53 Yes, exactly. There are many things where documentation will be a very important element of efficient use.