INCATools / biosample-analysis

analysis of biosamples in INSDC

Map biosample table fields to MIxS packages #53

Closed realmarcin closed 3 years ago

realmarcin commented 3 years ago

The specific task is to create a mapping file that lists, for each field, the MIxS packages containing it, if any. Since the biosample table already conforms to the MIxS schema, most field names should match MIxS field names, leaving a small unmappable remainder.
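
A minimal sketch of what such a mapping file could look like, assuming the MIxS packages are available as a package-to-fields dictionary. The package contents, the function names, and the TSV layout below are illustrative assumptions, not taken from the repository; only `entrez_value` comes from this thread.

```python
import csv
import io

# Hypothetical package -> field lists; real values would come from the MIxS schema.
MIXS_PACKAGES = {
    "soil": ["collection_date", "lat_lon", "env_broad_scale", "depth"],
    "water": ["collection_date", "lat_lon", "env_broad_scale", "depth"],
    "host-associated": ["collection_date", "lat_lon", "host_taxid"],
}


def map_fields_to_packages(fields, packages):
    """Return {field: sorted list of packages containing it}; empty list if none."""
    mapping = {field: [] for field in fields}
    for package, package_fields in packages.items():
        for field in package_fields:
            if field in mapping:
                mapping[field].append(package)
    return {field: sorted(pkgs) for field, pkgs in mapping.items()}


def write_mapping_tsv(mapping, handle):
    """Write one row per field; packages are joined with '|' (an assumed layout)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["field", "mixs_packages"])
    for field in sorted(mapping):
        writer.writerow([field, "|".join(mapping[field])])


# Example biosample table columns; entrez_value is the unmappable case.
biosample_fields = ["collection_date", "host_taxid", "entrez_value"]
mapping = map_fields_to_packages(biosample_fields, MIXS_PACKAGES)

out = io.StringIO()
write_mapping_tsv(mapping, out)
print(out.getvalue())
```

Fields with an empty package list are the unmappable remainder, so the same file covers both the mapping and the leftovers.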

cmungall commented 3 years ago

yes, the table already conforms, so it should be straightforward to dump as a list the small handful of fields (e.g. entrez_value) that are not in mixs. What is this for?

hrshdhgd commented 3 years ago

@realmarcin, I have uploaded files showing the overlap and the differences between MIxS terms and the data columns (biosample fields).
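
For readers who want to reproduce this kind of comparison, here is a rough sketch using plain set operations. The function name and the sample lists are assumptions for illustration; this is not the script that produced the uploaded files.

```python
def compare_fields(biosample_columns, mixs_terms):
    """Split names into shared, biosample-only, and MIxS-only groups."""
    cols, terms = set(biosample_columns), set(mixs_terms)
    return {
        "common": sorted(cols & terms),
        "biosample_only": sorted(cols - terms),
        "mixs_only": sorted(terms - cols),
    }


# Illustrative inputs only; real lists would come from the table header
# and the MIxS term list.
columns = ["collection_date", "lat_lon", "entrez_value"]
terms = ["collection_date", "lat_lon", "depth"]
report = compare_fields(columns, terms)

for group, names in report.items():
    print(group, names)
```

The comparison is exact-match and case-sensitive, so names that differ only in spelling or case would land in the difference groups.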

wdduncan commented 3 years ago

Sorry I wasn't clear on the call. The elements that I extracted from the biosample_set.xml file were ones that had a harmonized_name property (see example below). It was my understanding that these mapped to MIxS terms.

In addition to these, I extracted some other top level elements, such as:

Details are in util/harmonized-eav.pl.

<Attributes>
    <Attribute attribute_name="finishing strategy (depth of coverage)">Level 3: Improved-High-Quality Draft11.6x;20</Attribute>
    <Attribute attribute_name="collection date" harmonized_name="collection_date" display_name="collection date">not determined</Attribute>
    <Attribute attribute_name="estimated_size" harmonized_name="estimated_size" display_name="estimated size">2550000</Attribute>
    <Attribute attribute_name="sop">http://hmpdacc.org/doc/CommonGeneAnnotation_SOP.pdf</Attribute>
    <Attribute attribute_name="project_type">Reference Genome</Attribute>
    <Attribute attribute_name="host" harmonized_name="host" display_name="host">Homo sapiens</Attribute>
    <Attribute attribute_name="lat_lon" harmonized_name="lat_lon" display_name="latitude and longitude">not determined</Attribute>
    <Attribute attribute_name="biome" harmonized_name="env_broad_scale" display_name="broad-scale environmental context">terrestrial biome [ENVO:00000446]</Attribute>
    <Attribute attribute_name="misc_param: HMP body site">not determined</Attribute>
    <Attribute attribute_name="nucleic acid extraction">not determined</Attribute>
    <Attribute attribute_name="feature" harmonized_name="env_local_scale" display_name="local-scale environmental context">human-associated habitat [ENVO:00009003]</Attribute>
    <Attribute attribute_name="investigation_type" harmonized_name="investigation_type" display_name="investigation type">missing</Attribute>
    <Attribute attribute_name="host taxid" harmonized_name="host_taxid" display_name="host taxonomy ID">9606</Attribute>
    <Attribute attribute_name="project_name" harmonized_name="project_name" display_name="project name">Alistipes putredinis DSM 17216</Attribute>
    <Attribute attribute_name="assembly">PCAP</Attribute>
    <Attribute attribute_name="geo_loc_name" harmonized_name="geo_loc_name" display_name="geographic location">not determined</Attribute>
    <Attribute attribute_name="source_mat_id" harmonized_name="source_material_id" display_name="source material identifiers">DSM 17216, CCUG 45780, CIP 104286, ATCC 29800, Carlier 10203, VPI 3293</Attribute>
    <Attribute attribute_name="material" harmonized_name="env_medium" display_name="environmental medium">biological product [ENVO:02000043]</Attribute>
    <Attribute attribute_name="ref_biomaterial" harmonized_name="ref_biomaterial" display_name="reference for biomaterial">not determined</Attribute>
    <Attribute attribute_name="misc_param: HMP supersite">gastrointestinal_tract</Attribute>
    <Attribute attribute_name="num_replicons" harmonized_name="num_replicons" display_name="number of replicons">not determined</Attribute>
    <Attribute attribute_name="sequencing method">454-GS20, Sanger</Attribute>
    <Attribute attribute_name="isol_growth_condt" harmonized_name="isol_growth_condt" display_name="isolation and growth condition">not determined</Attribute>
    <Attribute attribute_name="env_package" harmonized_name="env_package" display_name="environmental package">missing</Attribute>
    <Attribute attribute_name="strain" harmonized_name="strain" display_name="strain">DSM 17216</Attribute>
    <Attribute attribute_name="isolation-source" harmonized_name="isolation_source" display_name="isolation source">missing</Attribute>
    <Attribute attribute_name="type-material">type strain of Alistipes putredinis</Attribute>
  </Attributes>
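
The extraction rule described above (keep only `Attribute` elements that carry a `harmonized_name`) can be sketched in Python with the standard library. This is an illustration under assumptions, not a port of util/harmonized-eav.pl; the function name is hypothetical and the sample is a shortened copy of the excerpt above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <Attributes> excerpt: one attribute without
# harmonized_name and two with it.
SAMPLE = """<Attributes>
  <Attribute attribute_name="sop">http://hmpdacc.org/doc/CommonGeneAnnotation_SOP.pdf</Attribute>
  <Attribute attribute_name="host" harmonized_name="host" display_name="host">Homo sapiens</Attribute>
  <Attribute attribute_name="biome" harmonized_name="env_broad_scale" display_name="broad-scale environmental context">terrestrial biome [ENVO:00000446]</Attribute>
</Attributes>"""


def harmonized_attributes(xml_text):
    """Return {harmonized_name: value} for attributes that have a harmonized_name."""
    root = ET.fromstring(xml_text)
    result = {}
    for attribute in root.iter("Attribute"):
        name = attribute.get("harmonized_name")
        if name is not None:
            # Element text can be None for empty elements.
            result[name] = (attribute.text or "").strip()
    return result


harmonized = harmonized_attributes(SAMPLE)
print(harmonized)
```

Note that the key is `harmonized_name`, not `attribute_name`, so `biome` is reported as `env_broad_scale`. For the full biosample_set.xml, a streaming parser such as `ET.iterparse` would likely be needed instead of loading the whole file.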
realmarcin commented 3 years ago

> yes, the table already conforms, so it should be straightforward to dump as a list the small handful of fields (e.g. entrez_value) that are not in mixs. What is this for?

@cmungall This is for:

cmungall commented 3 years ago

1 and 3 should be data driven, but this should all fall out of the generic annotator project

hrshdhgd commented 3 years ago

#54 addresses this ticket.