INCATools / biosample-analysis

analysis of biosamples in INSDC

normalize ENVO terms #25

Open cmungall opened 4 years ago

cmungall commented 4 years ago

These are mostly strings. Some do not correspond to a class label, e.g. 'tundra'

There should be a repair step that gets the IDs. I suggest a denormalized/flattened schema where we append _id onto the field name, e.g. env_local_scale_id=ENVO:nnnn. In the NMDC/MIxS schema this is a compound object
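A minimal sketch of the flattened layout suggested here, assuming rows are plain dicts and that values which already carry an ID follow the `label [ENVO:nnnn]` form. The function name `add_id_columns` and the regular expression are illustrative only; nothing in this thread says the repo has such a helper.

```python
import re

# Matches the "termLabel [termID]" form, e.g. "terrestrial biome [ENVO:00000446]".
TERM_RE = re.compile(r"^\s*(?P<label>.+?)\s*\[(?P<id>ENVO:\d+)\]\s*$")

ENV_FIELDS = ("env_broad_scale", "env_medium", "env_local_scale")


def add_id_columns(row):
    """Return a copy of row with a <field>_id key for each environment field.

    The _id value is None when the original value carries no ENVO ID.
    """
    out = dict(row)
    for field in ENV_FIELDS:
        match = TERM_RE.match(row.get(field) or "")
        out[field + "_id"] = match.group("id") if match else None
    return out


row = {
    "env_broad_scale": "terrestrial biome [ENVO:00000446]",
    "env_medium": "saline water",  # plain string, no ID yet
    "env_local_scale": "human-associated habitat [ENVO:00009003]",
}
flat = add_id_columns(row)
print(flat["env_broad_scale_id"], flat["env_medium_id"])
```

Keeping the original column untouched and adding the `_id` column next to it means the submitter's string is still available when a match later turns out to be wrong.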

turbomam commented 3 years ago

Is this a matter of normalizing ENVO terms to something (more authoritative? better structured? better coverage?)

Or is it a matter of normalizing from the NMDC/MIxS schema to ENVO?

Or from user-submitted values (intended for NMDC/MIxS) to ENVO?

cmungall commented 3 years ago

name->ID

cmungall commented 3 years ago

let's look at the input table

$ mlr --ocsv --itsvlite cut -f accession,package_name,env_broad_scale,env_medium,env_local_scale then filter '$env_broad_scale != ""' downloads/harmonized-table.tsv
accession,env_broad_scale,env_medium,package_name,env_local_scale
SAMN00000002,terrestrial biome [ENVO:00000446],biological product [ENVO:02000043],MIGS: cultured bacteria/archaea; version 5.0,human-associated habitat [ENVO:00009003]
SAMN00000003,terrestrial biome [ENVO:00000446],biological product [ENVO:02000043],MIGS: cultured bacteria/archaea; version 5.0,human-associated habitat [ENVO:00009003]
SAMN00000004,terrestrial biome [ENVO:00000446],biological product [ENVO:02000043],MIGS: cultured bacteria/archaea; version 5.0,human-associated habitat [ENVO:00009003]

^^ these are ok. This also conforms to our schema

  env_broad_scale:
    is_a: environment field
    aliases:
    - broad-scale environmental context
    description: "In this field, report which major environmental system your sample\
      \ or specimen came from. The systems identified should have a coarse spatial\
      \ grain, to provide the general environmental context of where the sampling\
      \ was done (e.g. were you in the desert or a rainforest?). We recommend using\
      \ subclasses of ENVO\u2019s biome class: http://purl.obolibrary.org/obo/ENVO_00000428.\
      \ Format (one term): termLabel [termID], Format (multiple terms): termLabel\
      \ [termID]|termLabel [termID]|termLabel [termID]. Example: Annotating a water\
      \ sample from the photic zone in middle of the Atlantic Ocean, consider: oceanic\
      \ epipelagic zone biome [ENVO:01000033]. Example: Annotating a sample from the\
      \ Amazon rainforest consider: tropical moist broadleaf forest biome [ENVO:01000228].\
      \ If needed, request new terms on the ENVO tracker, identified here: http://www.obofoundry.org/ontology/envo.html"
    pattern: '{termLabel} {[termID]}'
    examples:
    - value: forest biome [ENVO:01000174]
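The format described in the schema (one term, or several joined with `|`) could be checked roughly as follows. This is a sketch under assumptions: `parse_terms` is a made-up name, and the regular expression is my reading of `termLabel [termID]`, not something the schema tooling is known to use.

```python
import re

# One "termLabel [termID]" unit; the label may not contain brackets or '|'.
TERM_RE = re.compile(r"^(?P<label>[^\[\]|]+?)\s*\[(?P<id>[A-Za-z]+:\d+)\]$")


def parse_terms(value):
    """Split a value on '|' and return [(label, term_id), ...].

    Returns None if any part does not follow 'termLabel [termID]'.
    """
    terms = []
    for part in value.split("|"):
        match = TERM_RE.match(part.strip())
        if match is None:
            return None
        terms.append((match.group("label").strip(), match.group("id")))
    return terms


single = parse_terms("forest biome [ENVO:01000174]")
multi = parse_terms(
    "oceanic epipelagic zone biome [ENVO:01000033]"
    "|tropical moist broadleaf forest biome [ENVO:01000228]"
)
bare = parse_terms("aquatic")  # a plain string fails the check
print(single, multi, bare)
```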

but look at others

accession,env_broad_scale,env_medium,package_name,env_local_scale
...
SAMN00001340,aquatic,saline water,"MIMS: metagenome/environmental, water; version 5.0",Pacific Ocean
SAMN00001362,aquatic,saline water,"MIMS: metagenome/environmental, water; version 5.0",Pacific Ocean

^^ the submitter gave strings not IDs. We want to fix

replace aquatic with ENVO ID for aquatic biome

replace saline water with ENVO ID for saline water

I think "pacific ocean" is just the wrong string for env_local_scale

for ones that can't be matched, just report and move on

replace each string with mixs syntax

"LABEL [ENVO:nnnn]"

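The repair step outlined in the comment above (name->ID, rewrite in MIxS syntax, report what can't be matched) might look something like this. It is only a sketch: the lookup is built from a hand-written list of labels and IDs quoted earlier in this thread, whereas a real run would presumably load labels from ENVO itself, and the names `build_lookup` and `repair_values` are hypothetical.

```python
from collections import Counter


def build_lookup(terms):
    """Map lower-cased label -> (canonical label, term ID)."""
    return {label.lower(): (label, term_id) for label, term_id in terms}


def repair_values(values, lookup):
    """Rewrite matched strings as 'LABEL [ID]'; count the ones that don't match."""
    repaired, unmatched = [], Counter()
    for value in values:
        hit = lookup.get(value.strip().lower())
        if hit is None:
            unmatched[value] += 1
            repaired.append(value)  # leave as-is, just report it
        else:
            label, term_id = hit
            repaired.append(f"{label} [{term_id}]")
    return repaired, unmatched


# Labels and IDs copied from examples in this thread.
lookup = build_lookup([
    ("terrestrial biome", "ENVO:00000446"),
    ("forest biome", "ENVO:01000174"),
])
repaired, unmatched = repair_values(
    ["Forest biome", "Pacific Ocean", "terrestrial biome "], lookup
)
print(repaired)
print(dict(unmatched))
```

Exact label matching like this would not catch 'aquatic' for the aquatic biome class; that would need a synonym table or fuzzier matching, which is left out here.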
turbomam commented 3 years ago

@hrshdhgd have you done much with this yet? @wdduncan helped me find relevant input data and utilities and I have been reading about MIxS in general. I think I could do the following now: map unique values from env_broad_scale, env_medium and env_local_scale to the "LABEL [ENVO:nnnn]" notation, as TSV output.
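Collecting the unique values first keeps the mapping job small. A rough sketch of that step, assuming a tab-separated input with a header row and no quoting; the output columns (`field`, `value`, `count`, `mapped`) are made up for illustration, with `mapped` left blank for whatever name-to-ID step fills it in.

```python
import csv
import io
from collections import Counter

ENV_FIELDS = ("env_broad_scale", "env_medium", "env_local_scale")


def unique_values(tsv_text):
    """Count distinct non-empty (field, value) pairs in the environment columns."""
    counts = Counter()
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        for field in ENV_FIELDS:
            value = (row.get(field) or "").strip()
            if value:
                counts[(field, value)] += 1
    return counts


def write_mapping_tsv(counts, out):
    """Write one line per distinct value; 'mapped' is a blank placeholder."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["field", "value", "count", "mapped"])
    for (field, value), n in sorted(counts.items()):
        writer.writerow([field, value, n, ""])


# Two rows taken from the table excerpt earlier in this thread.
sample = (
    "accession\tenv_broad_scale\tenv_medium\tenv_local_scale\n"
    "SAMN00001340\taquatic\tsaline water\tPacific Ocean\n"
    "SAMN00001362\taquatic\tsaline water\tPacific Ocean\n"
)
counts = unique_values(sample)
buffer = io.StringIO()
write_mapping_tsv(counts, buffer)
lines = buffer.getvalue().splitlines()
print(buffer.getvalue())
```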

turbomam commented 3 years ago

Also @cmungall and others, it seems that accession is very frequently blank. I know that it wouldn't make sense to map that, but it makes me a little uncomfortable to see so many blanks in what might be the primary key for this table.

wdduncan commented 3 years ago

@turbomam I am normalizing the package names in ticket #24

Also, the primary key is in id field (e.g., BIOSAMPLE:SAMN00000002).

turbomam commented 3 years ago

Thanks @wdduncan

I'm curious, but this is probably not relevant to this task: What is accession used for vs. id?

wdduncan commented 3 years ago

@turbomam I'm not sure about the meaning of the accession field. It seems to be some kind of identifier that the INCA uses. But there are other ways the identifiers are captured in the biosample_set.xml; e.g., here is an xml blob from that file:

<BioSample submission_date="2008-04-04T08:44:24.950" last_update="2019-06-20T16:11:22.271" publication_date="2008-04-04T00:00:00.000" access="public" id="2" accession="SAMN00000002">
  <Ids>
    <Id db="BioSample" is_primary="1">SAMN00000002</Id>
    <Id db="WUGSC" db_label="Sample name">19655</Id>
    <Id db="SRA">SRS000002</Id>
  </Ids>
....
</BioSample>

In this case the accession has a value.
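For comparing the `accession` attribute against the `<Id>` elements, a small sketch with the standard library parser. The sample below is a trimmed copy of the blob above with the elided part left out; `sample_ids` is a hypothetical helper. For the full biosample_set.xml a streaming parser such as `ET.iterparse` would probably be needed, which this does not show.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<BioSample access="public" id="2" accession="SAMN00000002">
  <Ids>
    <Id db="BioSample" is_primary="1">SAMN00000002</Id>
    <Id db="WUGSC" db_label="Sample name">19655</Id>
    <Id db="SRA">SRS000002</Id>
  </Ids>
</BioSample>"""


def sample_ids(xml_text):
    """Return the accession attribute, the primary Id, and all Ids keyed by db."""
    root = ET.fromstring(xml_text)
    id_elements = list(root.iter("Id"))
    by_db = {el.get("db"): (el.text or "").strip() for el in id_elements}
    primary = next(
        ((el.text or "").strip() for el in id_elements
         if el.get("is_primary") == "1"),
        None,
    )
    return {"accession": root.get("accession"), "primary": primary, "by_db": by_db}


info = sample_ids(SAMPLE)
print(info)
```

In this record the `accession` attribute and the primary `<Id>` agree, so either could serve as the key when the attribute is present.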

hrshdhgd commented 3 years ago

@turbomam, by accession, you mean the column named accession_biosample_id, correct?

@hrshdhgd have you done much with this yet? @wdduncan helped me find relevant input data and utilities and I have been reading about MIxS in general. I think I could do the following now: map unique values from env_broad_scale, env_medium and env_local_scale to the "LABEL [ENVO:nnnn]" notation, as TSV output.

I have not yet. I think that seems like a good plan.

  • I guess interleaving those mappings back into harmonized-table.tsv shouldn't be too hard, but I haven't planned that out yet.

I'm guessing a JOIN using id and accession_biosample_id as keys should do the trick?

  • I haven't planned any quality filters yet either

Something we'll need to discuss further

There is a field named environmental package there. That could be the mapping

hrshdhgd commented 3 years ago

I also just noticed that the accession_biosample_id is just a suffix to the id column if that is of any value.
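If the suffix relationship holds, the join could be as simple as stripping the prefix. A sketch, assuming the mapped values sit in a separate list of rows keyed by `accession_biosample_id` and the main table is keyed by `id`; the thread doesn't pin down which table holds which column, and `env_broad_scale_id` is just an illustrative extra column.

```python
PREFIX = "BIOSAMPLE:"


def accession_from_id(sample_id):
    """Strip the 'BIOSAMPLE:' prefix; return the input unchanged if absent."""
    if sample_id.startswith(PREFIX):
        return sample_id[len(PREFIX):]
    return sample_id


def join_rows(table_rows, mapped_rows):
    """Left-join mapped_rows onto table_rows; table values win on key clashes."""
    by_accession = {m["accession_biosample_id"]: m for m in mapped_rows}
    joined = []
    for row in table_rows:
        extra = by_accession.get(accession_from_id(row["id"]), {})
        joined.append({**extra, **row})
    return joined


table_rows = [
    {"id": "BIOSAMPLE:SAMN00000002", "env_broad_scale": "terrestrial biome [ENVO:00000446]"},
    {"id": "BIOSAMPLE:SAMN00001340", "env_broad_scale": "aquatic"},
]
mapped_rows = [
    {"accession_biosample_id": "SAMN00000002", "env_broad_scale_id": "ENVO:00000446"},
]
joined = join_rows(table_rows, mapped_rows)
print(joined)
```

Rows without a mapping pass through unchanged, so unmatched samples stay visible rather than dropping out of the table.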

hrshdhgd commented 3 years ago

I have been working on runNER some more and I have added the following features:

  1. Added a column 'SENTENCE' to show the relevant sentence in which the tagged token appears for context to users.
  2. Added a suffix '_SYNONYM' for synonym terms tagged by OGER.

Question: @cmungall, when adding the MIxS syntax in the format LABEL [ENVO:nnnn], would you expect the same format for synonyms, e.g. LABEL [ENVO:nnnn_SYNONYM], or not?