Open cmungall opened 4 years ago
Is this a matter of normalizing ENVO terms to something (more authoritative? better structured? better coverage?)
Or is it a matter of normalizing from the NMDC/MIxS schema to ENVO?
Or from user-submitted values (intended for NMDC/MIxS) to ENVO?
name->ID
let's look at the input table
$ mlr --ocsv --itsvlite cut -f accession,package_name,env_broad_scale,env_medium,env_local_scale downloads/harmonized-table.tsv then filter 'env_broad_scale != ""'
accession,env_broad_scale,env_medium,package_name,env_local_scale
SAMN00000002,terrestrial biome [ENVO:00000446],biological product [ENVO:02000043],MIGS: cultured bacteria/archaea; version 5.0,human-associated habitat [ENVO:00009003]
SAMN00000003,terrestrial biome [ENVO:00000446],biological product [ENVO:02000043],MIGS: cultured bacteria/archaea; version 5.0,human-associated habitat [ENVO:00009003]
SAMN00000004,terrestrial biome [ENVO:00000446],biological product [ENVO:02000043],MIGS: cultured bacteria/archaea; version 5.0,human-associated habitat [ENVO:00009003]
^^ these are ok. This also conforms to our schema
env_broad_scale:
  is_a: environment field
  aliases:
    - broad-scale environmental context
  description: "In this field, report which major environmental system your sample\
    \ or specimen came from. The systems identified should have a coarse spatial\
    \ grain, to provide the general environmental context of where the sampling\
    \ was done (e.g. were you in the desert or a rainforest?). We recommend using\
    \ subclasses of ENVO\u2019s biome class: http://purl.obolibrary.org/obo/ENVO_00000428.\
    \ Format (one term): termLabel [termID], Format (multiple terms): termLabel\
    \ [termID]|termLabel [termID]|termLabel [termID]. Example: Annotating a water\
    \ sample from the photic zone in middle of the Atlantic Ocean, consider: oceanic\
    \ epipelagic zone biome [ENVO:01000033]. Example: Annotating a sample from the\
    \ Amazon rainforest consider: tropical moist broadleaf forest biome [ENVO:01000228].\
    \ If needed, request new terms on the ENVO tracker, identified here: http://www.obofoundry.org/ontology/envo.html"
  pattern: '{termLabel} {[termID]}'
  examples:
    - value: forest biome [ENVO:01000174]
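A quick way to separate conforming values from bare strings is to test each value against the documented format. The following is a minimal sketch, not part of the NMDC/MIxS tooling: the regular expression and the `conforms` helper are assumptions derived from the `termLabel [termID]` description above, and the ID part is assumed to be `ENVO:` followed by digits.

```python
import re

# One term: a label, a space, then a bracketed ENVO ID such as [ENVO:01000174].
# Assumption: IDs are "ENVO:" plus digits; labels contain no brackets or pipes.
TERM_RE = re.compile(r"^(?P<label>[^\[\]|]+?) \[(?P<id>ENVO:\d+)\]$")


def conforms(value: str) -> bool:
    """Return True if every pipe-separated part looks like 'label [ENVO:nnnn]'."""
    parts = value.split("|")
    return all(TERM_RE.match(part.strip()) is not None for part in parts)


print(conforms("forest biome [ENVO:01000174]"))  # True
print(conforms("aquatic"))                       # False
```

Multiple terms joined with `|` are accepted only when every part matches, so a value that mixes one valid term with one bare string is still flagged.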
but look at others
accession,env_broad_scale,env_medium,package_name,env_local_scale
...
SAMN00001340,aquatic,saline water,"MIMS: metagenome/environmental, water; version 5.0",Pacific Ocean
SAMN00001362,aquatic,saline water,"MIMS: metagenome/environmental, water; version 5.0",Pacific Ocean
^^ the submitter gave strings not IDs. We want to fix
replace aquatic with ENVO ID for aquatic biome
replace saline water with ENVO ID for saline water
I think "pacific ocean" is just the wrong string for env_local_scale
for ones that can't be matched, just report and move on
replace each string with mixs syntax
"LABEL [ENVO:nnnn]"
@hrshdhgd have you done much with this yet? @wdduncan helped me find relevant input data and utilities and I have been reading about MIxS in general. I think I could do the following now: map unique values from env_broad_scale, env_medium and env_local_scale to the "LABEL [ENVO:nnnn]" notation, as TSV output.
- https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/ doesn't seem to suggest mapping package_name to ENVO. Does that make sense to you?
Also @cmungall and others, it seems that accession is very frequently blank. I know that it wouldn't make sense to map that, but it makes me a little uncomfortable to see so many blanks in what might be the primary key for this table.
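The mapping step discussed so far (unique strings to "LABEL [ENVO:nnnn]", as TSV, reporting anything unmatched and moving on) could be sketched as below. This is only an illustration: `LABEL_TO_ID`, `normalize`, and `to_tsv` are hypothetical names, and the lookup table holds just the label/ID pairs quoted in this thread. A real run would load labels (and probably synonyms) from ENVO itself.

```python
import csv
import io

# Hypothetical lookup; these label/ID pairs are the ones quoted in this thread.
LABEL_TO_ID = {
    "terrestrial biome": "ENVO:00000446",
    "biological product": "ENVO:02000043",
    "human-associated habitat": "ENVO:00009003",
    "forest biome": "ENVO:01000174",
}


def normalize(values):
    """Map bare strings to 'LABEL [ENVO:nnnn]'; collect the ones that fail."""
    mapped, unmatched = {}, []
    for raw in values:
        key = " ".join(raw.lower().split())  # case-fold and collapse whitespace
        envo_id = LABEL_TO_ID.get(key)
        if envo_id is None:
            unmatched.append(raw)  # just report and move on
        else:
            mapped[raw] = f"{key} [{envo_id}]"
    return mapped, unmatched


def to_tsv(mapped, unmatched):
    """Write one row per unique input value; unmatched rows get an empty cell."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["raw_value", "normalized"])
    for raw, norm in sorted(mapped.items()):
        writer.writerow([raw, norm])
    for raw in sorted(unmatched):
        writer.writerow([raw, ""])
    return buf.getvalue()


mapped, unmatched = normalize(["Terrestrial Biome", "aquatic", "Pacific Ocean"])
print(to_tsv(mapped, unmatched))
```

With this toy table, "Terrestrial Biome" is rewritten and "aquatic" and "Pacific Ocean" land in the unmatched list, because exact label matching does not know that "aquatic" should mean "aquatic biome". That gap is where synonym handling or an NER step would come in.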
@turbomam I am normalizing the package names in ticket #24
Also, the primary key is in the id field (e.g., BIOSAMPLE:SAMN00000002).
Thanks @wdduncan
I'm curious, but this is probably not relevant to this task:
What is accession used for vs. id?
@turbomam I'm not sure about the meaning of the accession field. It seems to be some kind of identifier that the INCA uses. But there are other ways the identifiers are captured in the biosample_set.xml; e.g., here is an xml blob from that file:
<BioSample submission_date="2008-04-04T08:44:24.950" last_update="2019-06-20T16:11:22.271" publication_date="2008-04-04T00:00:00.000" access="public" id="2" accession="SAMN00000002">
<Ids>
<Id db="BioSample" is_primary="1">SAMN00000002</Id>
<Id db="WUGSC" db_label="Sample name">19655</Id>
<Id db="SRA">SRS000002</Id>
</Ids>
....
</BioSample>
In this case the accession has a value.
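To check how often the accession attribute and the primary Id agree, the identifiers can be pulled out with the standard library. This is a sketch under assumptions: `identifiers` is a hypothetical helper, and `SNIPPET` is a trimmed copy of the blob above (the elided `....` content is left out, and some date attributes are dropped for brevity).

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the BioSample element quoted above.
SNIPPET = """
<BioSample access="public" id="2" accession="SAMN00000002">
  <Ids>
    <Id db="BioSample" is_primary="1">SAMN00000002</Id>
    <Id db="WUGSC" db_label="Sample name">19655</Id>
    <Id db="SRA">SRS000002</Id>
  </Ids>
</BioSample>
"""


def identifiers(biosample):
    """Return (accession attribute, primary Id text, {db: Id text})."""
    accession = biosample.get("accession")  # None when the attribute is absent
    primary = biosample.find("./Ids/Id[@is_primary='1']")
    by_db = {el.get("db"): el.text for el in biosample.findall("./Ids/Id")}
    return accession, (primary.text if primary is not None else None), by_db


accession, primary_id, by_db = identifiers(ET.fromstring(SNIPPET.strip()))
print(accession, primary_id, by_db["SRA"])
```

For the full biosample_set.xml, streaming with `ET.iterparse` would probably be needed rather than loading the whole file, but the per-element logic would stay the same.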
@turbomam, by accession, you mean the column named accession_biosample_id, correct?
@hrshdhgd have you done much with this yet? @wdduncan helped me find relevant input data and utilities and I have been reading about MIxS in general. I think I could do the following now: map unique values from env_broad_scale, env_medium and env_local_scale to the "LABEL [ENVO:nnnn]" notation, as TSV output.
I have not yet. I think that seems like a good plan.
- I guess interleaving those mappings back into harmonized-table.tsv shouldn't be too hard, but I haven't planned that out yet.
I'm guessing a JOIN using id and accession_biosample_id as keys should do the trick?
- I haven't planned any quality filters yet either
Something we'll need to discuss further
- https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/ doesn't seem to suggest mapping package_name to ENVO. Does that make sense to you?
There is a field named environmental package there. That could be the mapping.
I also just noticed that the accession_biosample_id is just a suffix to the id column, if that is of any value.
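The JOIN idea above can be sketched in plain Python. Everything here is an assumption for illustration: the `BIOSAMPLE:` prefix is inferred from the example id BIOSAMPLE:SAMN00000002, `join_mappings` is a hypothetical helper, and the mapping table is imagined as keyed by accession_biosample_id with the repaired columns as values.

```python
PREFIX = "BIOSAMPLE:"  # assumed from the example id BIOSAMPLE:SAMN00000002


def join_mappings(rows, mappings):
    """Left-join repaired columns onto table rows.

    rows:     list of dicts that each have an 'id' such as 'BIOSAMPLE:SAMN...'
    mappings: dict of accession_biosample_id -> dict of columns to add/override
    """
    joined = []
    for row in rows:
        row_id = row["id"]
        # accession_biosample_id is the id with the prefix stripped
        key = row_id[len(PREFIX):] if row_id.startswith(PREFIX) else row_id
        joined.append({**row, **mappings.get(key, {})})
    return joined


rows = [
    {"id": "BIOSAMPLE:SAMN00000002", "env_broad_scale": "terrestrial biome"},
    {"id": "BIOSAMPLE:SAMN00001340", "env_broad_scale": "aquatic"},
]
mappings = {
    "SAMN00000002": {"env_broad_scale": "terrestrial biome [ENVO:00000446]"},
}
for row in join_mappings(rows, mappings):
    print(row)
```

Rows without a mapping pass through unchanged, which keeps the unmatched values visible for later review instead of dropping them.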
I have been working on runNER some more and I have added the following features:
Question: @cmungall, while adding the MIxS syntax in the format LABEL [ENVO:nnnn], would you expect the same format for synonyms, e.g. LABEL [ENVO:nnnn_SYNONYM], or no?
These are mostly strings. Some do not correspond to a class label, e.g. 'tundra'
There should be a repair step that gets the IDs. I suggest a denormalized/flattened schema where we append _id onto the field name, e.g. env_local_scale_id=ENVO:nnnn. In the NMDC/MIxS schema this is a compound object
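As a rough sketch of the flattened layout suggested here, each environment field can get a sibling `_id` column filled from the bracketed ID when one is present. The `flatten` helper, the `FIELDS` tuple, and the regular expression are assumptions for illustration; values that are still bare strings get an empty `_id` so a later repair step can pick them up.

```python
import re

# Assumption: values follow 'label [ENVO:digits]' when they carry an ID.
TERM_RE = re.compile(r"^(?P<label>.+?) \[(?P<id>ENVO:\d+)\]$")
FIELDS = ("env_broad_scale", "env_medium", "env_local_scale")


def flatten(row):
    """Return a copy of row with <field>_id columns appended."""
    out = dict(row)
    for field in FIELDS:
        match = TERM_RE.match(row.get(field) or "")
        out[field + "_id"] = match.group("id") if match else ""
    return out


row = {
    "env_broad_scale": "terrestrial biome [ENVO:00000446]",
    "env_medium": "biological product [ENVO:02000043]",
    "env_local_scale": "Pacific Ocean",
}
print(flatten(row))
```

Here env_broad_scale_id and env_medium_id are populated, while env_local_scale_id stays empty because "Pacific Ocean" carries no ID.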