normalize package names

cmungall commented 4 years ago

This should be done as a pre-processing step, as part of the overall ETL pipeline, so that individual analyses do not each need to do their own normalization.

Currently done for water packages here: https://nbviewer.jupyter.org/github/INCATools/biosample-analysis/blob/master/src/notebooks/water-package-profiling.ipynb

I am envisioning a general toolkit that performs this kind of repair on the whole TSV

wdduncan commented 4 years ago

My workflow for normalizing the water package names was like this:

1. Find all env_packages containing the string 'water'.
2. Drop packages that don't seem to be for water; e.g., 'freshwater sediment', 'wastewater sludge'.
3. Map obvious names to water; e.g., 'MIGS/MIMS/MIMARKS.water' => 'water'.
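
For illustration, the three steps above could be sketched roughly as follows. This is not the notebook's actual code: the function name, the `NOT_WATER` exclusion set, and the `OBVIOUS_WATER` set are hypothetical and only contain the examples mentioned in this comment.

```python
# Hypothetical sets, filled only with the examples named in this thread.
NOT_WATER = {"freshwater sediment", "wastewater sludge"}
OBVIOUS_WATER = {"water", "MIGS/MIMS/MIMARKS.water"}


def normalize_water_packages(env_packages):
    """Map each water-like env_package value to a norm_env_package value."""
    mapping = {}
    for name in env_packages:
        # Step 1: only consider values containing the string 'water'.
        if "water" not in name.lower():
            continue
        # Step 2: drop values that do not seem to be water packages.
        if name.lower() in NOT_WATER:
            continue
        # Step 3: map obvious names to 'water'; leave everything else as-is.
        mapping[name] = "water" if name in OBVIOUS_WATER else name
    return mapping


sample = [
    "water",
    "MIGS/MIMS/MIMARKS.water",
    "sea water",
    "freshwater sediment",
    "wastewater sludge",
    "soil",
]
water_mapping = normalize_water_packages(sample)
```

In this sketch, values such as 'sea water' pass through unchanged, because only the names in `OBVIOUS_WATER` are rewritten.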

I put the mapped values in a field named 'norm_env_package'. Perhaps we should reference the MIxS package label; e.g., 'mixs5_env_package'. That would make it clearer which env_package name we are normalizing on.

Also, I left some env_package values as their original value in the norm_env_package field; e.g., 'sea water', 'waste water'. On reflection, I think I should have mapped these to 'water' because that is the name of the MIxS package. The original values are still in the env_package field.

But do we also want to normalize the names within the subset of 'water' packages? For example, do we want to normalize 'wastewater' and 'waste water' to a single name?

My proposal for normalization mappings:

Standardize differences in spelling, capitalization, etc. in the normalized_env_package field. E.g., map 'waste water', 'wastewater', and 'MIGS/MIMS/MIMARKS.wastewater' to normalized_env_package: 'waste water'.
Create a 'mixs5_env_package' field to map to the env packages in the standard. E.g., 'waste water' would map to 'water'.
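
A minimal sketch of this two-field proposal, assuming small hand-written lookup tables. The `SPELLING` and `MIXS5` tables and the function names are hypothetical; only the 'waste water' example comes from the proposal itself.

```python
import re

MIGS_PREFIX = "MIGS/MIMS/MIMARKS."

# Hypothetical lookup tables; a real version would cover many more names.
SPELLING = {"wastewater": "waste water"}
MIXS5 = {"waste water": "water", "water": "water"}


def normalized_env_package(raw):
    """Standardize prefix, capitalization, whitespace, and spelling."""
    name = raw.strip()
    if name.startswith(MIGS_PREFIX):
        name = name[len(MIGS_PREFIX):]
    name = re.sub(r"\s+", " ", name.lower())
    return SPELLING.get(name, name)


def mixs5_env_package(normalized):
    """Look up the MIxS package; None means no mapping has been decided."""
    return MIXS5.get(normalized)


def annotate(raw):
    """Return the original value alongside both proposed fields."""
    norm = normalized_env_package(raw)
    return {
        "env_package": raw,
        "normalized_env_package": norm,
        "mixs5_env_package": mixs5_env_package(norm),
    }


rows = [
    annotate(v)
    for v in ("waste water", "wastewater", "MIGS/MIMS/MIMARKS.wastewater")
]
```

Keeping the original env_package value in each row means either mapping can be revised later without losing information.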

cc @cmungall @realmarcin

wdduncan commented 4 years ago

@cmungall In the short term, we can normalize on the controlled terms in the MIxS standard. But in the long term, it would be good to normalize the package names by referencing URIs in the mixs-rdf project. We haven't created URIs for package names yet, but this seems like the next logical step.

wdduncan commented 3 years ago

Ignore the link to the ENVO issue about medical infrastructure. I accidentally posted it here.

wdduncan commented 3 years ago

Results of package names provided by @cmungall in file target/distinct-env_package.tsv

49254  host-associated
47921  human-gut
16367  water
13706  human-skin
12391  built environment
11976  soil
11715  misc environment
8453  missing
7882  human-oral
5969  sediment
3786  MIGS/MIMS/MIMARKS.soil
3167  human-associated
2988  MIGS/MIMS/MIMARKS.host-associated
2499  MIGS/MIMS/MIMARKS.human-gut
2129  microbial mat/biofilm
2076  plant-associated
1837  MIGS/MIMS/MIMARKS.water
1417  MIGS/MIMS/MIMARKS.human-associated
1189  MIGS/MIMS/MIMARKS.sediment
1154  MIGS/MIMS/MIMARKS.human-oral
1077  human-vaginal
1063  MIGS/MIMS/MIMARKS.plant-associated
741   MIGS/MIMS/MIMARKS.microbial
611   miscellaneous natural or artificial environment
558   MIGS/MIMS/MIMARKS.miscellaneous
479   mimarks
417   wastewater|sludge
406   mouse-gut
385   MIGS/MIMS/MIMARKS.wastewater
357   wastewater/sludge
283   unknown
212   Human-associated
206   MIGS/MIMS/MIMARKS.air
201   Human-oral
172   gut
171   host_associated
152   air
135   MIGS/MIMS/MIMARKS.human-skin
114   biofilm
111   Human-gut
107   human-not providedsopharyngeal
90   wastewater sludge
87   mice gut
61   built
60   CV
59   human gut
59   Human_Gut
51   default
48   microbial mat|biofilm
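
As a rough illustration of how much of the variation above is only prefix, case, and separator differences, the counts can be regrouped under a crude key. The `grouping_key` function is a hypothetical helper, the key is meant only for grouping (not as a final normalized name), and the sample uses a handful of rows copied from the list above.

```python
import re
from collections import Counter

MIGS_PREFIX = "MIGS/MIMS/MIMARKS."


def grouping_key(raw):
    """Strip the MIGS prefix, lowercase, and treat spaces/underscores as hyphens."""
    name = raw.strip()
    if name.startswith(MIGS_PREFIX):
        name = name[len(MIGS_PREFIX):]
    return re.sub(r"[\s_]+", "-", name.lower())


# A few (count, env_package) rows copied from the list above.
observed = [
    (49254, "host-associated"),
    (47921, "human-gut"),
    (2988, "MIGS/MIMS/MIMARKS.host-associated"),
    (2499, "MIGS/MIMS/MIMARKS.human-gut"),
    (171, "host_associated"),
    (111, "Human-gut"),
    (59, "human gut"),
    (59, "Human_Gut"),
]

grouped = Counter()
for count, name in observed:
    grouped[grouping_key(name)] += count

for key, total in grouped.most_common():
    print(total, key)
```

Values like 'gut', 'mice gut', or 'wastewater|sludge' would still need explicit mapping decisions; a key like this only collapses the mechanical variants.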

INCATools / biosample-analysis

normalize package names #24