Open cmungall opened 4 years ago
My workflow for normalizing the water package names was like this:
I put the mapped values in a field named 'norm_env_package'. Perhaps we should reference the MIxS package label instead; e.g., 'mixs5_env_package'. That would make it clearer which env_package name we are normalizing on.
Also, I left some env_package values as their original value in the norm_env_package field; e.g., 'sea water', 'waste water'. On reflection, I think I should have mapped these to 'water' because that is the name of the MIxS package. The original values are still in the env_package field.
But do we also want to normalize variants within the subset of 'water' packages to a single name? For example, do we want to normalize 'wastewater' and 'waste water' to the same value?
My proposal for normalization mappings:
Standardize differences in spelling, capitalization, etc. in the normalized_env_package field. E.g., map 'waste water', 'wastewater', and 'MIGS/MIMS/MIMARKS.wastewater' to normalized_env_package: 'waste water'.
Create a 'mixs5_env_package' field to map to the env packages in the standard. E.g., 'waste water' would map to 'water'.
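A minimal sketch of the two-step proposal above, in Python. The function name and both lookup tables are hypothetical; they only cover the examples mentioned in this issue and are not a vetted mapping.

```python
import re

# Prefix seen on some env_package values in the data.
MIXS_PREFIX = "MIGS/MIMS/MIMARKS."

# Hypothetical step 1 table: spelling variant -> normalized_env_package.
SPELLING = {
    "wastewater": "waste water",
    "waste water": "waste water",
    "sea water": "sea water",
    "water": "water",
}

# Hypothetical step 2 table: normalized_env_package -> mixs5_env_package.
MIXS5_PACKAGE = {
    "waste water": "water",
    "sea water": "water",
    "water": "water",
}


def normalize_env_package(raw):
    """Return (normalized_env_package, mixs5_env_package) for a raw value.

    mixs5_env_package is None when the value is not in the package table.
    """
    value = raw.strip()
    if value.startswith(MIXS_PREFIX):
        value = value[len(MIXS_PREFIX):]
    # Collapse runs of whitespace and ignore capitalization.
    value = re.sub(r"\s+", " ", value).lower()
    normalized = SPELLING.get(value, value)
    return normalized, MIXS5_PACKAGE.get(normalized)


print(normalize_env_package("MIGS/MIMS/MIMARKS.wastewater"))
```

Keeping the two tables separate means the original spelling fix and the mapping to the standard can be reviewed independently, and the raw env_package value is never overwritten.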
cc @cmungall @realmarcin
@cmungall In the short term we can normalize on the controlled terms in the MIxS standard. But in the long term it would be good to normalize the package names by referencing URIs in the mixs-rdf project. We haven't created URIs for package names yet, but this seems like the next logical step.
Ignore the link to the ENVO issue about medical infrastructure. I accidentally posted it here.
Results of package names provided by @cmungall in file target/distinct-env_package.tsv
49254 host-associated
47921 human-gut
16367 water
13706 human-skin
12391 built environment
11976 soil
11715 misc environment
8453 missing
7882 human-oral
5969 sediment
3786 MIGS/MIMS/MIMARKS.soil
3167 human-associated
2988 MIGS/MIMS/MIMARKS.host-associated
2499 MIGS/MIMS/MIMARKS.human-gut
2129 microbial mat/biofilm
2076 plant-associated
1837 MIGS/MIMS/MIMARKS.water
1417 MIGS/MIMS/MIMARKS.human-associated
1189 MIGS/MIMS/MIMARKS.sediment
1154 MIGS/MIMS/MIMARKS.human-oral
1077 human-vaginal
1063 MIGS/MIMS/MIMARKS.plant-associated
741 MIGS/MIMS/MIMARKS.microbial
611 miscellaneous natural or artificial environment
558 MIGS/MIMS/MIMARKS.miscellaneous
479 mimarks
417 wastewater|sludge
406 mouse-gut
385 MIGS/MIMS/MIMARKS.wastewater
357 wastewater/sludge
283 unknown
212 Human-associated
206 MIGS/MIMS/MIMARKS.air
201 Human-oral
172 gut
171 host_associated
152 air
135 MIGS/MIMS/MIMARKS.human-skin
114 biofilm
111 Human-gut
107 human-not providedsopharyngeal
90 wastewater sludge
87 mice gut
61 built
60 CV
59 human gut
59 Human_Gut
51 default
48 microbial mat|biofilm
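To get a feel for how much of the list above is just spelling variation, here is a rough sketch that groups a few of the counts under a comparison key. The key function is an assumption, not an agreed rule: it strips the 'MIGS/MIMS/MIMARKS.' prefix, lowercases, and treats hyphens, underscores, and spaces as equivalent. The key is only for grouping and is not meant as the final normalized name.

```python
import re
from collections import Counter

PREFIX = "MIGS/MIMS/MIMARKS."


def grouping_key(value):
    """Build a comparison key that ignores prefix, case, and separators."""
    value = value.strip()
    if value.startswith(PREFIX):
        value = value[len(PREFIX):]
    # Treat '-', '_' and whitespace as the same separator.
    return re.sub(r"[\s_\-]+", " ", value).lower()


# A few rows copied from the counts above.
observed = {
    "host-associated": 49254,
    "human-gut": 47921,
    "MIGS/MIMS/MIMARKS.host-associated": 2988,
    "MIGS/MIMS/MIMARKS.human-gut": 2499,
    "host_associated": 171,
    "Human-gut": 111,
    "human gut": 59,
    "Human_Gut": 59,
}

merged = Counter()
for value, count in observed.items():
    merged[grouping_key(value)] += count

for key, count in merged.most_common():
    print(count, key)
# 52413 host associated
# 50649 human gut
```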
This should be done as a pre-processing step, as part of the overall ETL pipeline, so that each individual analysis does not need to do its own normalization.
Currently done for water packages here: https://nbviewer.jupyter.org/github/INCATools/biosample-analysis/blob/master/src/notebooks/water-package-profiling.ipynb
I am envisioning a general toolkit that performs this kind of repair on the whole TSV
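As a starting point, something like the following could serve as the core of such a toolkit. This is only a sketch under assumptions: `repair_tsv` and its parameters are hypothetical names, the 'id' column in the demo is made up, and the normalizer passed in here is a placeholder for whatever mapping we agree on. It uses only the standard library `csv` module, streams row by row, and writes the repaired value to a new column so the original is preserved.

```python
import csv
import io


def repair_tsv(src, dst, column, new_column, normalizer):
    """Copy a TSV from src to dst, adding new_column = normalizer(row[column]).

    src and dst are open text file objects; the original column is left as-is.
    """
    reader = csv.DictReader(src, delimiter="\t")
    fieldnames = list(reader.fieldnames or []) + [new_column]
    writer = csv.DictWriter(
        dst, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        row[new_column] = normalizer(row[column])
        writer.writerow(row)


# Demo on an in-memory TSV with a placeholder normalizer.
src = io.StringIO("id\tenv_package\nS1\tHuman_Gut\nS2\tsoil\n")
dst = io.StringIO()
repair_tsv(
    src,
    dst,
    column="env_package",
    new_column="norm_env_package",
    normalizer=lambda v: v.lower().replace("_", "-"),
)
print(dst.getvalue())
```

Passing the normalizer in as a function keeps the file handling separate from the mapping rules, so the same pass could later be reused for other fields.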