Closed bryttanytodd closed 5 years ago
FooDB uses its own identifiers for compounds. In ROBOKOP/biolink these are called "chemical_substance" and we need to figure out the synonyms between these identifiers and the other identifiers in the system.
FooDB includes columns/associations for CHEMBL, KEGG, PUBCHEM, CHEBI, DrugBank, HMDB (it has others, but these are the ones that we care about). The problem is that the relationships are incomplete. For instance: http://foodb.ca/compounds/FDB000014 has no external identifiers. But by searching on inchikey, we find that it's the same compound as https://pubchem.ncbi.nlm.nih.gov/compound/101209468.
Our normal method for comparing chemicals to find ones with the same structure is to use unichem. The problem is that unichem doesn't include foodb in its sources. I can see a few possible methods going forward:
Food is a new node type in robokop. There are a few issues:
I don't know if there is a food semantic type in the biolink model. If not, we might need to request one.
There are two identifier schemes for food: FooDB and FOODON. There does not appear to be any mapping between the two, so we can't do any meaningful synonymization as far as I can tell. I think for now, that we'll want to just use FooDB identifiers, nullop synonymization, and try to get somebody interested in fixing this. We'll also need to give a list of food identfiers, names and synonyms to Jim for Omnicorp crawling. (FOODON is being used, but again, no mapping).
There is not a huge amount of information about foods that we might want to use as a property. There's a long text string (which I think is better gotten to with a linkout). There is a Group and a Subgroup that might make nice properties though, and we might think about adding an annotator for foods to get them.
This is my evaluation of the data and the changes I had to make to get it into a smartBag.
Some files have non UTF-8 data: • Usage of the angstrom char caused me some angst. • Numerous other chars that had no mapping to UTF-8. • Switched to ISO-8859-1 and that mapped those chars correctly.
Header column count to data record element count is off in the following files (file: record count): • Compound_synonyms.csv: 2. • Contents.csv: 78599. • Foodcomex_compounds.csv: 1122. • Foods.csv: 907. • References.csv: 1.
One of the file names was a SQLite reserved word: • Changed “references” to “references1” when being used as a table name.
One of the column names was a SQLite reserved word in References.csv: • Changed “text” to “text1”.
Foods.csv had a duplicate column name: • 1 instance of the header column item “Wikipedia_id” removed.
Huge number of records in contents.csv have all null columns except for an ID.
The big relationship that we want is between foods and compounds. (The stuff in the composition section of the chemical page). It looks like it's the "Compound" rows of contents.csv. We'll want to have food->chem and chem-> food calls, each of which returns edges linking the two. There are a few complications:
There is a numerical value (the amount of the compound in the food) which we want to capture as an edge property, but we need to make sure the units are uniform.
There are some cases where in the website it says "detected but not quantified", and we'll want a way to represent that, maybe just by leaving the quantity off of the edge.
The contents are dependent on the food preparation. So for instance on Pak Choi http://foodb.ca/foods/252 there are contents that can be subset into "raw" or "cooked". I think we want to make a new edge for each preparation, which the prep as a property?
It's possible that there are multiple rows for a given preparation, if there are multiple sources providing the data. In that case, we should take the average.
There are references given for each row, but they're not pubmed ids. Instead they are links to some other databases and sources.
In addition to "compound" there are also "nutrients". These are something like groupings. In robokop, we would still map these to chemicals I think (at least for the most part). There are no external mappings for them, and my inclination is that we could just map them ourselves. Mostly to CHEBI but to MESH in some cases (like dietary fiber). The problem is that there is really not enough information for most of the nutrients in terms of a definition. There are things like "18:4". Which means that it's got 18 C and 4 double bonds, I think, but maybe it means more than that. Sometimes people use these shorthands for a specific kind of fatty acid (lineolic or whatever). Maybe it's common parlance, but I don't get it. Maybe we can talk to the Foodb folks to get more info about the nutrient meanings. If we did have that, we would also want to include them in synonymizing chemicals and getting chemical/food edges.
Compound / enzyme links are all coming from HMDB, which we pull separately so we can ignore those
There are flavors for compounds, which might be interesting, and which I think we would represent as properties on the chemicals. But I think it's very low priority for us.
There are health effects associated with compounds as well, which look like they are all chebi roles. Again, we're already bringing in that data, so we shouldn't need to get it from foodb.
Sparse data eval:
Compound table 28771 total records
Of those, the following are the number of records where the value is null: legacy kegg pubchem_compound pubchem_substance chebi het uniprot genbank wikipedia phenolexplorer dfc hmdb duke drugbank bigg eafus knapsack moldb flavornet goodscent superscent chembl chemspider meta_cyc phytohub 19331 6651 17854 5 2530 974 0 0 4515 77 15071 20772 6205 707 876 2964 4826 24399 745 2612 431 2820 14763 4621 0
or in percentage not populated in column: legacy kegg pubchem_compound pubchem_substance chebi het uniprot genbank wikipedia phenolexplorer dfc hmdb duke drugbank bigg eafus knapsack moldb flavornet goodscent superscent chembl chemspider meta_cyc phytohub 67% 23% 62% 0% 9% 3% 0% 0% 16% 3% 52% 72% 22% 2% 3% 10% 17% 85% 3% 9% 1% 0% 51% 16% 0%
What fields will Jim need to perform the omnicorp synonymization crawling? food ids, and the various names are readily available in the foods and contents tables.
but what synonyms should be gathered for each? FKs point to a number of other possible tables that can be leveraged. are we looking for alternate names (e.g. scientific names) and external data source IDs when available (like FoodComEx, pubchem, etc.)?
I cannot find a direct link from food <-> compound in the data. there are no foreign keys that would clearly identify an FK relationship.
there is a column named "public_id" in the column data that has 'FDB000001" data. so i made a presumption that that the "public_id" refers to the primary key in the foods table. However there doesnt seem to be be anything obvious in the data to confirm a relationship between the records. here is the query:
select f., ' ' as sep, c. from compounds c join foods f on concat('FDB', lpad(cast(f.id as char), 6, '0'))=c.public_id order by 1;
Here is the DB model diagram with what i could infer with FK relationships
More takeaways from the diagram above:
There may be a way to infer relationships between these 3 tables by using the external database ID columns shared by each. however, due to the sparsity of these external IDs these cross-references may not associate relationships comprehensively.
There's a join table, which is called "contents" . See the comment above that begins "The big relationship that we want is between foods and compounds"
well i totally missed that. sorry.
in my own defense tho, its a very poor design.
As you can see in the query pic i think we have an opportunity to use whatever external DB ID that we have to do the synonymization.
The 2 other pics is a rough outline of what i will do in code.
Can we talk tomorrow?
Also assign Kira once she has access to repo