NCATS-Gamma / robokop

Master UI for ROBOKOP
MIT License
16 stars 3 forks source link

Research/review FooDB #346

Closed bryttanytodd closed 5 years ago

bryttanytodd commented 5 years ago

Also assign Kira once she has access to repo

cbizon commented 5 years ago

FooDB uses its own identifiers for compounds. In ROBOKOP/biolink these are called "chemical_substance" and we need to figure out the synonyms between these identifiers and the other identifiers in the system.

FooDB includes columns/associations for CHEMBL, KEGG, PUBCHEM, CHEBI, DrugBank, HMDB (it has others, but these are the ones that we care about). The problem is that the relationships are incomplete. For instance: http://foodb.ca/compounds/FDB000014 has no external identifiers. But by searching on inchikey, we find that it's the same compound as https://pubchem.ncbi.nlm.nih.gov/compound/101209468.

Our normal method for comparing chemicals to find ones with the same structure is to use unichem. The problem is that unichem doesn't include foodb in its sources. I can see a few possible methods going forward:

  1. Convince UniChem to pickup foodb as a source and let them handle it. Best long term, but unlikely to happen
  2. Use the PubChem web service to pull a list of all pubchem inchikeys and use that to match inchikeys from foodb. PubChem will probably be the place to find most of the misfits, but it's possible that when we're done, we'll find that there are a few compounds that don't have a pubchem, but are maybe in chebi etc.
  3. Say that any compound that foodb didn't tell us an id for won't synonymize. These are going to be mostly compounds that don't have a lot of other data for them, so it wouldn't hurt too much. But it's really only acceptable as a waiting state if (1) is happening.
cbizon commented 5 years ago

Food is a new node type in robokop. There are a few issues:

  1. I don't know if there is a food semantic type in the biolink model. If not, we might need to request one.

  2. There are two identifier schemes for food: FooDB and FOODON. There does not appear to be any mapping between the two, so we can't do any meaningful synonymization as far as I can tell. I think for now, that we'll want to just use FooDB identifiers, nullop synonymization, and try to get somebody interested in fixing this. We'll also need to give a list of food identfiers, names and synonyms to Jim for Omnicorp crawling. (FOODON is being used, but again, no mapping).

cbizon commented 5 years ago

There is not a huge amount of information about foods that we might want to use as a property. There's a long text string (which I think is better gotten to with a linkout). There is a Group and a Subgroup that might make nice properties though, and we might think about adding an annotator for foods to get them.

PhillipsOwen commented 5 years ago

This is my evaluation of the data and the changes I had to make to get it into a smartBag.

Some files have non UTF-8 data: • Usage of the angstrom char caused me some angst. • Numerous other chars that had no mapping to UTF-8. • Switched to ISO-8859-1 and that mapped those chars correctly.

Header column count to data record element count is off in the following files (file: record count): • Compound_synonyms.csv: 2. • Contents.csv: 78599. • Foodcomex_compounds.csv: 1122. • Foods.csv: 907. • References.csv: 1.

One of the file names was a SQLite reserved word: • Changed “references” to “references1” when being used as a table name.

One of the column names was a SQLite reserved word in References.csv: • Changed “text” to “text1”.

Foods.csv had a duplicate column name: • 1 instance of the header column item “Wikipedia_id” removed.

Huge number of records in contents.csv have all null columns except for an ID.

cbizon commented 5 years ago

The big relationship that we want is between foods and compounds. (The stuff in the composition section of the chemical page). It looks like it's the "Compound" rows of contents.csv. We'll want to have food->chem and chem-> food calls, each of which returns edges linking the two. There are a few complications:

  1. There is a numerical value (the amount of the compound in the food) which we want to capture as an edge property, but we need to make sure the units are uniform.

  2. There are some cases where in the website it says "detected but not quantified", and we'll want a way to represent that, maybe just by leaving the quantity off of the edge.

  3. The contents are dependent on the food preparation. So for instance on Pak Choi http://foodb.ca/foods/252 there are contents that can be subset into "raw" or "cooked". I think we want to make a new edge for each preparation, which the prep as a property?

  4. It's possible that there are multiple rows for a given preparation, if there are multiple sources providing the data. In that case, we should take the average.

  5. There are references given for each row, but they're not pubmed ids. Instead they are links to some other databases and sources.

cbizon commented 5 years ago

In addition to "compound" there are also "nutrients". These are something like groupings. In robokop, we would still map these to chemicals I think (at least for the most part). There are no external mappings for them, and my inclination is that we could just map them ourselves. Mostly to CHEBI but to MESH in some cases (like dietary fiber). The problem is that there is really not enough information for most of the nutrients in terms of a definition. There are things like "18:4". Which means that it's got 18 C and 4 double bonds, I think, but maybe it means more than that. Sometimes people use these shorthands for a specific kind of fatty acid (lineolic or whatever). Maybe it's common parlance, but I don't get it. Maybe we can talk to the Foodb folks to get more info about the nutrient meanings. If we did have that, we would also want to include them in synonymizing chemicals and getting chemical/food edges.

cbizon commented 5 years ago

Compound / enzyme links are all coming from HMDB, which we pull separately so we can ignore those

cbizon commented 5 years ago

There are flavors for compounds, which might be interesting, and which I think we would represent as properties on the chemicals. But I think it's very low priority for us.

cbizon commented 5 years ago

There are health effects associated with compounds as well, which look like they are all chebi roles. Again, we're already bringing in that data, so we shouldn't need to get it from foodb.

PhillipsOwen commented 5 years ago

Sparse data eval:

Compound table 28771 total records

Of those, the following are the number of records where the value is null: legacy kegg pubchem_compound pubchem_substance chebi het uniprot genbank wikipedia phenolexplorer dfc hmdb duke drugbank bigg eafus knapsack moldb flavornet goodscent superscent chembl chemspider meta_cyc phytohub 19331 6651 17854 5 2530 974 0 0 4515 77 15071 20772 6205 707 876 2964 4826 24399 745 2612 431 2820 14763 4621 0

or in percentage not populated in column: legacy kegg pubchem_compound pubchem_substance chebi het uniprot genbank wikipedia phenolexplorer dfc hmdb duke drugbank bigg eafus knapsack moldb flavornet goodscent superscent chembl chemspider meta_cyc phytohub 67% 23% 62% 0% 9% 3% 0% 0% 16% 3% 52% 72% 22% 2% 3% 10% 17% 85% 3% 9% 1% 0% 51% 16% 0%

PhillipsOwen commented 5 years ago

What fields will Jim need to perform the omnicorp synonymization crawling? food ids, and the various names are readily available in the foods and contents tables.

but what synonyms should be gathered for each? FKs point to a number of other possible tables that can be leveraged. are we looking for alternate names (e.g. scientific names) and external data source IDs when available (like FoodComEx, pubchem, etc.)?

PhillipsOwen commented 5 years ago

I cannot find a direct link from food <-> compound in the data. there are no foreign keys that would clearly identify an FK relationship.

there is a column named "public_id" in the column data that has 'FDB000001" data. so i made a presumption that that the "public_id" refers to the primary key in the foods table. However there doesnt seem to be be anything obvious in the data to confirm a relationship between the records. here is the query:

select f., ' ' as sep, c. from compounds c join foods f on concat('FDB', lpad(cast(f.id as char), 6, '0'))=c.public_id order by 1;

PhillipsOwen commented 5 years ago

Here is the DB model diagram with what i could infer with FK relationships FooDB (MySQL)

PhillipsOwen commented 5 years ago

More takeaways from the diagram above:

There may be a way to infer relationships between these 3 tables by using the external database ID columns shared by each. however, due to the sparsity of these external IDs these cross-references may not associate relationships comprehensively.

cbizon commented 5 years ago

There's a join table, which is called "contents" . See the comment above that begins "The big relationship that we want is between foods and compounds"

PhillipsOwen commented 5 years ago

well i totally missed that. sorry.

in my own defense tho, its a very poor design.

PhillipsOwen commented 5 years ago

As you can see in the query pic i think we have an opportunity to use whatever external DB ID that we have to do the synonymization.

The 2 other pics is a rough outline of what i will do in code.

Can we talk tomorrow? foods-compounds-contents join Food to chemical compound crawler Perform food to chemical compound