NCATS-Tangerine / ncats-ingest

Management of ingestion of sources for NCATS-translator

Ingest GINAS #1

Open cmungall opened 7 years ago

cmungall commented 7 years ago

Example: https://ginas.ncats.nih.gov/ginas/app/substance/4372d367

TomConlin commented 7 years ago

curl -s https://ginas.ncats.nih.gov/ginas/app/api/v1 > v1.json

# fetch every endpoint listed in the API root, one output file per endpoint
for url in $(jq -r '.[].href' v1.json); do
    echo "${url}"
    curl "${url}" > "${url##*/}.json"
done
#for url in $(jq -r '.[].href' v1.json); do wget "${url}"; done

backups.json
codes.json
edits.json
jobs.json
keywords.json 
names.json
payload.json
references.json
structures.json
substances.json
v1.json
values.json
vocabularies.json
xrefs.json

These look a lot like relational table names.
Each .json is structured as a partial results page (default 10 rows?) with pointers to the next/previous pages.
Each has a content structure whose attributes I expect are column names.

for j in *.json; do 
    echo ${j};
    jq -r '.content[]|keys' "${j}" |\
        tr -d '\n' | tr \] '\n' | tr -d \" | sed 's/^\[//g' | sort -u ;\
    echo "";
done

########################################################

backups.json
  created,  deprecated,  id,  kind,  modified,  version

codes.json
  _self,  access,  code,  codeSystem,  comments,  created,  createdBy,  deprecated,  lastEdited,  lastEditedBy,  references,  type,  url,  uuid
  _self,  access,  code,  codeSystem,             created,  createdBy,  deprecated,  lastEdited,  lastEditedBy,  references,  type,  url,  uuid
  _self,  access,  code,  codeSystem,             created,  createdBy,  deprecated,  lastEdited,  lastEditedBy,  references,  type,        uuid

edits.json
  created,  diff,  editor,  kind,  newValue,  oldValue,  refid
  created,  diff,  editor,  kind,  newValue,  oldValue,  refid,  version

jobs.json
  _owner,  _payload,  id,  keys,  message,  name,  start,  statistics,  status,  stop,  version

keywords.json
  id,  term

names.json
  _self,  access,  created,  createdBy,  deprecated,  displayName,  domains,  languages,  lastEdited,  lastEditedBy,  name,  nameJurisdiction,  nameOrgs,  preferred,  references,  type,  uuid

payload.json
  created,  id,  mimeType,  name,  properties,  sha1,  size

references.json
  _self,  access,  citation,  created,  createdBy,  deprecated,  docType,  documentDate,  id,  lastEdited,  lastEditedBy,  publicDomain,  tags,  uuid
  _self,  access,  citation,  created,  createdBy,  deprecated,  docType,  documentDate,       lastEdited,  lastEditedBy,  publicDomain,  tags,  uuid
  _self,  access,  citation,  created,  createdBy,  deprecated,  docType,                      lastEdited,  lastEditedBy,  publicDomain,  tags,  uuid

structures.json
  _properties,  access,  atropisomerism,  charge,  count,  created,  createdBy,  definedStereo,  deprecated,  digest,  ezCenters,  formula,  hash,  id,  lastEdited,  lastEditedBy,  molfile,  mwt,  opticalActivity,  references,  self,  smiles,  stereoCenters,  stereoComments,  stereochemistry

substances.json
  _approvalIDDisplay,  _codes,  _moieties,  _name,  _names,                _references,                   _self,  access,  approvalID,  approved,  approvedBy,  created,  createdBy,  definitionLevel,  definitionType,  deprecated,  lastEdited,  lastEditedBy,                            status,  structure,            substanceClass,  uuid,  version
  _approvalIDDisplay,  _codes,              _name,  _names,  _properties,  _references,                   _self,  access,  approvalID,  approved,  approvedBy,  created,  createdBy,  definitionLevel,  definitionType,  deprecated,  lastEdited,  lastEditedBy,  modifications,  protein,  status,                        substanceClass,  uuid,  version
  _approvalIDDisplay,  _codes,              _name,  _names,                _references,  _relationships,  _self,  access,  approvalID,  approved,  approvedBy,  created,  createdBy,  definitionLevel,  definitionType,  deprecated,  lastEdited,  lastEditedBy,  modifications,            status,  structurallyDiverse,  substanceClass,  uuid,  version
  _approvalIDDisplay,                       _name,  _names,                _references,  _relationships,  _self,  access,  approvalID,  approved,  approvedBy,  created,  createdBy,  definitionLevel,  definitionType,  deprecated,  lastEdited,  lastEditedBy,  modifications,            status,  structurallyDiverse,  substanceClass,  uuid,  version

values.json
  id,  term

vocabularies.json
  created,  deprecated,  domain,  editable,  fields,  filterable,  id,  modified,  terms,  version,  vocabularyTermType

xrefs.json
  created,  deprecated,  href,  id,  kind,  modified,  properties,  refid,  version

Here I have listed the distinct column patterns present in the default result blob.
This is not expected to be a full list of columns, as some may not occur in the first default set of rows returned (I expect 10 rows but have not confirmed).
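A minimal Python sketch of the same key-pattern survey, extended over more than one page so that less common columns get a chance to show up. It assumes only what is noted above, that each page carries its records under `content`; how the pages are fetched is left out, and the two pages below are made-up stand-ins for real responses.

```python
def distinct_key_patterns(pages):
    """Return the distinct sorted key tuples seen across all records of all pages."""
    patterns = set()
    for page in pages:
        # Each page is assumed to hold its rows under "content".
        for record in page.get("content", []):
            patterns.add(tuple(sorted(record)))
    return sorted(patterns)


# Made-up pages standing in for successive API responses.
pages = [
    {"content": [{"id": 1, "term": "a"}, {"id": 2, "term": "b", "version": 1}]},
    {"content": [{"id": 3, "term": "c"}]},
]

patterns = distinct_key_patterns(pages)
for pattern in patterns:
    print(",  ".join(pattern))
```

Feeding it every page of an endpoint rather than just the first would give a more trustworthy column list.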

We could slog through reconstituting the foreign keys, but that is not as fun as it used to be ...

So, as of the evening of Dec 13, I have an email in to info@ncats.nih.gov concluding with:

"Is there a database schema or rest api documentation (or person) I could consult?"

cmungall commented 7 years ago

Seems to follow a kind of HATEOAS (https://en.wikipedia.org/wiki/HATEOAS) / HAL-type structure, which makes it quite easy to browse the API in a browser (with JSON pretty printing enabled).

I believe you should be able to get most of what is needed from individual URLs e.g.

https://ginas.ncats.nih.gov/ginas/app/api/v1/substances(00006eea-e2d2-4d79-99ff-30f17b3dd740)?view=full
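As a rough sketch, per-substance URLs of this shape can be built from a UUID as below. The URL pattern is copied from the example above; the helper names `substance_url` and `fetch_substance` are made up for illustration, and whether other `view` values exist is not something this thread confirms.

```python
import json
from urllib.request import urlopen

# Base path taken from the example URL in this thread.
BASE = "https://ginas.ncats.nih.gov/ginas/app/api/v1"


def substance_url(uuid, view="full"):
    """Build the per-substance URL, following the pattern of the example above."""
    return f"{BASE}/substances({uuid})?view={view}"


def fetch_substance(uuid, view="full"):
    """Fetch and decode one substance record (needs network access)."""
    with urlopen(substance_url(uuid, view)) as response:
        return json.load(response)


print(substance_url("00006eea-e2d2-4d79-99ff-30f17b3dd740"))
```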

Let's get started with this. And remember, if we go the dipper route we can essentially delay binding, i.e. you can output the triples in any order, so long as things are joined by URIs it all matches up in the graph.
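To illustrate the delayed-binding point, here is a small sketch using plain tuples as triples. It does not use Dipper's API; the `https://example.org/ginas/` URI prefix and the predicate names are placeholders chosen for the example. The only thing it shows is that emitting the statements in a different order yields the same graph, because both statements hang off the same subject URI.

```python
from itertools import chain

BASE = "https://example.org/ginas/"  # hypothetical URI prefix, for illustration only


def label_triples(uuid, name):
    """Yield the naming statement for a substance."""
    yield (BASE + uuid, "rdfs:label", name)


def xref_triples(uuid, system, code):
    """Yield a cross-reference statement for a substance."""
    yield (BASE + uuid, "oboInOwl:hasDbXref", f"{system}:{code}")


uuid = "0022a04d-cdb0-479d-b268-1b42d1843a1f"

# Same statements, emitted in two different orders.
names_first = set(chain(label_triples(uuid, "SARSASAPOGENIN"),
                        xref_triples(uuid, "CAS", "126-19-2")))
xrefs_first = set(chain(xref_triples(uuid, "CAS", "126-19-2"),
                        label_triples(uuid, "SARSASAPOGENIN")))

# A graph is a set of triples, so emission order does not matter.
print(names_first == xrefs_first)
```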

newgene commented 7 years ago

They have their own uuid as the substance id, can we get the list of these ids?

Do we have a point person on their team to get those details? Or should we just use info@ncats.nih.gov?

TomConlin commented 7 years ago

I am coming up on a full week of no response from info@ncats. It may just be the season but an alternative contact would be welcome.

I pulled down their app, which has an embedded H2 database:
https://tripod.nih.gov/ginas/#/gsrs/release
Hope to get some time to explore it soon.

newgene commented 7 years ago

Per Mark's email, this is the latest dump of GSRS:

https://tripod.nih.gov/ginas/#/gsrs/data

looks like one JSON per row, and 84979 rows in total.
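If the dump really is one JSON object per line, it can be streamed without loading the whole file. The sketch below assumes a gzip-compressed file in which every non-empty line is a bare JSON object; the helper names are made up, and the demo writes a tiny stand-in file rather than touching the real dump.

```python
import gzip
import json
import os
import tempfile
from collections import Counter


def iter_records(path):
    """Yield one decoded JSON object per non-empty line of a gzip file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def tally(records, field):
    """Count records by the value of a top-level field."""
    return Counter(str(r.get(field, "<missing>")) for r in records)


# Made-up rows standing in for the real dump.
demo_rows = [
    {"uuid": "u1", "substanceClass": "chemical"},
    {"uuid": "u2", "substanceClass": "mixture"},
    {"uuid": "u3", "substanceClass": "chemical"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_dump.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for row in demo_rows:
            out.write(json.dumps(row) + "\n")
    total = sum(1 for _ in iter_records(path))
    by_class = tally(iter_records(path), "substanceClass")

print(total, dict(by_class))
```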

TomConlin commented 7 years ago

here is a first approximation of a ginas json structure: plural record names may indicate an array of records associated with RECORD.
ginas_record

jmcmurry commented 7 years ago

Hi there; just resurfacing. Any update? Blockers, etc?

TomConlin commented 7 years ago

No update from me; my attention has turned to a Monarch data release. If someone wanted to take a pass at identifying the subset of items we need, I would isolate them when my attention turns back.

mellybelly commented 7 years ago

Seems like there are different types of entities in GINAS. Some are simpler, for example a single drug entity, and share an identifier such as an InChIKey; others may be mixtures or other types of entities without a common identifier across them all. We want to find the identifiers that work best to bridge, but at the same time might focus on the simpler ones first.

mellybelly commented 7 years ago

Also, the modeling that would relate components to parent mixtures, to adverse events, etc. needs to be implemented in DIPPER/Wikidata.

stuppie commented 7 years ago

Number of each class of item:

$ zcat fullSeedData-2016-06-16_cut.json.gz | jq -r '.substanceClass' | sort | uniq -c | sort -k1nr

  56543  chemical
  17288  structurallyDiverse
   6902  concept
   1925  mixture
   1309  polymer
    999  protein
     13  nucleicAcid

stuppie commented 7 years ago

Counts of number of Items by external ID

CAS                                       55799
CFR                                        1538
CODEX ALIMENTARIUS (GSFA)                   267
DEA NO.                                     462
DRUG BANK                                  1769
ECHA (EC/EINECS)                          20376
EMA ASSESSMENT REPORTS                      355
EPA PESTICIDE CODE                         2534
EVMPD                                      5201
Food Contact Sustance Notif, (FCN No.)      264
INN                                        8001
IUPHAR                                     1104
JECFA EVALUATION                           1828
JECFA MONOGRAPH                             252
LIVERTOX                                    923
MERCK INDEX                               14767
MESH                                      10365
NCI_THESAURUS                             10673
NDF-RT                                     1119
RXCUI                                      6663
UNII                                      58464
WHO INTERNATIONAL PHARMACPOEIA              411
WHO-ATC                                    2719
WHO-ESSENTIAL MEDICINES LIST                288
WHO-VATC                                   2909
WIKIPEDIA                                  7172
names                                     58464
preferred_names                           58464
substanceClass                            58464
SMILES                                    56539
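A per-identifier tally like the one above could be produced roughly as follows. This is a sketch, not the code actually used: it assumes each record carries a `codes` list whose entries have a `codeSystem` field (names borrowed from the API key listing earlier in this thread), and it counts each item once per code system even if it holds several codes from that system. The records and code values are made up.

```python
from collections import Counter


def items_per_code_system(records):
    """Count how many records have at least one code in each code system."""
    counts = Counter()
    for record in records:
        # A set, so an item with two codes from one system is counted once.
        systems = {
            code.get("codeSystem")
            for code in record.get("codes", [])
            if code.get("codeSystem")
        }
        counts.update(systems)
    return counts


# Made-up records; the code values are placeholders, not real identifiers.
records = [
    {"uuid": "u1", "codes": [{"codeSystem": "CAS", "code": "cas-1"},
                             {"codeSystem": "CAS", "code": "cas-2"},
                             {"codeSystem": "UNII", "code": "unii-1"}]},
    {"uuid": "u2", "codes": [{"codeSystem": "CAS", "code": "cas-3"}]},
    {"uuid": "u3", "codes": []},
]

counts = items_per_code_system(records)
for system, n in sorted(counts.items()):
    print(f"{system:<10}{n:>6}")
```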

fullSeedData-2016-06-16_drugIDs.csv.gz

stuppie commented 7 years ago

Number of items mapped to ChEBI using CAS Registry Number: 9,963 (Out of 54427 items)

fullSeedData-2016-06-16_chebicas.csv.gz

cmungall commented 7 years ago

How are you mapping CAS to CHEBI?

e.g. 126-19-2 lacks a CHEBI ID (line 37 of your csv):

35      126-19-2                        204-776-3                       9639.0                  M9787                                   CFS802C28F           SARSASAPOGENIN   SARSASAPOGENIN|(3.BETA.,5.BETA.,25S)-SPIROSTAN-3-OL|PARIGENIN|SPIROSTAN-3-OL, (3.BETA.,5.BETA.,25S)-|PURE SARSASAPOGENIN|SARSAGENIN|سارساجينين|sarsagénine|sarsageninum|沙赛吉宁|sarsagenina|сарсагенин       SARSAGENIN      chemical        0022a04d-cdb0-479d-b268-1b42d1843a1f    C[C@H]1[C@H]2[C@H](C[C@H]3[C@@H]4CC[C@@H]5C[C@@H](O)CC[C@]5(C)[C@H]4CC[C@]23C)O[C@]16CC[C@H](C)CO6    

Yet there is a CHEBI entry with this registry number http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:15578

stuppie commented 7 years ago

It's not in biothings :( Not sure if this link will work, but here and attached

(what it would look like with a CAS attached)

I'm not sure why it's not there... will have to talk to Julee and @newgene

newgene commented 7 years ago

Here is the parser from @jadesara for CHEBI:

https://github.com/SuLab/biothings.drugs/tree/master/src/dataload/contrib/chebi

It's parsed from the ChEBI_complete.sdf.gz file (ftp://ftp.ebi.ac.uk/pub/databases/chebi/SDF/). It does not look like CAS 126-19-2 is there, but the "CHEBI:15578" record exists. sigh...

stuppie commented 7 years ago

Was just about to post the same thing..! It is not in that flat file :/ It is however in the obo file

[Term]
id: CHEBI:15578
name: (25S)-5beta-spirostan-3beta-ol
alt_id: CHEBI:178
alt_id: CHEBI:10854
alt_id: CHEBI:18537
subset: 3_STAR
synonym: "(3beta,5beta,25S)-spirostan-3-ol" RELATED [ChemIDplus]
...
xref: KNApSAcK:C00003590 
xref: Beilstein:91757 "Beilstein"
xref: KEGG:C03963 
xref: CAS:126-19-2 "KEGG COMPOUND"
xref: CAS:126-19-2 "ChemIDplus"
is_a: CHEBI:26606
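Since the CAS numbers appear as `xref: CAS:...` lines in the obo file, a CAS to ChEBI lookup could be pulled from there instead of the SDF. A minimal sketch, assuming stanzas shaped like the one above; it is a line-based scan rather than a full OBO parser, and `cas_to_chebi` is a made-up helper name.

```python
import re

# Matches lines such as: xref: CAS:126-19-2 "ChemIDplus"
XREF_CAS = re.compile(r"^xref:\s*CAS:(\S+)")


def cas_to_chebi(lines):
    """Map each CAS number to the set of term ids that list it as an xref."""
    mapping = {}
    term_id = None
    in_term = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            term_id = None
        elif in_term and line.startswith("id: "):
            term_id = line[len("id: "):]
        elif in_term and term_id:
            match = XREF_CAS.match(line)
            if match:
                mapping.setdefault(match.group(1), set()).add(term_id)
    return mapping


# Excerpt of the stanza quoted above.
stanza = """[Term]
id: CHEBI:15578
name: (25S)-5beta-spirostan-3beta-ol
alt_id: CHEBI:178
xref: KEGG:C03963
xref: CAS:126-19-2 "KEGG COMPOUND"
xref: CAS:126-19-2 "ChemIDplus"
is_a: CHEBI:26606
""".splitlines()

mapping = cas_to_chebi(stanza)
print(mapping)
```

Keeping a set per CAS number leaves room for the case where more than one ChEBI term lists the same registry number.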
stuppie commented 7 years ago

Mapping CAS and UNII to wikidata ID (using pubchem as data source) https://github.com/stuppie/ncats-ingest/blob/master/ginas/map_to_mydrug/map_chemicals_to_wikidata.ipynb

stuppie commented 7 years ago

1,751 ginas compounds with either

stuppie commented 7 years ago

Example ginas doc that will be in mydrug. Almost exactly the same format as from ginas except with a couple added fields for readability: preferred_name, cas_primary, xrefs, names_list

https://raw.githubusercontent.com/stuppie/biothings.drugs/master/data/ginas_ascordbic_acid.json

Tip: Check out the JSON Formatter Chrome extension to view and interact with JSON (as collapsible trees) in your browser.

jmcmurry commented 7 years ago

Proposed stepwise execution plan

jmcmurry commented 7 years ago

Regarding identification of relevant subset, what is the critical path? Is it adequate to have Biothings Schema together with a glance at the GINAS source data? Or do we need the actual ingest and indexing to be complete?