Open cmungall opened 7 years ago
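# Fetch the API root, which lists one href per resource
curl -s https://ginas.ncats.nih.gov/ginas/app/api/v1 > v1.json
# Download the first page of each resource into <resource>.json
for url in $(jq -r '.[].href' v1.json); do echo "${url}"; curl "${url}" > "${url##*/}.json"; done
# Alternative using wget:
#for url in $(jq -r '.[].href' v1.json);do wget ${url};done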
backups.json
codes.json
edits.json
jobs.json
keywords.json
names.json
payload.json
references.json
structures.json
substances.json
v1.json
values.json
vocabularies.json
xrefs.json
These look a lot like relational table names.
Each .json is structured as a partial results page (default 10?) with pointers to the next/previous pages.
Each has a content structure whose attributes I expect are column names.
for j in *.json; do
echo ${j};
jq -r '.content[]|keys' "${j}" |\
tr -d '\n' | tr \] '\n' | tr -d \" | sed 's/^\[//g' | sort -u ;\
echo "";
done
########################################################
backups.json
created, deprecated, id, kind, modified, version
codes.json
_self, access, code, codeSystem, comments, created, createdBy, deprecated, lastEdited, lastEditedBy, references, type, url, uuid
_self, access, code, codeSystem, created, createdBy, deprecated, lastEdited, lastEditedBy, references, type, url, uuid
_self, access, code, codeSystem, created, createdBy, deprecated, lastEdited, lastEditedBy, references, type, uuid
edits.json
created, diff, editor, kind, newValue, oldValue, refid
created, diff, editor, kind, newValue, oldValue, refid, version
jobs.json
_owner, _payload, id, keys, message, name, start, statistics, status, stop, version
keywords.json
id, term
names.json
_self, access, created, createdBy, deprecated, displayName, domains, languages, lastEdited, lastEditedBy, name, nameJurisdiction, nameOrgs, preferred, references, type, uuid
payload.json
created, id, mimeType, name, properties, sha1, size
references.json
_self, access, citation, created, createdBy, deprecated, docType, documentDate, id, lastEdited, lastEditedBy, publicDomain, tags, uuid
_self, access, citation, created, createdBy, deprecated, docType, documentDate, lastEdited, lastEditedBy, publicDomain, tags, uuid
_self, access, citation, created, createdBy, deprecated, docType, lastEdited, lastEditedBy, publicDomain, tags, uuid
structures.json
_properties, access, atropisomerism, charge, count, created, createdBy, definedStereo, deprecated, digest, ezCenters, formula, hash, id, lastEdited, lastEditedBy, molfile, mwt, opticalActivity, references, self, smiles, stereoCenters, stereoComments, stereochemistry
substances.json
_approvalIDDisplay, _codes, _moieties, _name, _names, _references, _self, access, approvalID, approved, approvedBy, created, createdBy, definitionLevel, definitionType, deprecated, lastEdited, lastEditedBy, status, structure, substanceClass, uuid, version
_approvalIDDisplay, _codes, _name, _names, _properties, _references, _self, access, approvalID, approved, approvedBy, created, createdBy, definitionLevel, definitionType, deprecated, lastEdited, lastEditedBy, modifications, protein, status, substanceClass, uuid, version
_approvalIDDisplay, _codes, _name, _names, _references, _relationships, _self, access, approvalID, approved, approvedBy, created, createdBy, definitionLevel, definitionType, deprecated, lastEdited, lastEditedBy, modifications, status, structurallyDiverse, substanceClass, uuid, version
_approvalIDDisplay, _name, _names, _references, _relationships, _self, access, approvalID, approved, approvedBy, created, createdBy, definitionLevel, definitionType, deprecated, lastEdited, lastEditedBy, modifications, status, structurallyDiverse, substanceClass, uuid, version
values.json
id, term
vocabularies.json
created, deprecated, domain, editable, fields, filterable, id, modified, terms, version, vocabularyTermType
xrefs.json
created, deprecated, href, id, kind, modified, properties, refid, version
Above I have listed the distinct column patterns found in the default result blob for each resource.
This is not expected to be a full list of columns, as some may not have occurred in the first default set of rows returned (I expect 10 rows but have not confirmed).
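For anyone who prefers Python to the jq/tr/sed pipeline, the same "distinct column patterns" listing can be sketched roughly as below. It only assumes what the jq command above assumes, namely that each page has a `content` array of objects; the sample page is a made-up stand-in for a file such as edits.json, and the function names are hypothetical.

```python
import json

def key_patterns(page):
    """Return the distinct sorted key tuples seen across page['content']."""
    return sorted({tuple(sorted(row)) for row in page.get("content", [])})

def all_keys(page):
    """Union of every key seen in any row of the page."""
    return sorted({key for row in page.get("content", []) for key in row})

# Tiny stand-in for a downloaded page such as edits.json (values are made up).
sample_page = json.loads("""
{"content": [
  {"created": 1, "diff": "", "editor": "a", "kind": "k",
   "newValue": "", "oldValue": "", "refid": "r"},
  {"created": 2, "diff": "", "editor": "b", "kind": "k",
   "newValue": "", "oldValue": "", "refid": "r", "version": "2"}
]}
""")

# One line per distinct key pattern, then the union of all keys.
for pattern in key_patterns(sample_page):
    print(", ".join(pattern))
print("union:", ", ".join(all_keys(sample_page)))
```

The union is probably the better guess at a column list, but it is still limited to whatever rows happened to be on the first page.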
We could slog through reconstituting the foreign keys, but that is not as fun as it used to be ...
So, as of evening of Dec 13, I have an email in to info@ncats.nih.gov concluding with:
"Is there a database schema or rest api documentation (or person) I could consult? "
It seems to follow a kind of HATEOAS (https://en.wikipedia.org/wiki/HATEOAS) / HAL-type structure, which makes it quite easy to browse the API in a browser (with JSON pretty printing enabled).
I believe you should be able to get most of what is needed from individual URLs e.g.
Let's get started with this. And remember, if we go the dipper route we can essentially delay binding, i.e. you can output the triples in any order, so long as things are joined by URIs it all matches up in the graph.
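To illustrate the delayed-binding point, here is a toy sketch (plain Python sets, not dipper's actual API). The substance URI and the predicate names are hypothetical; the CAS number, uuid prefix and CHEBI ID are the sarsasapogenin example from elsewhere in this thread. Triples from two independent sources are emitted in arbitrary order and still join up through the shared CAS node.

```python
import random

# Hypothetical URI for a GINAS substance; predicates are illustrative only.
SUBSTANCE = "https://example.org/ginas/0022a04d"
CAS_NODE = "CAS:126-19-2"

def ginas_triples():
    """Triples that a GINAS ingest might emit."""
    yield (SUBSTANCE, "ex:label", "SARSASAPOGENIN")
    yield (SUBSTANCE, "ex:hasCAS", CAS_NODE)

def chebi_triples():
    """Triples that a separate ChEBI ingest might emit."""
    yield ("CHEBI:15578", "ex:xref", CAS_NODE)

def build_graph(triples):
    """A graph is just a set of triples, so emission order is irrelevant."""
    return set(triples)

def chebi_for_substance(graph, substance):
    """Join substance -> CAS -> ChEBI term purely through shared identifiers."""
    cas_nodes = {o for s, p, o in graph if s == substance and p == "ex:hasCAS"}
    return {s for s, p, o in graph if p == "ex:xref" and o in cas_nodes}

triples = list(ginas_triples()) + list(chebi_triples())
shuffled = triples[:]
random.shuffle(shuffled)

# Same graph and same join result regardless of output order.
assert build_graph(triples) == build_graph(shuffled)
print(chebi_for_substance(build_graph(shuffled), SUBSTANCE))
```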
They have their own uuid as the substance id, can we get the list of these ids?
Do we have a point person on their team to get those details? Or should we just use info@ncats.nih.gov?
I am coming up on a full week of no response from info@ncats. It may just be the season but an alternative contact would be welcome.
I pulled down their app, which has an embedded H2 database:
https://tripod.nih.gov/ginas/#/gsrs/release
Hope to get some time to explore it soon.
Per Mark's email, this is the latest dump of GSRS:
https://tripod.nih.gov/ginas/#/gsrs/data
looks like one JSON per row, and 84979 rows in total.
here is a first approximation of a ginas json structure:
Plural record names may indicate an array of records associated with RECORD.
Hi there; just resurfacing. Any update? Blockers, etc?
No update from me; my attention has turned to a Monarch data release. If someone wanted to take a pass at identifying the subset of items we need, I would isolate them when my attention turns back.
It seems like there are different types of entities in GINAS. Some are simpler, for example a single drug entity, and have, for example, an InChIKey in common; others may be mixtures or other types of entities without a common identifier across them all. We want to find the identifiers that work best as a bridge, but at the same time we might focus on the simpler ones first.
Also, modeling that relates components to parent mixtures, to adverse events, etc. needs to be implemented in DIPPER/Wikidata.
Number of each class of item:
$ zcat fullSeedData-2016-06-16_cut.json.gz | jq -r '.substanceClass' | sort | uniq -c | sort -k1nr
56543 chemical
17288 structurallyDiverse
6902 concept
1925 mixture
1309 polymer
999 protein
13 nucleicAcid
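A rough Python equivalent of the zcat/jq count, in case it needs to live inside an ingest script. It assumes the gzipped file holds one JSON object per line with a top-level `substanceClass` field, as the jq command implies; the function name and the synthetic demo records are made up.

```python
import gzip
import json
import os
import tempfile
from collections import Counter

def count_substance_classes(path):
    """Count records per substanceClass in a gzipped, line-delimited JSON file."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            counts[record.get("substanceClass")] += 1
    return counts

# Demo on a tiny synthetic file (made-up records, not GSRS data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.json.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as out:
        for cls in ["chemical", "protein", "chemical"]:
            out.write(json.dumps({"substanceClass": cls}) + "\n")
    demo_counts = count_substance_classes(demo_path)

# Highest count first, like `sort -k1nr`.
for cls, n in demo_counts.most_common():
    print(n, cls)
```

Pointing `count_substance_classes` at the real dump should reproduce the table above, provided the file really is one record per line.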
Number of items by external ID:
CAS 55799
CFR 1538
CODEX ALIMENTARIUS (GSFA) 267
DEA NO. 462
DRUG BANK 1769
ECHA (EC/EINECS) 20376
EMA ASSESSMENT REPORTS 355
EPA PESTICIDE CODE 2534
EVMPD 5201
Food Contact Sustance Notif, (FCN No.) 264
INN 8001
IUPHAR 1104
JECFA EVALUATION 1828
JECFA MONOGRAPH 252
LIVERTOX 923
MERCK INDEX 14767
MESH 10365
NCI_THESAURUS 10673
NDF-RT 1119
RXCUI 6663
UNII 58464
WHO INTERNATIONAL PHARMACPOEIA 411
WHO-ATC 2719
WHO-ESSENTIAL MEDICINES LIST 288
WHO-VATC 2909
WIKIPEDIA 7172
names 58464
preferred_names 58464
substanceClass 58464
SMILES 56539
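A per-identifier tally like the one above could be produced along these lines. This is only a sketch: it assumes each substance record carries a `codes` list whose entries have `codeSystem` and `code` fields (the field names seen in codes.json earlier in this thread), which may not match the dump exactly. The first demo record uses the CAS and UNII values from the sarsasapogenin row quoted later in the thread; the second record is invented.

```python
from collections import Counter

def items_per_code_system(records):
    """Count how many records have at least one code in each code system."""
    counts = Counter()
    for record in records:
        # Use a set so a record with two CAS numbers is counted once for CAS.
        systems = {
            code.get("codeSystem")
            for code in record.get("codes", [])
            if code.get("codeSystem")
        }
        counts.update(systems)
    return counts

demo_records = [
    {"codes": [
        {"codeSystem": "CAS", "code": "126-19-2"},
        {"codeSystem": "UNII", "code": "CFS802C28F"},
    ]},
    # Invented record: two CAS codes, no UNII.
    {"codes": [
        {"codeSystem": "CAS", "code": "0000-00-0"},
        {"codeSystem": "CAS", "code": "0000-00-1"},
    ]},
    {},  # a record with no codes at all
]

demo_tally = items_per_code_system(demo_records)
for system in sorted(demo_tally):
    print(system, demo_tally[system])
```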
Number of items mapped to ChEBI using CAS Registry Number: 9,963 (Out of 54427 items)
How are you mapping CAS to CHEBI?
e.g. 126-19-2 lacks a CHEBI ID (line 37 of your csv):
35 126-19-2 204-776-3 9639.0 M9787 CFS802C28F SARSASAPOGENIN SARSASAPOGENIN|(3.BETA.,5.BETA.,25S)-SPIROSTAN-3-OL|PARIGENIN|SPIROSTAN-3-OL, (3.BETA.,5.BETA.,25S)-|PURE SARSASAPOGENIN|SARSAGENIN|سارساجينين|sarsagénine|sarsageninum|沙赛吉宁|sarsagenina|сарсагенин SARSAGENIN chemical 0022a04d-cdb0-479d-b268-1b42d1843a1f C[C@H]1[C@H]2[C@H](C[C@H]3[C@@H]4CC[C@@H]5C[C@@H](O)CC[C@]5(C)[C@H]4CC[C@]23C)O[C@]16CC[C@H](C)CO6
Yet there is a CHEBI entry with this registry number http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:15578
Here is the parser from @jadesara for CHEBI:
https://github.com/SuLab/biothings.drugs/tree/master/src/dataload/contrib/chebi
It's parsed from ChEBI_complete.sdf.gz file (ftp://ftp.ebi.ac.uk/pub/databases/chebi/SDF/). Does not look like CAS 126-19-2 is there, but "CHEBI:15578" record exists. sigh...
Was just about to post the same thing..! It is not in that flat file :/ It is however in the obo file
[Term]
id: CHEBI:15578
name: (25S)-5beta-spirostan-3beta-ol
alt_id: CHEBI:178
alt_id: CHEBI:10854
alt_id: CHEBI:18537
subset: 3_STAR
synonym: "(3beta,5beta,25S)-spirostan-3-ol" RELATED [ChemIDplus]
...
xref: KNApSAcK:C00003590
xref: Beilstein:91757 "Beilstein"
xref: KEGG:C03963
xref: CAS:126-19-2 "KEGG COMPOUND"
xref: CAS:126-19-2 "ChemIDplus"
is_a: CHEBI:26606
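Since the CAS xrefs are in the obo file but not in the SDF, a CAS-to-ChEBI lookup could be built straight from the `xref: CAS:` lines. Below is a minimal sketch that hand-parses `[Term]` stanzas; the function name is hypothetical, the sample is the stanza quoted above (minus the elided lines), and a real run would stream the full chebi.obo instead.

```python
from collections import defaultdict

def cas_to_chebi(obo_lines):
    """Map CAS registry numbers to the set of ChEBI IDs that xref them."""
    mapping = defaultdict(set)
    current_id = None
    in_term = False
    for raw in obo_lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            current_id = None
        elif in_term and line.startswith("id: "):
            current_id = line[len("id: "):].strip()
        elif in_term and current_id and line.startswith("xref: CAS:"):
            # e.g. 'xref: CAS:126-19-2 "KEGG COMPOUND"' -> '126-19-2'
            cas = line[len("xref: CAS:"):].split()[0]
            mapping[cas].add(current_id)
    return dict(mapping)

sample_obo = '''
[Term]
id: CHEBI:15578
name: (25S)-5beta-spirostan-3beta-ol
alt_id: CHEBI:178
xref: KEGG:C03963
xref: CAS:126-19-2 "KEGG COMPOUND"
xref: CAS:126-19-2 "ChemIDplus"
is_a: CHEBI:26606
'''

lookup = cas_to_chebi(sample_obo.splitlines())
print(lookup)
```

Using a set per CAS number collapses the duplicate xrefs (one per source) and leaves room for a CAS number that legitimately points at more than one ChEBI term.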
Mapping CAS and UNII to wikidata ID (using pubchem as data source) https://github.com/stuppie/ncats-ingest/blob/master/ginas/map_to_mydrug/map_chemicals_to_wikidata.ipynb
1,751 ginas compounds with either
Example ginas doc that will be in mydrug. Almost exactly the same format as from ginas except with a couple added fields for readability: preferred_name, cas_primary, xrefs, names_list
https://raw.githubusercontent.com/stuppie/biothings.drugs/master/data/ginas_ascordbic_acid.json
Tip: Check out the JSON Formatter Chrome extension to view and interact with JSON (as collapsible trees) in your browser.
Proposed stepwise execution plan
Regarding identification of relevant subset, what is the critical path? Is it adequate to have Biothings Schema together with a glance at the GINAS source data? Or do we need the actual ingest and indexing to be complete?
Example: https://ginas.ncats.nih.gov/ginas/app/substance/4372d367