Knowledge-Graph-Hub / kg-idg

A Knowledge Graph to Illuminate the Druggable Genome
https://knowledge-graph-hub.github.io/kg-idg/
BSD 3-Clause "New" or "Revised" License

Parse the DrugCentral 'structures' table to get more metadata #72

Closed: caufieldjh closed 2 years ago

caufieldjh commented 2 years ago

The comprehensive collection of drug-specific metadata (including all names and availability of formulations) is in the structures table of the DrugCentral dump, but it's challenging to parse because some of its values are full structure descriptions, complete with newlines. This PR will handle parsing of this table.
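To illustrate the newline problem, here is a minimal sketch using only Python's standard csv module. It assumes the table has been exported as comma-separated text with double-quoted multi-line values; the actual dump format and the parsing path through Koza may differ. The sample rows and the read_quoted_rows helper are hypothetical.

```python
import csv
import io


def read_quoted_rows(handle, delimiter=","):
    """Yield rows from a delimited file whose quoted fields may span lines."""
    # csv.reader keeps consuming physical lines until the closing quote,
    # so one logical row can cover several lines of the file.
    reader = csv.reader(handle, delimiter=delimiter, quotechar='"')
    for row in reader:
        yield row


# Hypothetical sample: the molfile column of row 1 spans three lines.
sample = (
    "id,molfile,name\n"
    '1,"line one\nline two\nM  END",example drug A\n'
    '2,"single",example drug B\n'
)

# newline="" disables newline translation, as the csv docs recommend.
rows = list(read_quoted_rows(io.StringIO(sample, newline="")))

for row in rows:
    print(len(row), repr(row[1]))
```

Splitting the same text on "\n" would produce five fragments instead of three rows, which is why a quote-aware reader is needed here.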

This PR also includes manual updates to the SSSOM DrugCentral CURIE map, based on the structure IDs and names extracted from the PostgreSQL DB with:

psql -d temp -c "COPY ( SELECT cd_id, id, name FROM structures ) TO stdout" > dump.tsv

The id field is what's referred to as struct_id elsewhere.
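For reference, a rough sketch of how the dump.tsv produced by that command could be read into an ID-to-name lookup. COPY ... TO stdout in its default text format writes tab-separated columns and uses \N for NULL. The "DrugCentral" prefix, the helper names, and the sample rows are assumptions for illustration, not the actual contents of the CURIE map.

```python
def parse_copy_line(line):
    """Split one line of PostgreSQL COPY text output; \\N becomes None."""
    fields = line.rstrip("\n").split("\t")
    return [None if field == r"\N" else field for field in fields]


def build_id_name_map(lines, prefix="DrugCentral"):
    """Map '<prefix>:<id>' to name, skipping rows with a NULL id or name.

    Note: this sketch does not undo COPY backslash escapes inside values.
    """
    mapping = {}
    for line in lines:
        if not line.strip():
            continue
        cd_id, struct_id, name = parse_copy_line(line)
        if struct_id is None or name is None:
            continue
        mapping[f"{prefix}:{struct_id}"] = name
    return mapping


# Hypothetical rows in column order cd_id, id, name.
sample_lines = [
    "1\t4\texample drug A\n",
    "2\t5\t\\N\n",
]

id_to_name = build_id_name_map(sample_lines)
print(id_to_name)
```

With a real file, the same function can be called as build_id_name_map(open("dump.tsv", encoding="utf-8")).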

caufieldjh commented 2 years ago

The molimg element in the structures table is an image, so when the CSV parser hits it, it throws (through Koza) a "field larger than field limit" error. Alternatively, this could be due to the double quotes surrounding the molfile element, but I think it's the first one. (It was the first one.)
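The error can be reproduced with the standard csv module alone, as in the sketch below. It sets the limit explicitly to 131072 (the stdlib default) so the demonstration is repeatable, then raises it. The sample data and the read_rows helper are hypothetical, and this does not show how Koza itself is configured; dropping the molimg column before parsing would be another option.

```python
import csv
import io

# Start from the stdlib default limit so the demonstration is repeatable.
csv.field_size_limit(131072)

# Hypothetical row with one oversized value standing in for image data.
big_value = "x" * 200_000
sample = f"id,molimg\n1,{big_value}\n"


def read_rows(text, limit=None):
    """Parse CSV text, optionally raising the per-field size limit first."""
    if limit is not None:
        csv.field_size_limit(limit)
    return list(csv.reader(io.StringIO(text, newline="")))


# With the default limit, the oversized field raises csv.Error.
try:
    read_rows(sample)
    error_message = None
except csv.Error as exc:
    error_message = str(exc)

print(error_message)

# With a larger limit, the same text parses normally.
rows = read_rows(sample, limit=10_000_000)
print(len(rows[1][1]))
```

csv.field_size_limit changes a module-wide setting, so it affects every reader created afterwards in the same process.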

sonarcloud[bot] commented 2 years ago

SonarCloud Quality Gate failed.

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
0 Code Smells (rating A)

70.4% Coverage
0.0% Duplication