Open jjacobson95 opened 1 week ago
Do we know what drugs are causing this?
Out of this small list (the 6 errors above were all of the errors that the validation script found), only two could be found through this simple search.
Both mapped to the same drug: SMI_54937, PubChem ID 581, which can be found on PubChem. The values are not listed as identifiers there.
Here is the REST call for that compound (according to the pubchem_retrieval.py script): https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/CID/581/synonyms/JSON I don't see those values coming up as chem_names.
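For anyone who wants to repeat that check, here is a rough sketch of how the synonyms response could be compared against the flagged values. The helper names (`synonyms_url`, `extract_synonyms`, `fetch_synonyms`, `missing_from_synonyms`) are made up for illustration and are not taken from pubchem_retrieval.py. It assumes the usual PUG REST synonyms layout (`InformationList` → `Information` → `Synonym`).

```python
import json
from urllib.request import urlopen

PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def synonyms_url(cid):
    """Build the same synonyms URL as the one quoted above."""
    return f"{PUG_BASE}/compound/CID/{cid}/synonyms/JSON"


def extract_synonyms(payload):
    """Collect every synonym string from a parsed synonyms response.

    Assumes the InformationList/Information/Synonym layout; missing
    keys simply yield an empty list.
    """
    entries = payload.get("InformationList", {}).get("Information", [])
    names = []
    for entry in entries:
        names.extend(entry.get("Synonym", []))
    return names


def fetch_synonyms(cid, timeout=30):
    """Download and parse the synonyms for one CID (needs network access)."""
    with urlopen(synonyms_url(cid), timeout=timeout) as resp:
        return extract_synonyms(json.load(resp))


def missing_from_synonyms(values, synonyms):
    """Return the values that do not match any synonym (case-insensitive)."""
    known = {s.lower() for s in synonyms}
    return [v for v in values if str(v).lower() not in known]
```

Calling `missing_from_synonyms(flagged_values, fetch_synonyms(581))` would then list which flagged values PubChem does not return for that compound.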
I think these are somehow missed by the pubchem call and instead get added as NSC identifiers, but without the 'NSC' prefix. https://github.com/PNNL-CompBio/coderdata/blob/4025953f73d1285c6016166d8e87ad537652024c/build/broad_sanger/03a-nci60Drugs.py#L92
I have been working on the schema checker, and the previous errors were truncated. This issue also appears in the other datasets that use these drugs. The full messages below may help with tracking it down.
[ERROR] [/tmp/prism_drugs.tsv/19740] 4e+96 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/prism_drugs.tsv/50270] 5052 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/prism_drugs.tsv/50305] 138061 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/gdscv1_drugs.tsv/5103] 4e+96 is not of type 'string', 'null' in /chem_name
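A quick way to find every affected row before running the full schema check might look like the sketch below. It only uses the standard library and flags any chem_name that parses as a number. The function names and the sample rows are hypothetical, and the raw text `4E96` is a guess at what the validator reports as `4e+96`; the actual file contents were not checked here.

```python
import csv
import io


def looks_numeric(value):
    """True if Python would parse the text as a float (e.g. '5052', '4E96')."""
    try:
        float(value)
    except (TypeError, ValueError):
        return False
    return True


def numeric_chem_names(tsv_text, column="chem_name"):
    """Return (data_row_number, raw_value) for numeric-looking names.

    Row numbers count data rows from 1 and do not include the header.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = []
    for row_number, row in enumerate(reader, start=1):
        value = row.get(column)
        if value and looks_numeric(value.strip()):
            hits.append((row_number, value))
    return hits


# Hypothetical sample rows; the drug_id column name is an assumption.
sample = "drug_id\tchem_name\nSMI_1\t4E96\nSMI_2\taspirin\nSMI_3\t5052\n"
hits = numeric_chem_names(sample)
```

Note that `float()` also accepts text such as `nan` or `inf`, so a name like that would be flagged too. If the root cause is type inference on load, reading the column as strings would be the likely fix, but that is an assumption until the loader is checked.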
The validation script is finding several errors in the NCI60 chem_name column in the drugs file. (These currently end up in broad_sanger_drugs.tsv.)