PNNL-CompBio / coderdata

Automation scripts and benchmark dataset package for cancer drug prediction deep learning models.

NCI60 - Several incorrect values in chem_name #236

Open jjacobson95 opened 1 week ago

jjacobson95 commented 1 week ago

The validation script is reporting several errors in the chem_name column of the NCI60 drugs file. (This data currently ends up in broad_sanger_drugs.tsv.)

[ERROR] [/tmp/nci60_drugs.tsv/2089] 300000000.0 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/nci60_drugs.tsv/189407] 200000000.0 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/nci60_drugs.tsv/196988] 4e+96 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/nci60_drugs.tsv/272367] 0.0 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/nci60_drugs.tsv/519798] 5052 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/nci60_drugs.tsv/520016] 138061 is not of type 'string', 'null' in /chem_name
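
To see which drugs sit behind these values, one option is to re-read the file with every cell kept as text and list the rows whose chem_name parses as a number. This is a minimal sketch, not the validation script itself: `find_numeric_chem_names` and the `drug_id` column in the sample are hypothetical, and the sample rows are invented.

```python
import csv
import io


def looks_numeric(value):
    """Return True if Python's float() accepts the text (also true for 'nan' and 'inf')."""
    try:
        float(value)
    except ValueError:
        return False
    return True


def find_numeric_chem_names(tsv_text, column="chem_name"):
    """Return (data_row_number, raw_text) for every numeric-looking entry in `column`.

    The csv module never converts cells, so the raw text is preserved as written.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = []
    for row_number, row in enumerate(reader, start=1):
        value = (row.get(column) or "").strip()
        if value and looks_numeric(value):
            hits.append((row_number, value))
    return hits


# Invented sample rows; the real drugs file has more columns.
sample = "drug_id\tchem_name\nSMI_1\taspirin\nSMI_2\t4E96\nSMI_3\t5052\nSMI_4\t\n"
for row_number, raw in find_numeric_chem_names(sample):
    print(row_number, raw)
```

Run against the real file, the raw text in each flagged cell would show what the name looked like before any numeric conversion.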

sgosline commented 1 week ago

Do we know what drugs are causing this?

jjacobson95 commented 1 week ago

Of the six errors above (all that the validation script found), only two values could be found through this simple search.

Both mapped to the same drug: SMI_54937, PubChem ID 581, which can be found on PubChem. The values do not appear as identifiers here.

[Screenshot, 2024-10-21 9:22:12 AM]

sgosline commented 1 week ago

Here is the REST call for that compound (according to the pubchem_retrieval.py script): https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/CID/581/synonyms/JSON I don't see those values coming up as chem_names.
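
For anyone who wants to repeat that check, here is a small sketch of the same synonyms request using only the standard library. It is not the code in pubchem_retrieval.py; `fetch_synonyms` and `extract_synonyms` are hypothetical helper names, and the payload in the example is a made-up stand-in that follows the `InformationList` / `Information` / `Synonym` layout of the PUG REST synonyms response.

```python
import json
import urllib.request

# URL pattern taken from the REST call quoted above.
PUG_SYNONYMS_URL = (
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/CID/{cid}/synonyms/JSON"
)


def extract_synonyms(payload):
    """Collect every synonym string from a decoded synonyms response."""
    entries = payload.get("InformationList", {}).get("Information", [])
    names = []
    for entry in entries:
        names.extend(entry.get("Synonym", []))
    return names


def fetch_synonyms(cid, timeout=30):
    """Request the synonyms for one CID (needs network access)."""
    url = PUG_SYNONYMS_URL.format(cid=cid)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return extract_synonyms(json.load(response))


# Offline example with an invented payload, so nothing here depends on the network.
fake_payload = {
    "InformationList": {
        "Information": [{"CID": 581, "Synonym": ["example-name", "12345"]}]
    }
}
print(extract_synonyms(fake_payload))
```

Calling `fetch_synonyms(581)` and searching the result for the flagged values would confirm whether any of them come back from PubChem at all.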

I think these are somehow missed by the pubchem call and instead get added as NSC identifiers, but without the 'NSC' prefix. https://github.com/PNNL-CompBio/coderdata/blob/4025953f73d1285c6016166d8e87ad537652024c/build/broad_sanger/03a-nci60Drugs.py#L92
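
One way a bare identifier could end up failing the string check, offered as a guess rather than a diagnosis: if a name with no letters (or with only an 'E' in it) passes through permissive type inference, it comes out as a number. The sketch below is a hypothetical stand-in for that kind of inference. `infer_scalar` is not a function from coderdata or the validator, and the input strings are invented examples, not values recovered from the drugs file.

```python
def infer_scalar(text):
    """Hypothetical permissive inference: try int, then float, else keep the string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def is_string_or_null(value):
    """Mirror the schema rule quoted in the errors: type 'string' or 'null'."""
    return value is None or isinstance(value, str)


# Invented inputs: a bare number, two codes containing 'E', and a prefixed name.
for raw in ["5052", "3E8", "4E96", "NSC 5052"]:
    parsed = infer_scalar(raw)
    status = "ok" if is_string_or_null(parsed) else "not a string"
    print(repr(raw), "->", repr(parsed), status)
```

In this toy model only the prefixed name survives as a string, which is consistent with the idea that the missing 'NSC' prefix is what exposes these entries to numeric parsing.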

jjacobson95 commented 4 days ago

I have been working on the schema checker, and the previous errors were truncated. This issue also appears in the other datasets that use these drugs. The additional errors below may help with tracking down the issue.

[ERROR] [/tmp/prism_drugs.tsv/19740] 4e+96 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/prism_drugs.tsv/50270] 5052 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/prism_drugs.tsv/50305] 138061 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/gdscv1_drugs.tsv/5103] 4e+96 is not of type 'string', 'null' in /chem_name
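
To make the overlap between datasets easier to see, the error lines can be grouped by offending value. This is a rough sketch under a few assumptions: the regular expression is fitted only to the lines pasted in this thread, the trailing number in each path is treated as a row index, and `files_by_value` is a hypothetical helper rather than part of the schema checker.

```python
import re
from collections import defaultdict
from posixpath import basename

# Matches lines such as:
# [ERROR] [/tmp/prism_drugs.tsv/19740] 4e+96 is not of type 'string', 'null' in /chem_name
ERROR_RE = re.compile(
    r"\[ERROR\]\s*\[(?P<path>.+)/(?P<index>\d+)\]\s+(?P<value>\S+) is not of type"
)


def files_by_value(log_text):
    """Map each offending value to the sorted list of files that report it."""
    grouped = defaultdict(set)
    for line in log_text.splitlines():
        match = ERROR_RE.search(line)
        if match:
            grouped[match["value"]].add(basename(match["path"]))
    return {value: sorted(files) for value, files in grouped.items()}


# A few of the lines reported in this thread.
log = """\
[ERROR] [/tmp/nci60_drugs.tsv/196988] 4e+96 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/prism_drugs.tsv/19740] 4e+96 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/prism_drugs.tsv/50270] 5052 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/gdscv1_drugs.tsv/5103] 4e+96 is not of type 'string', 'null' in /chem_name
"""
for value, files in files_by_value(log).items():
    print(value, files)
```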