CalebBell / chemicals

chemicals: Chemical database of Chemical Engineering Design Library (ChEDL)
MIT License

incorrect separators on water synonyms #29

Open longemen3000 opened 3 years ago

longemen3000 commented 3 years ago

What is the search string?
caustic soda liquid;aquafina;distilled water;hydrogen oxide (h2o);ultrexii ultrapure;

Which chemical in the database do you believe should be found?
It's water, but the separators here are wrong.

CalebBell commented 3 years ago

Hi Andrés, Like all software not maintained, bits and pieces of the chemicals-metadata repository have rotted away. I cannot get the inchi module in rdkit to work for me, and I am having issues building rdkit. Thanks for letting me know about the issue. I'm afraid we may have to manually patch the file for now. Sincerely, Caleb

CalebBell commented 3 years ago

Hi Andrés, I found a version of rdkit which works on linux - and it's on pypi! One step closer to being able to update the database again. I think I actually need to port chemical-metadata to Python 3 as well.

Sincerely, Caleb

longemen3000 commented 3 years ago

What do you think of adding ; as an additional separator? The main problem would be checking whether other names actually have ; as part of their name. Maybe adding:

line = line.replace(';','\t')

before this line https://github.com/CalebBell/chemicals/blob/c5b1014d42216eaa93bf2fd46aec2d35beb82b8e/chemicals/identifiers.py#L370 could solve the problem temporarily?

Also, I noticed (from a quick look, nothing exhaustive) that the synonyms separated by ';' are always at the end of the list.

Edit: the split ; must always be done after parsing the InChI
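
The ordering matters because InChI strings can themselves contain semicolons (for example `InChI=1S/CH2.Co/h1H2;/q-1;+1`), so replacing every ';' in the raw line would also break that column. Below is a minimal sketch of the safer order. It assumes a tab-separated record whose first `n_fixed` columns are identifier fields (the InChI among them) and whose remaining columns are synonyms. `split_record` and `n_fixed` are hypothetical names for illustration, not part of `chemicals`, and the sample record is made up.

```python
def split_record(line, n_fixed):
    """Split one tab-separated record into identifier fields and synonyms.

    Assumption: the first `n_fixed` columns (the InChI among them) are fixed
    identifier fields, and every column after them holds synonyms.
    """
    fields = line.rstrip("\n").split("\t")
    fixed, names = fields[:n_fixed], fields[n_fixed:]
    synonyms = []
    for name in names:
        # Split on ';' only here, after the identifier columns are set aside,
        # and drop empty pieces left by a trailing ';'.
        synonyms.extend(p.strip() for p in name.split(";") if p.strip())
    return fixed, synonyms


# Made-up record: one identifier column followed by two synonym columns.
record = "InChI=1S/CH2.Co/h1H2;/q-1;+1\tsome name\tsynonym a;synonym b;\n"
fixed, synonyms = split_record(record, n_fixed=1)
```

This does not address the other concern raised above: a name that legitimately contains ';' would still be split.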

CalebBell commented 3 years ago

Hi Andrés, I have fixed the chemical-metadata repository a lot, and generated a new inorganic file without this particular issue. I attached it.

The hard part is that the online data has changed so much that I can't even use a diff program to see what changed. Because of that, it's hard to replace the current file with the new one. Do you want to look at it?

Sincerely, Caleb

Inorganic db.csv

longemen3000 commented 3 years ago

Hi Caleb,

Given the old and new versions, I could program a manual diff to see what's changed. I'm going to start with this and let you know what I find.
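
A manual diff of this kind can be sketched with set differences keyed on a stable identifier such as the InChI or the formula. The Python below is only an illustration under that assumption. `diff_keys` is a hypothetical helper and the keys are toy values, not a claim about what either file contains.

```python
def diff_keys(old_keys, new_keys):
    """Return (only_in_old, only_in_new) as sorted lists."""
    old, new = set(old_keys), set(new_keys)
    return sorted(old - new), sorted(new - old)


# Toy keys for illustration only.
old_formulas = ["H4Si", "ClTl", "NaCl"]
new_formulas = ["H2O", "CdF2", "NaCl"]
only_old, only_new = diff_keys(old_formulas, new_formulas)
```

The set subtraction `new - old` here plays the same role as Julia's `setdiff(set_new, set_old)`.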

longemen3000 commented 3 years ago

From a preliminary parse: there are more synonyms compared to the old database:

Old

julia> CC.load_db!(:inorganic_old2)
[ Info: :inorganic_old2 arrow file not generated, processing...
syms_i = 6326 # number of synonyms
syms_unique = 6325 # unique elements (one element is repeated that I have yet to find)
(Arrow.Table with 153 rows, 9 columns, and schema:
.....

New

julia> CC.load_db!(:inorganic_new)
[ Info: :inorganic_new database file not found, downloading from https://github.com/CalebBell/chemicals/files/6912649/Inorganic.db.csv       
[ Info: :inorganic_new database file downloaded.
[ Info: :inorganic_new arrow file not generated, processing...
syms_i = 9461
syms_unique = 9438
(Arrow.Table with 164 rows, 9 columns, and schema:
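
The gap between the total and unique synonym counts points at repeated entries. One way to locate them, sketched in Python with a made-up list of synonyms (`find_repeated` is a hypothetical helper, not part of either package):

```python
from collections import Counter


def find_repeated(synonyms):
    """Return the synonyms that occur more than once, with their counts."""
    counts = Counter(synonyms)
    return {name: n for name, n in counts.items() if n > 1}


# Made-up sample list, not taken from the database.
sample = ["water", "aquafina", "distilled water", "water"]
repeated = find_repeated(sample)
```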

comparing the differences, by InChI:

InChI contained in the old database, not present in the new database

  "InChI=1S/CH2.Co/h1H2;/q-1;+1"
  "InChI=1S/Cr.2H2Si/h;2*1H2"
  "InChI=1S/H4Si/h1H4"
  "InChI=1S/Na.H3O4P/c;1-5(2,3)4/h;(H3,1,2,3,4…
  "InChI=1S/F6Si.2H3N/c1-7(2,3,4,5)6;;/h;2*1H3…
  "InChI=1S/Bi.2ClH.2H/h;2*1H;;/q+2;;;;/p-2"
  "InChI=1S/Al.Na.2O.2H/q-1;+1;;;;"
  "InChI=1S/BrHO3.Cs/c2-1(3)4;/h(H,2,3,4);/q;+…
  "InChI=1S/2Na.H3O4P/c;;1-5(2,3)4/h;;(H3,1,2,…
  "InChI=1S/2BH2.Ti/h2*1H2;"
  "InChI=1S/F6Si.2Na/c1-7(2,3,4,5)6;;/q-2;2*+1"
  ""
  "InChI=1S/2Na.3H2O4S/c;;3*1-5(2,3)4/h;;3*(H2…

InChI contained in the new database, not present in the old database

  "InChI=1S/Cl2S2/c1-3-4-2"
  "InChI=1S/O.Pr"
  "InChI=1S/Bi.2ClH/h;2*1H/q+2;;/p-2"
  "InChI=1S/Cr.2Si"
  "InChI=1S/C32H16N8.Cu/c1-2-10-18-17(9-1)25-33-26(18)38-28-21-13-5-6-14-22(21)30(35-28)40-32-24-…
  "InChI=1S/Al.Na.2O/q-1;+1;;"
  "InChI=1S/Al.La.O"
  "InChI=1S/2B.Ti"
  "InChI=1S/Na.H3O4P/c;1-5(2,3)4/h;(H3,1,2,3,4)"
  "InChI=1S/C.Co/q-1;+1"
  "InChI=1S/3O.2Yb/q3*-2;2*+3"
  "InChI=1S/2HI.Sm/h2*1H;/q;;+2/p-2"
  "InChI=1S/3ClH.Ru/h3*1H;/q;;;+3/p-3"
  "InChI=1S/H2O/h1H2"
  "InChI=1S/2B.Zr"
  "InChI=1S/10CO.2Re/c10*1-2;;"
  "InChI=1S/H3NO.H2O4S/c1-2;1-5(2,3)4/h2H,1H2;(H2,1,2,3,4)"
  "InChI=1S/Li.H"
  "InChI=1S/Na.H2O4S/c;1-5(2,3)4/h;(H2,1,2,3,4)"
  "InChI=1S/C.2W/q+1;;-1"
  "InChI=1S/6Al.2O2Si.9O/c;;;;;;2*1-3-2;;;;;;;;;"
  "InChI=1S/B.Li.O"
  "InChI=1S/Cd.2FH/h;2*1H/q+2;;/p-2"

longemen3000 commented 3 years ago

doing the same thing with the formulas:

julia> setdiff(set_new,set_old)
Set{String} with 21 elements:
  "Cl3Ru"
  "O3Yb2"
  "H2O" # water is in the new inorganic database
  "AlLaO"
  "I2Sm"
  "B2Zr"
  "H3NaO4P"
  "HLi"
  "Al6O13Si2"
  "Cl2S2"
  "As2H12O3"
  "CW2"
  "C32H16CuN8"
  "OPr"
  "ClH2Tl"
  "H5NO5S"
  "C10O10Re2"
  "BLiO"
  "H2NaO4S"
  "BrH2Tl"
  "CdF2"
julia> setdiff(set_old,set_new)
Set{String} with 11 elements:
  "HNa2O4P"
  "ClTl"
  "H4Si"
  "H4Na2O12S3"
  "As2O3"
  "BrCsO3"
  "BrTl"
  "H2NaO4P"
  "F6H8N2Si"
  "F6Na2Si"
  "D2Se"