Open longemen3000 opened 3 years ago
Hi Andrés, Like all software not maintained, bits and pieces of the chemicals-metadata repository have rotted away. I cannot get the inchi module in rdkit to work for me, and I am having issues building rdkit. Thanks for letting me know about the issue. I'm afraid we may have to manually patch the file for now. Sincerely, Caleb
Hi Andrés, I found a version of rdkit which works on linux - and it's on pypi! One step closer to being able to update the database again. I think I actually need to port chemical-metadata to Python 3 as well.
Sincerely, Caleb
What do you think of adding ';' as an additional separator? The main problem would be checking whether other names actually have ';' as part of their name.
maybe adding:
line = line.replace(';','\t')
before this line https://github.com/CalebBell/chemicals/blob/c5b1014d42216eaa93bf2fd46aec2d35beb82b8e/chemicals/identifiers.py#L370 could solve the problem temporarily?
Also, I noticed (from a quick look, nothing exhaustive) that those synonyms separated by ';' are always at the end of the list.
Edit: the split on ';' must always be done after parsing the InChI, because InChI strings themselves contain ';'.
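A minimal sketch of that ordering, assuming each record is one tab-separated line whose leading columns (identifiers, formula, InChI and so on) come before the synonym columns. The function name `split_db_line`, the column count `n_fixed`, and the example record are all hypothetical and are not taken from `identifiers.py`:

```python
def split_db_line(line, n_fixed=9):
    """Split one tab-separated record into fixed fields and synonyms.

    n_fixed is an assumed number of leading columns that must never
    be split on ';' (the InChI column is one of them).
    """
    fields = line.rstrip("\n").split("\t")
    fixed, names = fields[:n_fixed], fields[n_fixed:]
    synonyms = []
    for name in names:
        # Only the trailing name columns are split on ';'.
        # Empty pieces (e.g. from a trailing ';') are dropped.
        synonyms.extend(p.strip() for p in name.split(";") if p.strip())
    return fixed, synonyms


# Placeholder record: only the InChI string comes from this thread.
example = "\t".join([
    "0", "0-00-0", "CCo", "0.0", "smiles",
    "InChI=1S/C.Co/q-1;+1", "key", "iupac name", "common name",
    "name a;name b;",
])
fixed, synonyms = split_db_line(example)
print(fixed[5])   # the InChI keeps its ';'
print(synonyms)   # the trailing names are split apart
```

This does not address the open question of names that legitimately contain ';'. Those would still be split and would need a separate check.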
Hi Andrés, I have fixed the chemical-metadata repository a lot, and generated a new inorganic file without this particular issue. I attached it.
The hard part is that the online data has changed so much that I can't even use a diff program to see what changed. Because of that, it's hard to replace the current file with the new one. Do you want to look at it?
Sincerely, Caleb
Hi Caleb,
Given the old and new versions, I could program a manual diff to see what's changed. I'm going to start with this and let you know what I find.
From a preliminary parsing, there are more synonyms compared to the old database:
julia> CC.load_db!(:inorganic_old2)
[ Info: :inorganic_old2 arrow file not generated, processing...
syms_i = 6326 # number of synonyms
syms_unique = 6325 # unique elements (there is one repeated element that I have yet to find)
(Arrow.Table with 153 rows, 9 columns, and schema:
.....
julia> CC.load_db!(:inorganic_new)
[ Info: :inorganic_new database file not found, downloading from https://github.com/CalebBell/chemicals/files/6912649/Inorganic.db.csv
[ Info: :inorganic_new database file downloaded.
[ Info: :inorganic_new arrow file not generated, processing...
syms_i = 9461
syms_unique = 9438
(Arrow.Table with 164 rows, 9 columns, and schema:
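The gap between `syms_i` and `syms_unique` (1 in the old file, 23 in the new one) means some synonyms occur more than once. A small sketch of one way to list them, assuming the synonyms have already been collected into a flat list. The helper name `duplicated` and the sample list are hypothetical, and the case-insensitive comparison is an assumption that may not match how the counts above were computed:

```python
from collections import Counter


def duplicated(synonyms):
    """Return each synonym that appears more than once, with its count."""
    # Assumption: names are compared after trimming and lower-casing.
    counts = Counter(s.strip().lower() for s in synonyms)
    return {name: n for name, n in counts.items() if n > 1}


# Made-up sample, not taken from the database.
sample = ["water", "Water", "oxidane", "lithium hydride"]
dupes = duplicated(sample)
print(dupes)
```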
comparing the differences, by InChI:
"InChI=1S/CH2.Co/h1H2;/q-1;+1"
"InChI=1S/Cr.2H2Si/h;2*1H2"
"InChI=1S/H4Si/h1H4"
"InChI=1S/Na.H3O4P/c;1-5(2,3)4/h;(H3,1,2,3,4…
"InChI=1S/F6Si.2H3N/c1-7(2,3,4,5)6;;/h;2*1H3…
"InChI=1S/Bi.2ClH.2H/h;2*1H;;/q+2;;;;/p-2"
"InChI=1S/Al.Na.2O.2H/q-1;+1;;;;"
"InChI=1S/BrHO3.Cs/c2-1(3)4;/h(H,2,3,4);/q;+…
"InChI=1S/2Na.H3O4P/c;;1-5(2,3)4/h;;(H3,1,2,…
"InChI=1S/2BH2.Ti/h2*1H2;"
"InChI=1S/F6Si.2Na/c1-7(2,3,4,5)6;;/q-2;2*+1" ""
"InChI=1S/2Na.3H2O4S/c;;3*1-5(2,3)4/h;;3*(H2…
"InChI=1S/Cl2S2/c1-3-4-2"
"InChI=1S/O.Pr"
"InChI=1S/Bi.2ClH/h;2*1H/q+2;;/p-2"
"InChI=1S/Cr.2Si"
"InChI=1S/C32H16N8.Cu/c1-2-10-18-17(9-1)25-33-26(18)38-28-21-13-5-6-14-22(21)30(35-28)40-32-24-…
"InChI=1S/Al.Na.2O/q-1;+1;;"
"InChI=1S/Al.La.O"
"InChI=1S/2B.Ti"
"InChI=1S/Na.H3O4P/c;1-5(2,3)4/h;(H3,1,2,3,4)"
"InChI=1S/C.Co/q-1;+1"
"InChI=1S/3O.2Yb/q3*-2;2*+3"
"InChI=1S/2HI.Sm/h2*1H;/q;;+2/p-2"
"InChI=1S/3ClH.Ru/h3*1H;/q;;;+3/p-3"
"InChI=1S/H2O/h1H2"
"InChI=1S/2B.Zr"
"InChI=1S/10CO.2Re/c10*1-2;;"
"InChI=1S/H3NO.H2O4S/c1-2;1-5(2,3)4/h2H,1H2;(H2,1,2,3,4)"
"InChI=1S/Li.H"
"InChI=1S/Na.H2O4S/c;1-5(2,3)4/h;(H2,1,2,3,4)"
"InChI=1S/C.2W/q+1;;-1"
"InChI=1S/6Al.2O2Si.9O/c;;;;;;2*1-3-2;;;;;;;;;"
"InChI=1S/B.Li.O"
"InChI=1S/Cd.2FH/h;2*1H/q+2;;/p-2"
doing the same thing with the formulas:
julia> setdiff(set_new,set_old)
Set{String} with 21 elements:
"Cl3Ru"
"O3Yb2"
"H2O" # water is in the new inorganics database
"AlLaO"
"I2Sm"
"B2Zr"
"H3NaO4P"
"HLi"
"Al6O13Si2"
"Cl2S2"
"As2H12O3"
"CW2"
"C32H16CuN8"
"OPr"
"ClH2Tl"
"H5NO5S"
"C10O10Re2"
"BLiO"
"H2NaO4S"
"BrH2Tl"
"CdF2"
julia> setdiff(set_old,set_new)
Set{String} with 11 elements:
"HNa2O4P"
"ClTl"
"H4Si"
"H4Na2O12S3"
"As2O3"
"BrCsO3"
"BrTl"
"H2NaO4P"
"F6H8N2Si"
"F6Na2Si"
"D2Se"
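The same kind of manual diff can be sketched in Python, since a line-based diff tool is not useful when most records have changed. This is only an illustration: the function names, the two-column toy layout, and the `column` index are assumptions, and the sample lines are not real database records:

```python
def column_values(lines, column):
    """Collect the non-empty values of one tab-separated column."""
    values = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > column and fields[column]:
            values.add(fields[column])
    return values


def diff_by_column(old_lines, new_lines, column):
    """Return (only in new, only in old) for the chosen key column."""
    old = column_values(old_lines, column)
    new = column_values(new_lines, column)
    return sorted(new - old), sorted(old - new)


# Toy input with a hypothetical layout: index, then formula.
old_lines = ["1\tBrTl\n", "2\tH4Si\n"]
new_lines = ["1\tBrH2Tl\n", "2\tH4Si\n", "3\tH2O\n"]
added, removed = diff_by_column(old_lines, new_lines, column=1)
print(added)
print(removed)
```

Keying on the InChI or formula column rather than on line position keeps the comparison independent of record order and of changes in the synonym lists.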
What is the search string?
caustic soda liquid;aquafina;distilled water;hydrogen oxide (h2o);ultrexii ultrapure;
Which chemical in the database do you believe should be found?
It's water, but the separators here are wrong.