cheminfo / wikipedia

Wikipedia chemical structure explorer
https://wikipedia.cheminfo.org

Why are particular SMILES not downloaded from Wikipedia? #29

Closed: baoilleach closed this issue 9 years ago

baoilleach commented 9 years ago

I've been cross-checking some data I've extracted from Wikipedia against the dump file in your GitHub repo and noticed some discrepancies.

For example, SMILES data from the drugbox on https://en.wikipedia.org/wiki/Lanreotide or the chembox on https://en.wikipedia.org/wiki/Hemorphin-4 is not included in the dump file.

Is there some reason for this or is it a bug?

targos commented 9 years ago

It seems that the SMILES does not represent the aromaticity correctly and cannot be parsed, which is why the molecule does not appear. It should be in the list of errors available from http://www.cheminfo.org/wikipedia. We don't know why so many SMILES use lowercase atom symbols to describe aromaticity, which always causes trouble. To solve the problem, you just need to enter the SMILES with localized double bonds. The update runs nightly.

baoilleach commented 9 years ago

Gotcha - the Trp indole nitrogen is written as n instead of [nH] in each of these cases. I'll fix them as I find them.
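
To illustrate why a bare aromatic `n` in a five-membered ring cannot be parsed, here is a toy sketch in plain Python. It is not the parser used by the Wikipedia explorer and not a substitute for a real cheminformatics toolkit. It assumes an unsubstituted single ring written with one ring-closure digit `1` and only the atoms `c`, `n`, `o`, `s` and `[nH]`, so it covers pyrrole as a stand-in for the fused indole ring of Trp. The function names `ring_atoms` and `can_kekulize` are hypothetical.

```python
import re

# Toy subset: one unsubstituted aromatic ring closed with the digit 1.
ATOM = r"(?:\[nH\]|[cnos])"
RING = re.compile(rf"{ATOM}1{ATOM}+1")


def ring_atoms(smiles):
    """Return the ring atoms of a toy monocyclic aromatic SMILES, in order."""
    if not RING.fullmatch(smiles):
        raise ValueError(f"outside the toy subset: {smiles!r}")
    return re.findall(ATOM, smiles)


def can_kekulize(smiles):
    """Check whether localized double bonds can be assigned around the ring.

    Aromatic c and bare n must each take part in exactly one ring double
    bond. [nH], o and s contribute a lone pair instead and take part in none.
    """
    atoms = ring_atoms(smiles)
    needs_double = [a in ("c", "n") for a in atoms]
    if all(needs_double):
        # Every atom must be paired with a neighbour: ring size must be even.
        return len(atoms) % 2 == 0
    # Rotate so the ring starts at a lone-pair donor, then require every
    # run of double-bond atoms between donors to have even length.
    start = needs_double.index(False)
    rotated = needs_double[start:] + needs_double[:start]
    run = 0
    for flag in rotated:
        if flag:
            run += 1
        else:
            if run % 2:
                return False
            run = 0
    return run % 2 == 0


for smi in ("c1cc[nH]c1", "c1ccnc1", "c1ccncc1"):
    print(smi, can_kekulize(smi))
```

In this sketch, pyrrole written as `c1cc[nH]c1` passes, the same ring written as `c1ccnc1` fails because five atoms cannot all be paired into double bonds, and pyridine `c1ccncc1` passes because a bare `n` is valid in a six-membered ring.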

lpatiny commented 9 years ago

Great! Thanks!

peter-ertl commented 9 years ago

As mentioned in our article in J. Cheminformatics, the "pyrrole nitrogen" problem was clearly the most common error in Wikipedia SMILES, occurring more than 350 times. Many of these errors have been fixed by the project team, but many still remain. Best, Peter

baoilleach commented 9 years ago

Yes - I should have read the paper properly. I've since discovered the source of those SMILES - maybe we can discuss offline.