dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

corrupted SMILES #408

Open StefanSenger opened 8 years ago

StefanSenger commented 8 years ago

When comparing SMILES in DBpedia with the SMILES in Wikipedia, I noticed that all of the ones I looked at seem to be truncated. It looks as if this happens at the first appearance of a ( character. For example, the SMILES string for Sildenafil on https://en.wikipedia.org/wiki/Sildenafil is "CN1CCN(S(=O)(C2=CC=C(OCC)C(C3=NC4=C(N(C)N=C4CCC)C(N3)=O)=C2)=O)CC1" whereas it is "CN1CCNCC1" in DBpedia (http://dbpedia.org/page/Sildenafil).

To see whether this is a systematic error, I ran a SPARQL query to retrieve pages with SMILES and searched for the character ( in the SMILES. Of the 4774 SMILES strings I retrieved, none contained a ( character, which supports my assumption that SMILES are truncated at the first appearance of this character when they are extracted from Wikipedia.

To gather further evidence, I tried to calculate the molecular weight from the SMILES in DBpedia and compare it with the molecular weight given in DBpedia. This worked for 1871 molecules. For 1834 of them the calculated molecular weight was smaller than the molecular weight from DBpedia, which again strongly indicates that the SMILES have been truncated.

Would it be possible to look at the process used to derive the SMILES from Wikipedia, to see whether there is a step that might cause this truncation and whether it can be fixed? If this causes difficulties we might be able to help, since we have developed a workflow that extracts the SMILES directly from Wikipedia.
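The Sildenafil pair quoted above can be checked with a few lines of Python. This is only a sketch on that single example, and the two helper functions are hypothetical names made up for illustration; they are not taken from the extraction framework. It compares two possible explanations: cutting the string at the first ( character, and dropping every parenthesised branch while keeping the text outside the parentheses. Both would leave no ( in the result and both would lower the calculated molecular weight.

```python
def cut_at_first_paren(smiles: str) -> str:
    """Keep only the text before the first '(' (hypothetical helper)."""
    return smiles.split("(", 1)[0]


def drop_parenthesised_groups(smiles: str) -> str:
    """Remove every parenthesised branch, keeping only characters
    that sit outside all parentheses (hypothetical helper)."""
    depth = 0
    kept = []
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        elif depth == 0:
            kept.append(ch)
    return "".join(kept)


# The two strings quoted in this issue.
wikipedia = "CN1CCN(S(=O)(C2=CC=C(OCC)C(C3=NC4=C(N(C)N=C4CCC)C(N3)=O)=C2)=O)CC1"
dbpedia = "CN1CCNCC1"

print(cut_at_first_paren(wikipedia) == dbpedia)
print(drop_parenthesised_groups(wikipedia) == dbpedia)
```

For this one example, the DBpedia value matches the second helper (branches dropped) and not a plain cut at the first (. Whether that holds for the other compounds would need checking against the full result set.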

VladimirAlexiev commented 8 years ago

The literal extractor has a bunch of heuristics that try to find value(s) amongst a bunch of free text. The heuristics are imperfect.

But I think it's best to implement a specific SMILES extractor. Could you contribute:

Cheers!

StefanSenger commented 8 years ago

Hi Vladimir, thanks for your response. I think it would be really good if a specific SMILES extractor could be implemented. Sadly, I can't contribute the source of the workflow that I mentioned, since I am using a proprietary tool (Pipeline Pilot from BIOVIA), so I assume it wouldn't be any good to you, right? However, if you think it might be useful, I can summarise the steps that I used to extract the SMILES.

I have just re-run the workflow for all pages that contain either a Drugbox or a Chembox and extracted the SMILES, which worked OK. As you mentioned in your response, the Wikipedia property (across different languages) should be either SMILES or smiles. Having said that, a Chembox can contain more than one SMILES string (up to six), and they will then be called SMILES, SMILES1, SMILES2 etc. It's debatable how much sense it makes to have more than one SMILES string for a given compound, but there you go.

There shouldn't be any extra junk present apart from the SMILES string. However, if you would like me to check, I can have a look at my results. Also, if it is of any use to you, I can share the output of my workflow with the SMILES that I have extracted. If so, please just let me know the best way to share it.
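To make the property naming concrete, here is a minimal Python sketch of picking up SMILES, smiles, SMILES1, SMILES2 etc. from infobox wikitext. It is not the Pipeline Pilot workflow and not the extraction framework's code. The function name, the regular expression and the sample wikitext are all assumptions made for illustration, and the sketch assumes one `| name = value` parameter per line with no whitespace inside the value.

```python
import re

# Assumption: one "| name = value" parameter per line, value without spaces.
# Matches SMILES / smiles, optionally followed by a number (SMILES1, SMILES2, ...).
SMILES_PARAM = re.compile(
    r"^[ \t]*\|[ \t]*(smiles\d*)[ \t]*=[ \t]*(\S+)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_smiles(wikitext: str) -> dict:
    """Return {parameter name: SMILES string}, taking the value verbatim
    so that '(' and ')' are preserved (hypothetical helper)."""
    return {name: value for name, value in SMILES_PARAM.findall(wikitext)}


# Made-up, simplified sample; real Chembox/Drugbox markup is more complex.
sample = """{{Chembox
| Name = Ethanol
| SMILES = CCO
| SMILES1 = OCC
}}"""

print(extract_smiles(sample))
```

Taking the whole value after the = sign, rather than running it through a general-purpose literal heuristic, is what keeps branch parentheses intact in this sketch.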