dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

ChemBox extractor does not copy the identifiers? #54

Open egonw opened 11 years ago

egonw commented 11 years ago

If I look at http://live.dbpedia.org/page/Azulene I do not see the extracted InChI, CAS registry number, or PubChem compound ID. What is the reason for this?

(Is it possible to run the extractor on a single Wikipedia page, removing the need for a full-fledged MW/MySQL installation?)

jcsahnwaldt commented 11 years ago

> (Is it possible to run the extractor on a single Wikipedia page, removing the need for a full-fledged MW/MySQL installation?)

http://mappings.dbpedia.org/server/extraction/en/

Example:

http://mappings.dbpedia.org/server/extraction/en/extract?title=Azulene

It's only the mapping based extractor and the label extractor though.
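
As a rough sketch of how that endpoint could be called from code: the base URL and the `title` parameter are taken from the example above, while the object name, the helper names, and the space-to-underscore handling are assumptions for illustration. Whether the server is still reachable is not something this thread confirms.

```scala
import java.net.URLEncoder
import scala.io.Source

// Hypothetical helper, not part of the extraction framework.
object ExtractionUrl {
  // Base URL as given in the comment above.
  val base = "http://mappings.dbpedia.org/server/extraction/en/extract"

  // Build the request URL for one page title.
  // Assumption: spaces become underscores, as in Wikipedia page URLs.
  def forTitle(title: String): String =
    base + "?title=" + URLEncoder.encode(title.replace(' ', '_'), "UTF-8")

  // Download the server's response as a string (requires network access).
  def fetch(title: String): String = {
    val source = Source.fromURL(forTitle(title), "UTF-8")
    try source.mkString finally source.close()
  }
}
```

`ExtractionUrl.forTitle("Azulene")` reproduces the example URL above; `ExtractionUrl.fetch("Azulene")` would return whatever the mapping-based and label extractors produce for that page.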

egonw commented 11 years ago

OK, cool. When you say that "only the mapping based extractor..." is supported, that means all in this file:

https://raw.github.com/dbpedia/extraction-framework/live/mappings/Mapping_en.xml

Right?

And can I also run that "extract" script locally easily, with a patched version of this Mapping_en.xml ?

jcsahnwaldt commented 11 years ago

Looks like the identifiers were moved to a sub-template: http://en.wikipedia.org/wiki/Template:Chembox_Identifiers The mapping-based extractor only extracts stuff from the main template.

egonw commented 11 years ago

OK, got it. That's not an easy fix, I assume...

jcsahnwaldt commented 11 years ago

It would be an easy fix: add mappings for Chembox_Identifiers and a few others to the mappings wiki, and change a few lines in MappingExtractor.scala. Currently, we only extract data from the top-level template and ignore nested templates. It wouldn't be hard to change that. I just don't know what such a change might break.

jcsahnwaldt commented 11 years ago

As for running the extraction locally: Mapping_en.xml is a copy of the mappings on the wiki and not up to date, but depending on the configuration, the extractor downloads the current mappings anyway. It's not hard to run the extraction locally. The main problem may be that it's currently not easy to configure the extraction to extract just a few pages from Wikipedia. We usually run the extraction on the whole dump (millions of pages). The configuration is rather inflexible, so it's hard to change the desired page source.

egonw commented 11 years ago

"It would be an easy fix:"... oh, then please enable processing the Chembox_Identifiers... we'll see from the existing mappings if that works, and then I could simply focus on the additional mappings.

What I understand is that I should edit the wiki page rather than the Mapping_en.xml? But at the moment I cannot edit the wiki. I just created an account http://mappings.dbpedia.org/index.php/User:Egonw

Additional identifiers I am interested in include the InChIKey, Standard InChI, Standard InChIKey, and ChemSpider ID, but I guess I would just add all those defined in the Chembox_Identifiers template.

My interest comes from my involvement in BridgeDB and cheminformatics in general.

ninniuz commented 10 years ago

This would be an interesting enhancement.

Approaches I see are:
1) Recursively analyse TemplateNode's children and add quads from sub-templates (they would be assigned to the same root resource URI in case the sub-templates are mapped to the same class as the root template node)
2) Create a new PropertyMapping type (e.g. PropertyTemplateMapping) which defines how to map template properties whose value is a template itself
3) Extend the current PropertyMapping with a recurse property which tells the MappingExtractor to look for the specified templateProperty in any of the template's children

1) is very easy to implement but could potentially break something (not 100% sure).
2) and 3) should take about the same effort to develop, but 2) would require changes on the mappings server as well (define a new Template).
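
To make approach 3 a bit more concrete, here is a minimal toy sketch. None of these types are the framework's real classes: `Template`, `PropertyMapping`, the `recurse` flag, and the `casNumber` property name are hypothetical stand-ins, and the sample values are only illustrative. The idea is that a property is looked up on the template itself first, and only searched for in child templates when the mapping asks for it.

```scala
// Toy model only; hypothetical types, not the extraction framework's API.
object RecurseFlagSketch {
  final case class Template(
    name: String,
    properties: Map[String, String],
    children: List[Template] = Nil
  )

  // `recurse` is the flag proposed in approach 3.
  final case class PropertyMapping(
    templateProperty: String,
    ontologyProperty: String,
    recurse: Boolean = false
  )

  // Look on the template itself first; descend into children only if `recurse` is set.
  def lookup(t: Template, m: PropertyMapping): Option[String] =
    t.properties.get(m.templateProperty).orElse(
      if (m.recurse) t.children.iterator.map(lookup(_, m)).collectFirst { case Some(v) => v }
      else None
    )

  // Illustrative input: a Chembox with its identifiers in a nested template.
  val chembox: Template = Template(
    "Chembox",
    Map("ImageFile" -> "Azulene.svg"),
    List(Template("Chembox Identifiers", Map("CASNo" -> "275-51-4")))
  )
}
```

With `PropertyMapping("CASNo", "casNumber", recurse = true)` the lookup finds the value in the nested template; with the default `recurse = false` it returns `None`, which matches the behaviour reported in this issue.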

@jcsahnwaldt what do you think?

egonw commented 10 years ago

On Mon, Nov 11, 2013 at 5:34 PM, Andrea Di Menna notifications@github.com wrote:

> 1) is very easy to implement but could potentially break something (not 100% sure) 2-3) should take same effort to develop but 2) would require changes on the mappings server as well (define a new Template)

I'd be more than happy to provide test data for ChemBoxes and child templates.

I do not know the code base, but given the right template, I could possibly contribute to extracting identifiers from these ChemBoxes. The learning curve of the DBpedia extraction system has stopped me from making patches so far... that is, I have no clue how to run the code locally to test any patch I'd write...

Egon

ninniuz commented 10 years ago

@jimkont what do you think about 1)?

jimkont commented 10 years ago

We already implemented something similar to (1) for separate templates; maybe it makes sense to do it for nested ones too. I don't think it will break anything, but I'm not 100% sure.

Kontokostas Dimitris

ninniuz commented 10 years ago

Should we use the main resource URI in case a child template is mapped to the same class as the root template, or should we apply the subclass/superclass approach used in #4?

egonw commented 10 years ago

In the case of chemicals, templates are used to annotate properties of the central class...

E.g. for Azulene, the CAS/KEGG/ChemSpider/PubChem identifiers do not define subclasses, but are "properties" of azulene itself.

ninniuz commented 10 years ago

Thanks @egonw :-) I was wondering how to process such templates when they are included in a main template (the Chembox example fits very well).

@jimkont I think simply replacing

if(graph.isEmpty)
{
  node.children.flatMap(child => extractNode(child, subjectUri, pageContext))
}
else
{
  graph
}

with

graph ++ node.children.flatMap(child => extractNode(child, subjectUri, pageContext))

in org.dbpedia.extraction.mappings.MappingExtractor#extractNode is enough.

We are simply going to collect potentially mappable subtemplates/subtables and delegate the quad creation to the Mapping instance (i.e. TemplateMapping, which already handles subclass/superclass or new URI creation, or TableMapping).
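
The effect of that one-line change can be illustrated on a toy tree. This is only a sketch under simplifying assumptions: `Node`, `Quad`, `mapNode`, and the `mapped` set are hypothetical stand-ins for the framework's node, quad, and mapping types, the real `extractNode` also takes a `pageContext`, and the sample values are illustrative.

```scala
// Toy model only; hypothetical types, not the extraction framework's API.
object NestedTemplateSketch {
  final case class Node(name: String, properties: Map[String, String], children: List[Node] = Nil)
  final case class Quad(subject: String, predicate: String, value: String)

  // Assumption: both templates have a mapping on the mappings wiki.
  val mapped: Set[String] = Set("Chembox", "Chembox Identifiers")

  // Stand-in for the Mapping instance: one quad per property of a mapped template.
  def mapNode(node: Node, subjectUri: String): List[Quad] =
    if (mapped(node.name))
      node.properties.toList.sorted.map { case (k, v) => Quad(subjectUri, k, v) }
    else Nil

  // Existing logic: descend into children only when the node itself yields nothing.
  def extractFallback(node: Node, subjectUri: String): List[Quad] = {
    val graph = mapNode(node, subjectUri)
    if (graph.isEmpty) node.children.flatMap(extractFallback(_, subjectUri)) else graph
  }

  // Proposed logic: always append the quads extracted from the children.
  def extractRecursive(node: Node, subjectUri: String): List[Quad] =
    mapNode(node, subjectUri) ++ node.children.flatMap(extractRecursive(_, subjectUri))

  // Illustrative input: identifiers live in a nested template.
  val azulene: Node = Node(
    "Chembox",
    Map("ImageFile" -> "Azulene.svg"),
    List(Node("Chembox Identifiers", Map("CASNo" -> "275-51-4")))
  )
}
```

In this toy model `extractFallback(azulene, "Azulene")` yields only the `ImageFile` quad, because the top-level template already produced a non-empty graph, while `extractRecursive(azulene, "Azulene")` also yields the `CASNo` quad on the same subject.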

jimkont commented 10 years ago

I think we should try #4

ninniuz commented 10 years ago

My previous comment uses #4 (as a side effect of delegating quad building to TemplateMapping). Can you be more specific? :D