dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

not all resources of an ObjectProperty map are extracted #462

Open VladimirAlexiev opened 8 years ago

VladimirAlexiev commented 8 years ago

I made a mapping for ConcentrationCamp, see testing here: http://mappings.dbpedia.org/server/mappings/en/extractionSamples/Mapping_en:Infobox_concentration_camp

Consider https://en.wikipedia.org/w/index.php?title=Auschwitz_concentration_camp&action=edit and this field:

| operated by = The Nazi ''[[Schutzstaffel]]'' (SS), the [[Soviet Union|Soviet]] [[NKVD]] (after [[World War II]])

The extraction result is:

dbo:operator dbr:Schutzstaffel, dbr:Soviet_Union

Why are NKVD (and WW2) missing from the list? http://dbpedia.org/resource/NKVD does exist, and dbo:operator is defined without domain/range restrictions.

VladimirAlexiev commented 8 years ago

Same for https://en.wikipedia.org/w/index.php?title=Bełżec_extermination_camp&action=edit:

| known for    = [[Genocide]] during [[The Holocaust]]

results in

http://en.dbpedia.org/resource/Bełżec_extermination_camp  dbo:knownFor    
  http://en.dbpedia.org/resource/Genocide

but where is dbr:The_Holocaust?

jimkont commented 8 years ago

The property values are split according to some regexes defined here: https://github.com/dbpedia/extraction-framework/blob/5699d6eb2e111c89268d0f4526c15b1130f33aa2/core/src/main/scala/org/dbpedia/extraction/config/dataparser/DataParserConfig.scala#L25-L25

VladimirAlexiev commented 8 years ago

You mean some part of the value is thrown away?

jimkont commented 8 years ago

I see that we miss good triples, but the splitting also ensures that some not-so-good ones are skipped, e.g. [[World War II]] in the first example.

Any ideas on how to overcome this?

VladimirAlexiev commented 8 years ago

Can you explain the logic of this regex?

jimkont commented 8 years ago

So, after we split the property value, we take the first available link from each piece. E.g. The Nazi ''[[Schutzstaffel]]'' (SS), the [[Soviet Union|Soviet]] [[NKVD]] (after [[World War II]]) is broken into:

jimkont commented 8 years ago

If there were an "and" in the [[Soviet Union|Soviet]] [[NKVD]] (after [[World War II]]), e.g. the [[Soviet Union|Soviet]] and [[NKVD]] (after [[World War II]]),

then we would also take [[NKVD]]
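
To make the behaviour concrete, here is a minimal Python sketch of the split-then-take-first-link logic described above. It is not the framework's Scala code: the separator pattern SPLIT_RE is a hypothetical stand-in for the regex in DataParserConfig.scala (which may differ), and first_link_per_piece is a name made up for illustration.

```python
import re

# Hypothetical separator pattern, for illustration only; the real regex lives
# in DataParserConfig.scala and may contain different separators.
SPLIT_RE = re.compile(r",|;| and | or ")

# Matches [[Target]] or [[Target|label]] and captures Target.
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def first_link_per_piece(value):
    """Split the property value, then keep only the first wiki link of each piece."""
    targets = []
    for piece in SPLIT_RE.split(value):
        match = LINK_RE.search(piece)
        if match:
            targets.append(match.group(1).strip().replace(" ", "_"))
    return targets


original = ("The Nazi ''[[Schutzstaffel]]'' (SS), the [[Soviet Union|Soviet]] "
            "[[NKVD]] (after [[World War II]])")
with_and = ("The Nazi ''[[Schutzstaffel]]'' (SS), the [[Soviet Union|Soviet]] "
            "and [[NKVD]] (after [[World War II]])")

print(first_link_per_piece(original))  # ['Schutzstaffel', 'Soviet_Union']
print(first_link_per_piece(with_and))  # ['Schutzstaffel', 'Soviet_Union', 'NKVD']
```

Under these assumptions the sketch reproduces both reported results: NKVD is dropped because it is the second link in its piece, and [[Genocide]] during [[The Holocaust]] yields only Genocide because the value contains no separator at all.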

VladimirAlexiev commented 8 years ago

But what is the logic that a space in X Y should throw away Y? What if someone forgot a comma in a list?

VladimirAlexiev commented 8 years ago

It's very hard to suggest how to improve this regexp without documentation of the original rationale for having it. I would remove it altogether, then analyze what mis-hits this produces, then add it again with specific justification for every throw-away pattern.
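
A rough way to start such an analysis, sketched in Python rather than in the framework's Scala: list every link that the first-link-per-piece rule would discard compared with keeping all links. The separator pattern and the function names here are assumptions for illustration, not the actual DataParserConfig regex or any framework API.

```python
import re

# Matches [[Target]] or [[Target|label]] and captures Target.
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

# Hypothetical separator pattern; the real one may differ.
SPLIT_RE = re.compile(r",|;| and | or ")


def link_targets(text):
    """Return all wiki link targets in the text, in resource-name form."""
    return [t.strip().replace(" ", "_") for t in LINK_RE.findall(text)]


def dropped_links(value):
    """Links present in the raw value that first-link-per-piece would discard."""
    kept = []
    for piece in SPLIT_RE.split(value):
        kept.extend(link_targets(piece)[:1])
    return [t for t in link_targets(value) if t not in kept]


samples = {
    "operated by": ("The Nazi ''[[Schutzstaffel]]'' (SS), the [[Soviet Union|Soviet]] "
                    "[[NKVD]] (after [[World War II]])"),
    "known for": "[[Genocide]] during [[The Holocaust]]",
}

for field, value in samples.items():
    print(field, dropped_links(value))
```

Running this over a larger sample of infobox values would give a list of discarded links that can then be judged one by one as good or bad triples.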

jimkont commented 8 years ago

It certainly removes a lot of non-fitting values, e.g. WW2 here (see here for details ;)). Sometimes the range is not enough for post-processing, especially when the target has no type or the type is similar, e.g. expected Doctor and found Athlete.

VladimirAlexiev commented 8 years ago

The problem is that there's no justification or explanation of the exclusion regex. E.g. I think that most often X Y corresponds to adjective noun, and we want to remove the adjective but keep the noun (e.g. in the above example, we want NKVD, but not Soviet).

Until some experiments are done to prove the value of this exclusion, and properly documented, I vote to remove the regex. (I personally doubt it's possible to exclude undesirable objects using purely regex methods.)

VladimirAlexiev commented 8 years ago

I found relevant stuff at http://www.mail-archive.com/dbpedia-discussion%40lists.sourceforge.net/msg03470.html. But it argues why one should remove X out of X Y. Is that all the justification we have for the regex?