Open VladimirAlexiev opened 8 years ago
Same for https://en.wikipedia.org/w/index.php?title=Bełżec_extermination_camp&action=edit:
| known for = [[Genocide]] during [[The Holocaust]]
results in
http://en.dbpedia.org/resource/Bełżec_extermination_camp dbo:knownFor
http://en.dbpedia.org/resource/Genocide
but where is dbr:The_Holocaust?
The property values are split according to some regexes defined here: https://github.com/dbpedia/extraction-framework/blob/5699d6eb2e111c89268d0f4526c15b1130f33aa2/core/src/main/scala/org/dbpedia/extraction/config/dataparser/DataParserConfig.scala#L25-L25
You mean some part of the value is thrown away?
I see that we miss good triples, but this also ensures that some not-so-good ones are skipped, e.g. [[World War II]] in the first example.
Any ideas on how to overcome this?
Can you explain the logic of this regex?
X (Y)
which holds some logic (a clarification in parentheses perhaps doesn't contribute a useful resource)
X Y
but I don't see the logic of this decision
So, after we split the property value, we take the first available link from each piece. E.g.
The Nazi ''[[Schutzstaffel]]'' (SS), the [[Soviet Union|Soviet]] [[NKVD]] (after [[World War II]])
is broken into:
The Nazi ''[[Schutzstaffel]]'' (SS)
-> takes [[Schutzstaffel]]
the [[Soviet Union|Soviet]] [[NKVD]] (after [[World War II]])
-> takes [[Soviet Union|Soviet]]
If there was an "and"
in the [[Soviet Union|Soviet]] [[NKVD]] (after [[World War II]])
e.g. the [[Soviet Union|Soviet]] and [[NKVD]] (after [[World War II]])
then we would also take [[NKVD]]
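The split-then-take-first-link behaviour described above can be sketched roughly as follows. This is only an illustration, not the framework's code: the separator pattern (commas and the word "and") and the helper name first_links are assumptions, and the real patterns in DataParserConfig.scala may well differ.

```python
import re

# Hypothetical separator pattern: split on commas and on the word "and".
# The real patterns live in DataParserConfig.scala and may differ.
SPLIT_RE = re.compile(r"\s*,\s*|\s+and\s+")

# Matches [[Target]] or [[Target|label]] and captures the target.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def first_links(value):
    """Split a property value into pieces and keep the first link of each piece."""
    links = []
    for piece in SPLIT_RE.split(value):
        match = LINK_RE.search(piece)
        if match:
            links.append(match.group(1).strip())
    return links


operator = ("The Nazi ''[[Schutzstaffel]]'' (SS), "
            "the [[Soviet Union|Soviet]] [[NKVD]] (after [[World War II]])")
print(first_links(operator))

# With an explicit "and", the NKVD link starts its own piece and is kept.
with_and = operator.replace("[[Soviet Union|Soviet]] [[NKVD]]",
                            "[[Soviet Union|Soviet]] and [[NKVD]]")
print(first_links(with_and))
```

Under these assumptions the first call keeps only Schutzstaffel and Soviet Union, while the second also keeps NKVD.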
But what is the logic that a space in X Y should throw away Y? What if someone forgot a comma in a list?
It's very hard to suggest how to improve this regexp without documentation of the original rationale for having it. I would remove it altogether, then analyze what mis-hits this produces, then add it again with specific justification for every throw-away pattern.
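A rough way to start that analysis could be to list, per property value, the links that a first-link-per-piece strategy discards. The sketch below is only that: the separator pattern and the helper name dropped_links are assumptions for illustration, not the framework's actual configuration.

```python
import re

# Hypothetical separator pattern (commas and "and"); the real one may differ.
SPLIT_RE = re.compile(r"\s*,\s*|\s+and\s+")

# Matches [[Target]] or [[Target|label]] and captures the target.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def dropped_links(value):
    """Return link targets in value that first-link-per-piece would discard."""
    kept = set()
    for piece in SPLIT_RE.split(value):
        match = LINK_RE.search(piece)
        if match:
            kept.add(match.group(1).strip())
    # findall returns the captured target of every link in the whole value.
    all_targets = [t.strip() for t in LINK_RE.findall(value)]
    return [t for t in all_targets if t not in kept]


samples = [
    "[[Genocide]] during [[The Holocaust]]",
    "The Nazi ''[[Schutzstaffel]]'' (SS), "
    "the [[Soviet Union|Soviet]] [[NKVD]] (after [[World War II]])",
]
for sample in samples:
    print(dropped_links(sample))
```

Running something like this over a sample of infobox values would give a list of discarded links to review by hand, which could then justify (or not) each throw-away pattern.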
It certainly removes a lot of non-fitting values, e.g. WW2 here (see here for details ;)), and sometimes the range is not enough for post-processing, especially when the target has no type or the type is similar, e.g. expected Doctor and found Athlete.
The problem is that there's no justification or explanation of the exclusion regex.
E.g. I think that most often X Y corresponds to adjective noun, and we want to remove the adjective but keep the noun (e.g. in the above example, we want NKVD, but not Soviet).
Until some experiments are done to prove the value of this exclusion, and properly documented, I vote to remove the regex. (I personally doubt it's possible to exclude undesirable objects using purely regex methods.)
I found relevant stuff at
http://www.mail-archive.com/dbpedia-discussion%40lists.sourceforge.net/msg03470.html
But it argues why one should remove X out of X Y.
Is that all the justification we got for the regex?
I made a mapping for ConcentrationCamp, see testing here: http://mappings.dbpedia.org/server/mappings/en/extractionSamples/Mapping_en:Infobox_concentration_camp
Consider https://en.wikipedia.org/w/index.php?title=Auschwitz_concentration_camp&action=edit and this field:
The extraction result is:
Why are NKVD (and WW2) missing from the list? http://dbpedia.org/resource/NKVD does exist. dbo:operator is defined without domain/range restrictions.