HeidelTime / heideltime

A multilingual, cross-domain temporal tagger developed at the Database Systems Research Group at Heidelberg University.
GNU General Public License v3.0

Incorrectly resolved "2nd or 3rd century BC" -> 01 (AD) #47

Open kno10 opened 7 years ago

kno10 commented 7 years ago

This is the example for date_historic_5c-BCADhint, but it is incorrectly resolved.

The problem is the overlap handling. We have four matches here:

The tail is resolved correctly because the BC match is longer. For the leading part, however, the matching range is set to the beginning only, so by the current logic the two matches are exact duplicates. I suggest preferring the longer timex value (if the values differ), i.e. BC01 over 01, on the assumption that it comes from a more complex match:

else if (t1.getTimexValue().length() > t2.getTimexValue().length()) {
  hsTimexesToRemove.add(t2);
}

(for the diff in my branch, see: b637df326de93fb21f1433b89a7ee8b9a008773b)
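To make the proposed tie-break concrete, here is a minimal, self-contained sketch. It is not HeidelTime's actual code: the `Timex` class, its fields, and the `dropShorterDuplicates` helper are hypothetical stand-ins, and only the comparison on `getTimexValue().length()` is taken from the snippet above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DuplicateTieBreakDemo {
    // Hypothetical stand-in for a timex annotation: a character span plus a resolved value.
    static final class Timex {
        final int begin;
        final int end;
        final String value;

        Timex(int begin, int end, String value) {
            this.begin = begin;
            this.end = end;
            this.value = value;
        }

        String getTimexValue() {
            return value;
        }
    }

    // Among matches that cover exactly the same span, drop the one with the shorter value.
    // Matches with equally long values are left untouched by this sketch.
    static List<Timex> dropShorterDuplicates(List<Timex> timexes) {
        Set<Timex> toRemove = new HashSet<>();
        for (Timex t1 : timexes) {
            for (Timex t2 : timexes) {
                if (t1 == t2) {
                    continue;
                }
                boolean sameSpan = t1.begin == t2.begin && t1.end == t2.end;
                if (sameSpan && t1.getTimexValue().length() > t2.getTimexValue().length()) {
                    toRemove.add(t2);
                }
            }
        }
        List<Timex> kept = new ArrayList<>(timexes);
        kept.removeAll(toRemove);
        return kept;
    }

    public static void main(String[] args) {
        List<Timex> matches = new ArrayList<>();
        matches.add(new Timex(0, 3, "01"));
        matches.add(new Timex(0, 3, "BC01"));
        for (Timex t : dropShorterDuplicates(matches)) {
            System.out.println(t.getTimexValue());
        }
    }
}
```

Note that this heuristic depends only on the length of the value string, not on which rule produced the match, so it would also affect any other pair of rules that yield identical spans.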

JannikStroetgen commented 7 years ago

I would not prefer the longer one in general here. The correct one should be selected if the rules are processed in the correct order. I'll check that...

kno10 commented 7 years ago

I may well have changed the order in my branch. I got rid of the hash maps and store the rules in a list instead, ordered alphabetically. Are they supposed to be executed in the order in which they appear in the input file?

My branch is now much faster: I cut the time for processing all of Wikipedia from 20 hours to less than 6 hours. One of the things I removed is the call to Toolbox.sortByValue(hmPattern), which was repeated for every sentence. But as far as I can tell, it sorts rules lexically (which would place rule_20 before rule_3), so both my branch and the original branch use the same order. Also, file order does not solve this correctly (the -BCAD rules are fairly early). I would need to move the historic rules after the other positive rules section.
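The lexical-ordering effect mentioned above (rule_20 before rule_3) can be reproduced in isolation. The sketch below only compares plain string ordering with an ordering by numeric suffix; it does not call or reimplement Toolbox.sortByValue, and the class name, method names, and rule names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class RuleOrderDemo {
    // Plain String ordering compares character by character, so '2' sorts before '3'.
    static List<String> lexicalOrder(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    // Hypothetical alternative: order by the integer after the last underscore.
    // Assumes every name ends in "_<digits>".
    static List<String> numericOrder(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.comparingInt(
                (String n) -> Integer.parseInt(n.substring(n.lastIndexOf('_') + 1))));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("rule_3", "rule_20", "rule_1");
        System.out.println(lexicalOrder(names)); // [rule_1, rule_20, rule_3]
        System.out.println(numericOrder(names)); // [rule_1, rule_3, rule_20]
    }
}
```

Neither ordering is the same as file order, which is why sorting by name alone cannot guarantee that one rule group runs before another.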

JannikStroetgen commented 7 years ago

Sorry for the delay. Yes, the rules are indeed meant to be executed in a specific order. This becomes important when different rules produce identical matches. I am not saying that your reordering is the reason for the above-mentioned mistake, though. What you could / should do after any modification is validate that you still get identical results on all the evaluation corpora listed on the Wiki page. Instructions on how to reproduce the evaluation results can be found here