HeidelTime / heideltime

A multilingual, cross-domain temporal tagger developed at the Database Systems Research Group at Heidelberg University.
GNU General Public License v3.0

"second half" matches too broad. #48

Open kno10 opened 7 years ago

kno10 commented 7 years ago

The phrase "second half" occurs frequently in sports (and thus in Wikipedia, news articles, books, ...).

E.g. https://en.wikipedia.org/wiki/1958_FIFA_World_Cup_Final

The 1958 FIFA World Cup Final took place in Råsunda Stadium, Solna (near Stockholm), Sweden on 29 June 1958. [...] Sweden took the lead after only 4 minutes after an excellent finish by captain Nils Liedholm. The lead didn't last long however as Vavá equalised just 5 minutes later. On 32 minutes, Vavá scored a similar goal to his first to give Brazil a lead 2–1 at the break. 10 minutes into the second half, Brazil went further in front thanks to a brilliant goal scored by Pelé.

Here, "second half" will be interpreted as 1958-H2. I suggest disabling rule date_r10b because of these false positives. Also, "after only 4 minutes", "on 32 minutes", and "10 minutes into" should probably be relative time references rather than durations.

Also, e.g. in Wikipedia "Karl May":

May's first translated work was the first half of the Orient Cycle into a French daily in 1881.

Suggested rule updates for the time references:

RULENAME="date_r20a",EXTRACTION="\b(?:%reApproximate )?(?:several|a couple of|some|a few|many) %reUnitFine (?:later|into)",NORM_VALUE="FUTURE_REF"
RULENAME="date_r20b",EXTRACTION="\b(?:%reApproximate )?%reNumWord12D %reUnitFine (?:later|into)",NORM_VALUE="UNDEF-REF-%normUnit(group(3))-PLUS-%normDurationNumber(group(2))",NORM_MOD="%normApprox4Dates(group(1))"
RULENAME="date_r20c",EXTRACTION="\b(?:%reApproximate )?(\d+) %reUnitFine (?:later|into)",NORM_VALUE="UNDEF-REF-%normUnit(group(3))-PLUS-group(2)",NORM_MOD="%normApprox4Dates(group(1))"
RULENAME="date_r20d",EXTRACTION="\b(?:%reApproximate )?an? %reUnitFine (?:later|into)",NORM_VALUE="UNDEF-REF-%normUnit(group(2))-PLUS-1",NORM_MOD="%normApprox4Dates(group(1))"
RULENAME="date_r20e",EXTRACTION="\brecent %reUnit",NORM_VALUE="PAST_REF"
RULENAME="date_r20f",EXTRACTION="\b[Oo]n (?:%reApproximate )?(\d+) %reUnitFine",NORM_VALUE="UNDEF-REF-%normUnit(group(3))-PLUS-group(2)",NORM_MOD="%normApprox4Dates(group(1))"
RULENAME="date_r20g",EXTRACTION="\b[Oo]n (?:%reApproximate )?%reNumWord12D %reUnitFine",NORM_VALUE="UNDEF-REF-%normUnit(group(3))-PLUS-%normDurationNumber(group(2))",NORM_MOD="%normApprox4Dates(group(1))"
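
To make the intent of these rules easier to check outside HeidelTime, here is a minimal Python sketch of date_r20c only. The contents of `RE_APPROXIMATE` and `RE_UNIT_FINE` are assumptions standing in for HeidelTime's `%reApproximate` and `%reUnitFine` resources (the real lists are larger), and `norm_unit` and `tag` are hypothetical helpers, not part of HeidelTime.

```python
import re

# Assumed, simplified stand-ins for HeidelTime's pattern resources.
RE_APPROXIMATE = r"(approximately|about|around)"
RE_UNIT_FINE = r"(seconds?|minutes?|hours?|days?)"

# Mirrors the EXTRACTION of date_r20c:
# group 1 = approximation word, group 2 = number, group 3 = unit.
DATE_R20C = re.compile(
    r"\b(?:" + RE_APPROXIMATE + r" )?(\d+) " + RE_UNIT_FINE + r" (?:later|into)"
)


def norm_unit(unit: str) -> str:
    """Hypothetical stand-in for %normUnit: strip a plural 's'."""
    return unit[:-1] if unit.endswith("s") else unit


def tag(text: str) -> list[tuple[str, str]]:
    """Return (matched text, NORM_VALUE) pairs produced by the sketched rule."""
    return [
        (m.group(0), f"UNDEF-REF-{norm_unit(m.group(3))}-PLUS-{m.group(2)}")
        for m in DATE_R20C.finditer(text)
    ]
```

With these assumptions, `tag("10 minutes into the second half")` returns `[("10 minutes into", "UNDEF-REF-minute-PLUS-10")]`, i.e. a relative reference instead of a duration.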

JannikStroetgen commented 7 years ago

I agree with the "second half" issue (the same holds for quarters etc.)

Regarding the minute expressions: we keep following TimeML, and under TimeML these are durations, which can be anchored. The value remains a duration. Whether you agree or not, it is very unlikely that these expressions would be normalized correctly. You would need the start time of the match and further very specific knowledge, e.g., that the 46th minute is not 1 minute later than the 45th minute but about 16 minutes later, because of the half-time break...
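
The game-time point can be illustrated with a small Python sketch. It assumes 45-minute halves and a 15-minute break and ignores stoppage time; `wall_clock` is a hypothetical helper, and the kick-off time used in the usage note below is made up for illustration.

```python
from datetime import datetime, timedelta

HALF_LENGTH = 45   # regulation minutes per half
BREAK_LENGTH = 15  # assumed length of the half-time break, in minutes


def wall_clock(kickoff: datetime, match_minute: int) -> datetime:
    """Approximate wall-clock time at which a match minute is reached.

    Ignores stoppage time, so this is only a rough estimate.
    """
    offset = match_minute
    if match_minute > HALF_LENGTH:
        # Minutes of the second half are shifted by the break.
        offset += BREAK_LENGTH
    return kickoff + timedelta(minutes=offset)
```

For a hypothetical kick-off `k = datetime(1958, 6, 29, 15, 0)`, the difference `wall_clock(k, 46) - wall_clock(k, 45)` is 16 minutes, not 1. A tagger would need both the kick-off time and this domain rule to anchor such expressions.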

kno10 commented 7 years ago

But in the same text, 5 minutes later was mapped differently, into UNDEF-REF-minute-PLUS-5, because of the existing date_r20c rule.

I agree that we usually won't be able to translate them into absolute time points; in particular with game time.

JannikStroetgen commented 7 years ago

I never said that the annotation standard is perfect, just that we try to follow it. And that's how we interpreted it... But in general, it might be worth thinking about "trying to anchor everything" independently of the TIMEX3 annotation standard. But this would probably have to result in a new standard with new kinds of problems.

kno10 commented 7 years ago

Should then UNDEF-REF-minute-PLUS-5 be translated into PT5M with modifier AFTER?
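
For concreteness, the translation asked about here might look like the following Python sketch. It only illustrates the question and is not HeidelTime behaviour: `to_duration` is a hypothetical helper, and mapping MINUS to the modifier BEFORE is an assumption added for symmetry.

```python
import re

# ISO 8601 duration designators; time units need the "T" prefix.
DATE_UNITS = {"day": "D", "week": "W", "month": "M", "year": "Y"}
TIME_UNITS = {"second": "S", "minute": "M", "hour": "H"}

UNDEF_REF = re.compile(r"UNDEF-REF-([a-z]+)-(PLUS|MINUS)-(\d+)")


def to_duration(value: str) -> tuple[str, str]:
    """Map e.g. 'UNDEF-REF-minute-PLUS-5' to (duration value, modifier)."""
    m = UNDEF_REF.fullmatch(value)
    if m is None:
        raise ValueError(f"not a relative reference: {value!r}")
    unit, direction, amount = m.groups()
    if unit in TIME_UNITS:
        iso = f"PT{amount}{TIME_UNITS[unit]}"
    elif unit in DATE_UNITS:
        iso = f"P{amount}{DATE_UNITS[unit]}"
    else:
        raise ValueError(f"unknown unit: {unit!r}")
    # Assumption: PLUS becomes AFTER, MINUS becomes BEFORE.
    mod = "AFTER" if direction == "PLUS" else "BEFORE"
    return iso, mod
```

Here `to_duration("UNDEF-REF-minute-PLUS-5")` returns `("PT5M", "AFTER")`.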

JannikStroetgen commented 7 years ago

Well, deviating from the annotation standard, things such as "later" (two days later) or "ago" (three weeks ago) are handled as part of the temporal expression. Using your examples:

4 minutes after an excellent finish by captain Nils Liedholm. --> type=duration; "after" is not part of the temporal expression, it belongs to "an excellent finish ..."

The lead didn't last long however as Vavá equalised just 5 minutes later. --> type=time

On 32 minutes --> type=duration; "on" is not part of the expression as it's a preposition preceding the temporal expression

It's due to different linguistic realizations, so I would not bet that everything is fully consistent - in particular not in the system's output, but probably not even in the annotation guidelines.