HeidelTime / heideltime

A multilingual, cross-domain temporal tagger developed at the Database Systems Research Group at Heidelberg University.
GNU General Public License v3.0

Singular noun vs. cardinal number > 1 (Arabic) #30

Closed. bisoldi closed this issue 9 years ago.

bisoldi commented 9 years ago

I ran into a situation where a BBC article in Arabic had the phrase:

قبل 6 ساعة

This literally means "before 6 hour".

The Stanford POS tagger tags it as NN, CD, NN. This is incorrect: the first word (on the right) is a preposition.

HeidelTime interprets this as PT1H, presumably because it read the word "hour" and, because the word is singular, ignored the cardinal number and defaulted to 1.

With the phrase:

قبل 6 ساعات

HeidelTime correctly picks up the cardinal number and the plural "hours" and interprets it as PT6H.

There is precedent in Arabic for not having to distinguish the singular and plural forms, though I'm told this applies only when the noun comes before the cardinal number. It's possible the BBC is taking some liberty with that rule, or my understanding is incorrect.

Either way, do you have any comment on this? Is this something that can be taken into account?

An example of the article is: http://www.bbc.com/arabic/middleeast/2015/07/150730_yemen_fighting (see at the top, underneath the headline).

I posted the issue with Stanford mentioned above on SO: http://stackoverflow.com/questions/31731575/corenlp-arabic-time-duration-misses-alpha-numeric-numbers

Thanks.

JannikStroetgen commented 9 years ago

Hi,

Thanks for opening this issue. I think it should be quite simple to modify HeidelTime's resources to catch such expressions. At the moment, I cannot test whether modifying the respective rule has a negative influence in other cases, but for now you can make the following changes:

Please open the file resources/arabic/rules/resources_rules_durationrules.txt and modify the EXTRACTION part of rule "duration_r2b" from

EXTRACTION="(%reApproximate )?([\d]+) %reUnitT"

to

EXTRACTION="(%reApproximate )?([\d]+) (%reUnitT|%reUnitTwiceT|%reUnitT)"

This should work in your example, but as mentioned above, I cannot test this right now. Before releasing a new HeidelTime version, we will test whether this change has a negative influence on other corpora, and we will include it if that's not the case.
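To illustrate the idea behind widening the unit alternation, here is a minimal Python sketch. It is not HeidelTime code: the unit tables, the helper normalize_duration, and the assumption that the narrower pattern covers only the plural form are all hypothetical stand-ins for the real pattern resources. It also models only a single extraction rule, so where HeidelTime fell back to PT1H, the sketch simply reports no match.

```python
import re

# Hypothetical unit tables standing in for HeidelTime's %reUnit... pattern
# resources; the real entries live in the Arabic resource files.
HOUR_SINGULAR = "\u0633\u0627\u0639\u0629"      # ساعة, "hour"
HOUR_PLURAL = "\u0633\u0627\u0639\u0627\u062a"  # ساعات, "hours"
BEFORE = "\u0642\u0628\u0644"                   # قبل, "before"

# Assumption: the narrower rule only knows the plural form.
PLURAL_ONLY = {HOUR_PLURAL: "H"}
SINGULAR_AND_PLURAL = {HOUR_SINGULAR: "H", HOUR_PLURAL: "H"}


def normalize_duration(text, units):
    """Return an ISO 8601 duration such as 'PT6H', or None if nothing matches."""
    # Build "<digits> <unit>" with one alternative per known unit word.
    alternation = "|".join(re.escape(unit) for unit in units)
    match = re.search(r"(\d+) (" + alternation + r")", text)
    if match is None:
        return None
    return "PT{}{}".format(int(match.group(1)), units[match.group(2)])


singular_phrase = BEFORE + " 6 " + HOUR_SINGULAR
plural_phrase = BEFORE + " 6 " + HOUR_PLURAL

print(normalize_duration(plural_phrase, PLURAL_ONLY))
print(normalize_duration(singular_phrase, PLURAL_ONLY))
print(normalize_duration(singular_phrase, SINGULAR_AND_PLURAL))
```

Under these assumptions the plural phrase matches the narrow pattern, the singular phrase does not, and the widened alternation picks up the cardinal number for the singular phrase as well.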

Thanks! Jannik