HeidelTime / heideltime

A multilingual, cross-domain temporal tagger developed at the Database Systems Research Group at Heidelberg University.
GNU General Public License v3.0

Allow passing pre-tokenized text to HeidelTime #89

Open narnold-cl opened 2 years ago

narnold-cl commented 2 years ago

Hello,

thank you very much for your work. We are using HeidelTime in a dynamic setting and have run into several problems. We will list them here as issues, together with suggested design changes that should address them. Most should be straightforward to implement for someone familiar with the project.

Is this software still under active development? If not, would you mind translating these high-level proposals to a lower level and pointing out which parts of the implementation would need to change?

Standalone's dependencies

Regarding the standalone version: as far as I understand, HeidelTime needs tokenized text to work, but it does not accept pre-tokenized text as input. Instead, it contains hard-coded dependencies on external taggers (for tokenization as well as for POS tagging), which have to be installed separately.

This has several disadvantages:

Especially in dynamic contexts this introduces a huge cost that could be easily avoided.

Parsing tokenized text is quite simple: for example, it could be supplied in a "one token per line" format, or in something like CoNLL. Accepting text that is already POS-tagged should not be much harder to implement, and it would remove the hard-coded external dependencies entirely, without necessarily reducing performance.
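To make the proposed input format concrete, here is a minimal sketch in Python. It is not HeidelTime code and does not use any HeidelTime API. The function name `parse_pretokenized`, the `Token` type, and the exact layout (one token per line, an optional tab-separated POS tag, a blank line between sentences) are assumptions chosen for illustration only.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    text: str
    pos: Optional[str]  # None when the input carries no POS column


def parse_pretokenized(raw: str) -> List[List[Token]]:
    """Parse one-token-per-line text into sentences of tokens.

    Assumed (hypothetical) layout: one token per line, an optional
    tab-separated POS tag in the second column, and a blank line
    between sentences.
    """
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in raw.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split("\t")
        pos = columns[1] if len(columns) > 1 and columns[1] else None
        current.append(Token(columns[0], pos))
    if current:
        # Input that does not end with a blank line still yields its last sentence.
        sentences.append(current)
    return sentences


sample = (
    "We\tPRP\nmet\tVBD\non\tIN\nMonday\tNNP\n.\t.\n"
    "\n"
    "See\tVB\nyou\tPRP\ntomorrow\tNN\n"
)
sentences = parse_pretokenized(sample)
```

The same function handles input without POS tags (each token then has `pos=None`), so a single reader could cover both the tokenized-only and the already-tagged case.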

Solution: