Hello,
Thank you very much for your work. We are using HeidelTime in a dynamic setting and have run into several problems. We will list them here as issues, together with suggested design changes that should address them. Most should be straightforward to implement for someone familiar with the project.
Is this software still under active development? If not, would you mind translating these high-level proposals into lower-level terms and pointing out which parts of the implementation would need to change?
Standalone's dependencies
Regarding the standalone version: as far as I understand, HeidelTime needs tokenized text to work, but it does not accept pretokenized text as input. Instead, it has hard-coded dependencies on external taggers (for tokenization as well as POS tagging), which need to be installed separately.
This has several disadvantages:
Tokenization gets out of sync unless you use exactly the same tokenizer (and even then you have to run the tokenizer twice).
The tokens used internally are lost, because the TimeML version in use does not support explicit token tags.
The dependencies are hard-coded: you either use those specific tokenizers/taggers or none at all.
It is not actually standalone.
Currently, generating the TimeML for a single text file involves loading a large language model for tokenization/POS tagging. Tagging another file repeats the whole procedure.
Especially in dynamic contexts, this introduces a huge cost that could easily be avoided.
Parsing tokenized text is quite simple. For example, it could be given in a "one token per line" format or in something like CoNLL. It should not be much harder to support already POS-tagged text in the same way, which would remove the hard-coded external dependencies entirely without necessarily reducing performance.
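To illustrate how little parsing this would need, here is a minimal sketch in Python (not HeidelTime's own code, and not tied to its internals). The function name `read_sentences` and its parameters are made up for this example. It assumes one token per line, tab-separated columns when more than the token is given, and a blank line between sentences.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# (surface form, POS tag or None when the input carries no tags)
Token = Tuple[str, Optional[str]]


def read_sentences(
    lines: Iterable[str],
    form_col: int = 0,
    pos_col: Optional[int] = None,
    comment_prefix: Optional[str] = None,
) -> Iterator[List[Token]]:
    """Yield sentences from one-token-per-line or CoNLL-like input.

    Hypothetical helper: column indices are assumptions the caller must
    set to match the actual file layout.
    """
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if comment_prefix is not None and line.startswith(comment_prefix):
            # Only skip comment lines when the caller asks for it, so a
            # literal "#" token survives in plain one-token-per-line input.
            continue
        cols = line.split("\t")
        form = cols[form_col]
        pos = cols[pos_col] if pos_col is not None and pos_col < len(cols) else None
        sentence.append((form, pos))
    if sentence:
        yield sentence
```

For plain pretokenized input the defaults are enough. For a CoNLL-style file where the token is in the second column and the POS tag in the fourth, one would call `read_sentences(f, form_col=1, pos_col=3, comment_prefix="#")`.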
Solution:
[ ] Provide a way to parse pretokenized text instead of invoking an external tokenizer internally.
[ ] Add a CLI option to define the input data format (raw / pretokenized / POS-tagged (CoNLL)).
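To make the second item concrete, here is a rough sketch of what such an option could look like, written in Python for brevity. Everything in it is hypothetical: the program name `heideltime-wrapper`, the `--input-format` flag, and the step names are invented for illustration and are not existing HeidelTime options.

```python
import argparse
from typing import List

# Hypothetical mapping: which preprocessing steps are still needed
# for each input format. "conll" here means already POS-tagged input.
STEPS_BY_FORMAT = {
    "raw": ["tokenize", "pos_tag"],
    "pretokenized": ["pos_tag"],
    "conll": [],
}


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser with a made-up --input-format option."""
    parser = argparse.ArgumentParser(prog="heideltime-wrapper")
    parser.add_argument(
        "--input-format",
        choices=sorted(STEPS_BY_FORMAT),
        default="raw",
        help="how much preprocessing the input document already has",
    )
    parser.add_argument("document", help="path to the input file")
    return parser


def required_steps(input_format: str) -> List[str]:
    """Return the external steps that would still have to run."""
    return list(STEPS_BY_FORMAT[input_format])
```

With `--input-format conll`, no external tokenizer or tagger would have to be loaded at all, which is the point of the proposal.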