HeidelTime / heideltime

A multilingual, cross-domain temporal tagger developed at the Database Systems Research Group at Heidelberg University.
GNU General Public License v3.0

Consider supporting ISO-TimeML standard #92

Open narnold-cl opened 2 years ago

narnold-cl commented 2 years ago

The ISO-TimeML version of the TimeML Standard offers (at least) the following benefits:

Read about it here: https://lexitron.nectec.or.th/public/LREC-2010_Malta/pdf/55_Paper.pdf

If supporting the complete standard is too much work, it would still be nice to have standoff annotations. We currently calculate those manually and fuzzy-match them to the token and sentence boundaries detected by our own preprocessing pipeline.

Compromise to add standoff information to actual inline TimeML annotations

A simple fix for this specific problem would be to (optionally) add the character positions to the tagged spans, like so:

# input text:
"Today I feel great."

# currently generated TimeML output:
'<?xml version="1.0"?><!DOCTYPE TimeML SYSTEM "TimeML.dtd"><TimeML>
<TIMEX3 tid="t1" type="DATE" value="2021-11-16">Today</TIMEX3> I feel great.
</TimeML>'

# Proposed additional tag-attributes (orig_start_char, orig_end_char):
<TIMEX3 tid="t1" type="DATE" value="2021-11-16" orig_start_char="0" orig_end_char="5">Today</TIMEX3>

This records that the original span tagged by the TIMEX3 with tid t1 runs from character 0 (inclusive) to character 5 (exclusive) of the input text.

Again, this information is needed to synchronize HeidelTime's tokenization, which is used internally and then discarded, with your own tokenization.
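
To illustrate why character offsets are enough for this synchronization, here is a minimal sketch in Python. The regex tokenizer is only a stand-in for an external preprocessing pipeline, and the function names `tokenize` and `tokens_in_span` are hypothetical, not part of HeidelTime:

```python
import re


def tokenize(text):
    """Stand-in tokenizer (assumption): returns (token, start, end) triples."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def tokens_in_span(tokens, start, end):
    """Indices of tokens overlapping the half-open character span [start, end)."""
    return [i for i, (_, s, e) in enumerate(tokens) if s < end and e > start]


text = "Today I feel great."
tokens = tokenize(text)
# Offsets as they would appear in the proposed orig_start_char / orig_end_char
print(tokens_in_span(tokens, 0, 5))
```

With half-open offsets like these, no fuzzy matching is needed: any tokenizer that keeps its own character positions can look up the affected tokens directly.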

The information for those additional attributes should be easily accessible at runtime.

We have already implemented a first draft of a parsing algorithm that incrementally reconstructs those character-based span indices after the fact, but it feels like a lot of duplicate work to rebuild information that was already available while HeidelTime was running.
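
For reference, a post-hoc reconstruction of this kind could look roughly like the sketch below. It is not our actual implementation, only a simplified illustration. It assumes the TIMEX3 elements are direct, non-nested children of the TimeML root and that the tagged output contains the original text unchanged apart from wrapper whitespace. The function name `timex_char_spans` is hypothetical.

```python
import xml.etree.ElementTree as ET


def timex_char_spans(timeml, original):
    """Return (tid, start, end) per TIMEX3, as half-open offsets into `original`."""
    root = ET.fromstring(timeml)
    pieces = [root.text or ""]      # text fragments in document order
    pos = len(pieces[0])            # running offset into the joined fragments
    spans = []
    for child in root:
        inner = "".join(child.itertext())
        if child.tag == "TIMEX3":
            spans.append((child.get("tid"), pos, pos + len(inner)))
        tail = child.tail or ""
        pieces.extend([inner, tail])
        pos += len(inner) + len(tail)
    plain = "".join(pieces)
    # The wrapper may add whitespace around the text, so locate the original.
    shift = plain.find(original)
    if shift < 0:
        raise ValueError("tagged output does not contain the original text")
    return [(tid, s - shift, e - shift) for tid, s, e in spans]


timeml = ('<?xml version="1.0"?><!DOCTYPE TimeML SYSTEM "TimeML.dtd"><TimeML>\n'
          '<TIMEX3 tid="t1" type="DATE" value="2021-11-16">Today</TIMEX3>'
          ' I feel great.\n</TimeML>')
print(timex_char_spans(timeml, "Today I feel great."))
```

Even this simplified version has to re-derive offsets that the tagger already knew, and it breaks as soon as the output text differs from the input in any other way.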