cltl / NAF-4-Development

Character offsets for representing discourse units (paragraphs, headers, etc.) #1

Closed sarnoult closed 3 years ago

sarnoult commented 4 years ago

We are also interested in representing paragraphs and headers for Clariah+, but I see that the current idea is to anchor discourse units to spans of tokens. I would argue for using character offsets instead, since these discourse units can already be present before a tokenizer comes into play.

Our input documents are in TEI format, where these kinds of discourse units are already annotated, and we want to preserve their identifiers. Conversion from TEI would then generate a NAF file with a raw-text layer and a discourse-units layer. In our current pipeline, the tokenizer is only called afterwards, on each paragraph independently.
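
To make the proposal concrete, here is a minimal sketch of such a conversion, not our actual converter. It assumes the discourse units are TEI `head` and `p` elements carrying `xml:id`, that units are joined with a newline in the raw text, and that offsets are counted in Unicode code points. The function name `tei_to_raw_and_units` and the dictionary keys are hypothetical and are not taken from the NAF schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Tiny made-up TEI document, for illustration only.
SAMPLE_TEI = f"""<TEI xmlns="{TEI_NS}">
  <text><body>
    <head xml:id="h1">A Header</head>
    <p xml:id="p1">First paragraph.</p>
    <p xml:id="p2">Second paragraph.</p>
  </body></text>
</TEI>"""


def tei_to_raw_and_units(tei_xml, separator="\n"):
    """Return (raw_text, units); each unit keeps its TEI id and character offsets.

    Assumption: only <head> and <p> count as discourse units, and they are
    not nested inside each other.
    """
    unit_tags = {f"{{{TEI_NS}}}head", f"{{{TEI_NS}}}p"}
    root = ET.fromstring(tei_xml)
    raw_parts, units, offset = [], [], 0
    for elem in root.iter():  # document order
        if elem.tag not in unit_tags:
            continue
        text = "".join(elem.itertext()).strip()
        units.append({
            "id": elem.get(XML_ID),           # identifier preserved from TEI
            "type": elem.tag.split("}")[1],   # local name: "head" or "p"
            "offset": offset,                 # start position in the raw text
            "length": len(text),
        })
        raw_parts.append(text)
        offset += len(text) + len(separator)
    return separator.join(raw_parts), units


raw, units = tei_to_raw_and_units(SAMPLE_TEI)
for unit in units:
    # Each unit can be recovered from the raw text without any tokens.
    span = raw[unit["offset"]:unit["offset"] + unit["length"]]
    print(unit["id"], unit["type"], unit["offset"], unit["length"], repr(span))
```

Since each unit is just a slice of the raw text, a tokenizer can later be run on one slice at a time, and its token offsets can be mapped back to the document by adding the unit's `offset`.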

sarnoult commented 3 years ago

tunits have character offsets in v3.2