We are also interested in representing paragraphs and headers for Clariah+, but I see that the current idea is to anchor discourse units to spans of tokens. I would argue for using character offsets instead, since these discourse units can exist before a tokenizer comes into play.
Our input documents are in TEI format, where these kinds of discourse units are already annotated, and we want to preserve their identifiers. Conversion from TEI would then generate a NAF file with a raw-text layer and a discourse-units layer. In our current pipeline, the tokenizer is only called afterwards, for each paragraph independently.
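
To make the proposal concrete, here is a rough sketch of such a conversion. The `discourseUnits` layer and its `unit`/`offset`/`length` attributes are purely illustrative, not existing NAF elements; the point is only that each unit keeps its TEI `xml:id` and points into the raw text by character offsets, so the layer can be written before any tokenizer has run:

```python
# Sketch of a TEI -> NAF conversion with a hypothetical discourse-units layer.
# Element and attribute names below are illustrative, not part of the NAF spec.
from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def tei_to_naf(tei_path):
    tei = etree.parse(tei_path)
    naf = etree.Element("NAF")
    raw = etree.SubElement(naf, "raw")
    units = etree.SubElement(naf, "discourseUnits")  # hypothetical layer

    text_parts = []
    offset = 0
    # Treat TEI paragraphs and headers as discourse units.
    for elem in tei.iter(f"{{{TEI_NS}}}p", f"{{{TEI_NS}}}head"):
        content = "".join(elem.itertext())
        unit = etree.SubElement(units, "unit")
        # Preserve the original TEI identifier, if present.
        if elem.get(XML_ID):
            unit.set("id", elem.get(XML_ID))
        unit.set("type", "paragraph" if elem.tag.endswith("}p") else "header")
        # Anchor the unit to the raw text by character offsets,
        # independently of any later tokenization.
        unit.set("offset", str(offset))
        unit.set("length", str(len(content)))
        text_parts.append(content)
        offset += len(content) + 1  # +1 for the separating newline
    raw.text = "\n".join(text_parts)
    return naf
```

The tokenizer could then be run per discourse unit afterwards, with each token's offsets still expressed against the same raw text.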