jan-niestadt closed this issue 1 year ago
For those wishing to contribute: DocIndexerTabular is the class that handles tabular formats like TSV and CSV. The two options could be added to the fileTypeOptions that can be specified in a .blf.yaml format definition file (see here).
Perhaps good to know: we developed https://bitbucket.org/fryske-akademy/taaldatabanken/src/master/udpipe-tdb/teitagger/. It converts CoNLL-U to TEI (using our namespace for linguistic attributes, which will be released soon).
I may contribute to this issue because of this idea (though I think going via TEI/XPath 3 is more powerful):
Sounds interesting! @JessedeDoes (who originally asked about this format), did you see this?
(Requested by @JessedeDoes) Expand TSV input type to be able to deal with the CoNLL-U format.
The format is basically a TSV with some special features (point 2 and 3):
So we should probably add two options, e.g. blankLinesMarkSentenceBoundaries (default: false) and commentLineCharacter (if this is the first character on a line, skip that line; default: none).
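To make the intended behaviour of the two proposed options concrete, here is a minimal sketch in Python. It is not BlackLab code and does not reflect how DocIndexerTabular is implemented; the function name `read_tabular` and its parameter names are hypothetical and only mirror the option names suggested above. It assumes tab-separated columns and that a comment line is one whose first character is the configured comment character.

```python
from typing import Iterable, List, Optional

Row = List[str]


def read_tabular(
    lines: Iterable[str],
    blank_lines_mark_sentence_boundaries: bool = False,  # hypothetical option
    comment_line_character: Optional[str] = None,        # hypothetical option
) -> List[List[Row]]:
    """Group tab-separated rows into sentences (illustrative sketch only)."""
    sentences: List[List[Row]] = []
    current: List[Row] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Skip the line if it starts with the configured comment character.
        if comment_line_character and line.startswith(comment_line_character):
            continue
        if line.strip() == "":
            # A blank line closes the current sentence only if the option is on;
            # otherwise it is simply ignored.
            if blank_lines_mark_sentence_boundaries and current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split("\t"))
    if current:
        sentences.append(current)
    return sentences


# A tiny CoNLL-U-like sample (only four columns, for brevity).
SAMPLE = [
    "# sent_id = 1",
    "1\tThe\tthe\tDET",
    "2\tcat\tcat\tNOUN",
    "",
    "# sent_id = 2",
    "1\tSleeps\tsleep\tVERB",
]

conllu_like = read_tabular(
    SAMPLE,
    blank_lines_mark_sentence_boundaries=True,
    comment_line_character="#",
)
plain_tsv = read_tabular(SAMPLE)  # defaults: no comment handling, no boundaries
```

With both options set, the sample yields two sentences and the `#` lines are dropped. With the defaults, the comment lines are kept as ordinary single-column rows and everything ends up in one group, which matches the proposal that existing TSV behaviour stays unchanged unless the options are enabled.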