INL / BlackLab

Linguistic search for large annotated text corpora, based on Apache Lucene
http://inl.github.io/BlackLab/
Apache License 2.0

Support for CoNLL-U format #201

Closed (jan-niestadt closed this issue 1 year ago)

jan-niestadt commented 6 years ago

(Requested by @JessedeDoes) Expand TSV input type to be able to deal with the CoNLL-U format.

The format is basically TSV with some special features (points 2 and 3):

  1. Word lines containing the annotation of a word/token in 10 fields separated by single tab characters; see below.
  2. Blank lines marking sentence boundaries.
  3. Comment lines starting with hash (#).

So we should probably add two options, e.g. blankLinesMarkSentenceBoundaries (default: false) and commentLineCharacter (any line whose first character matches is skipped; default: none).
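To make the intended behaviour concrete, here is a minimal Python sketch of how a reader might apply the two proposed options. It is only an illustration, not BlackLab code: the function `read_sentences` and its snake_case parameters are hypothetical stand-ins for the proposed options, and the column names are the ten standard CoNLL-U fields.

```python
from typing import Dict, Iterable, Iterator, List, Optional

# The 10 tab-separated fields of a CoNLL-U word line, in order.
CONLLU_FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                 "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def read_sentences(
    lines: Iterable[str],
    blank_lines_mark_sentence_boundaries: bool = False,  # hypothetical option
    comment_line_character: Optional[str] = None,        # hypothetical option
) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences, each a list of tokens (dicts keyed by field name)."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Skip comment lines if a comment character is configured.
        if comment_line_character and line.startswith(comment_line_character):
            continue
        if not line.strip():
            # A blank line closes the current sentence only if enabled;
            # otherwise it is simply ignored.
            if blank_lines_mark_sentence_boundaries and sentence:
                yield sentence
                sentence = []
            continue
        # A word line: split on single tabs and pair values with field names.
        sentence.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if sentence:
        yield sentence


sample = [
    "# text = Hello world",
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_",
    "2\tworld\tworld\tNOUN\t_\t_\t1\tvocative\t_\t_",
    "",
    "# text = Bye",
    "1\tBye\tbye\tINTJ\t_\t_\t0\troot\t_\t_",
    "",
]
sentences = list(read_sentences(sample, True, "#"))
```

With both options left at their defaults, the same input would be treated as plain TSV: blank lines are ignored and comment lines are parsed as ordinary rows, which is why both options are needed for CoNLL-U.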

jan-niestadt commented 3 years ago

For those wishing to contribute: DocIndexerTabular is the class that handles tabular formats like TSV and CSV. The two options could be added to the fileTypeOptions that can be specified in a .blf.yaml format definition file (see here).

eduarddrenth commented 1 year ago

It may be good to know that we developed https://bitbucket.org/fryske-akademy/taaldatabanken/src/master/udpipe-tdb/teitagger/. It converts CoNLL-U to TEI (with our namespace for linguistic attributes, which will be released soon).

I may contribute to this issue because of this idea (though I think going via TEI/XPath 3 is more powerful): [image]

jan-niestadt commented 1 year ago

Sounds interesting! @JessedeDoes (who originally asked about this format), did you see this?