sent_id format and parallel treebanks

martinpopel commented 8 years ago

In #273, it was suggested that each sentence in CoNLL-U should have its ID encoded in a header comment in a standardized way, e.g. `# sent_id = 123`. This issue is about the format of the ID itself (i.e. the `123` part) and also about the related question of storing parallel treebanks in CoNLL-U.

My motivation

The CoNLL-U format should be used not only for storing UD treebanks (frozen in v1.2, 1.3 etc.) but also as a data interchange format for various NLP tools, in all intermediate stages of the pipeline. See #242.
I would like to store parallel treebanks with word alignment in CoNLL-U format. For many reasons (e.g. efficient parallel processing, serialization, streaming, consistency and alignment) it is useful to have all the languages in one file (interleaved as: sent1-langA, sent1-langB, sent2-langA, sent2-langB etc.). We plan to release the Czech-English treebank CzEng 1.6 with 62M sentences in this format. See a sample. By parallel treebanks I mean not only different languages and paraphrases, but also alternative annotations of the same sentence, e.g. gold and automatic.
I would like to store word-alignment and coreference (and possibly other types of relations) links in CoNLL-U files. Coreference can go across sentences. This has some consequences for IDs. I plan to open a separate issue for this soon.
I would like to keep the CoNLL-U format simple (not bloated like CoNLL2009).

My proposal

In short: `bundle_id/zone`. An example of a valid sent_id is `f123-s9/en_udpipe`.

The first part (f123-s9) is called bundle_id, and in parallel treebanks it is shared by all translations of the same sentence (which form a so-called bundle). The internal structure of bundle_id can reflect the original treebank numbering, e.g. here f123 is the filename and s9 is the 9th bundle in that file. I suggest that the bundle_id format be restricted by the regex `^[a-zA-Z0-9-]+$`. We can make it less strict if needed for some legacy data, but it should contain neither whitespace nor a slash.
The second part (en_udpipe) is the so-called zone, and it can be omitted in treebanks where each bundle has just one zone (the zone is then an empty string). If present, it must be separated from the bundle_id by a slash and it must match the regex `^[a-z-]+(_[a-zA-Z0-9-]+)?$`. The internal structure of zone is language_selector, where the _selector part is optional.
language is an ISO 639 (or rather IETF) language code
selector is any string matching `^[a-zA-Z0-9-]+$`, which makes it possible to store parallel sentences in the same language. E.g. udpipe indicates that the tree was parsed using UDPipe. Another example: the selectors ref and mt may distinguish a reference translation from a machine translation.
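
To make the proposed syntax concrete, here is a minimal Python sketch that splits a sent_id into its parts and rejects IDs that do not match the patterns above. It is only an illustration, not the UD validator. The function name `parse_sent_id` is hypothetical, the underscore in the zone pattern is assumed from the language_selector description, and rejecting a trailing slash with an empty zone is a choice made here that the proposal does not spell out.

```python
import re

# Patterns taken from the proposal; all names below are illustrative only.
BUNDLE_ID_RE = re.compile(r"[a-zA-Z0-9-]+")
# Assumption: language and selector are joined by an underscore.
ZONE_RE = re.compile(r"[a-z-]+(_[a-zA-Z0-9-]+)?")


def parse_sent_id(sent_id):
    """Split a sent_id into (bundle_id, language, selector).

    language and selector are empty strings when absent.
    Raises ValueError if the sent_id does not match the proposed syntax.
    """
    bundle_id, slash, zone = sent_id.partition("/")
    if not BUNDLE_ID_RE.fullmatch(bundle_id):
        raise ValueError(f"invalid bundle_id: {bundle_id!r}")
    if not slash:
        # No zone given: a plain ID such as "123" is still valid.
        return bundle_id, "", ""
    if not ZONE_RE.fullmatch(zone):
        # Also rejects an empty zone after the slash (a choice made here).
        raise ValueError(f"invalid zone: {zone!r}")
    language, _, selector = zone.partition("_")
    return bundle_id, language, selector
```

For the example above, `parse_sent_id("f123-s9/en_udpipe")` yields the bundle_id `f123-s9`, the language `en` and the selector `udpipe`, while a plain integer ID such as `123` is accepted with an empty zone.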
Notes

I know not everyone needs to work with (multi-)parallel treebanks stored in one file, so this proposal may sound too complex. However, note that

You can use simple IDs (e.g. integers) as sent_id and just one language (one zone) per file. It is still valid according to the proposal.
I think IDs should be optional in CoNLL-U (though I would like to see them in all UD v2 treebanks). All UD-compatible tools should handle files without IDs. This proposal is just for those who need IDs, so that they use them in the same standardized way, which allows interoperability.
We have a real need for such format (e.g. releasing the CzEng treebank in CoNLL-U, evaluation and visualization tools, an MT system).
We are working on a Python+Perl+Java API for UD called Udapi, which benefits from the proposal and also makes it easy to use (e.g. extract trees from one zone and store in a separate file). We want to invite the UD community to contribute to Udapi soon.
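
As a rough illustration of the "extract trees from one zone" use case, the following plain-Python sketch filters an interleaved CoNLL-U file by zone. It does not use Udapi and is not meant to reflect its API. The function name `extract_zone` is hypothetical, and the sketch assumes that every sentence carries a `# sent_id = ...` comment and that sentences are separated by blank lines.

```python
import re

SENT_ID_PREFIX = "# sent_id = "


def extract_zone(conllu_text, zone):
    """Return only those sentences whose sent_id has the given zone.

    Pass zone="" to keep sentences whose sent_id has no zone at all.
    """
    kept = []
    # CoNLL-U sentences are separated by blank lines.
    for block in re.split(r"\n{2,}", conllu_text.strip()):
        for line in block.splitlines():
            if line.startswith(SENT_ID_PREFIX):
                sent_id = line[len(SENT_ID_PREFIX):].strip()
                # Everything after the first slash is the zone.
                _, _, sent_zone = sent_id.partition("/")
                if sent_zone == zone:
                    kept.append(block)
                break  # only the first sent_id comment is considered
    # Each CoNLL-U sentence, including the last, ends with a blank line.
    return "\n\n".join(kept) + "\n\n" if kept else ""
```

Calling `extract_zone(text, "en_udpipe")` on an interleaved file would keep only the sentences from that zone, which could then be written to a separate file.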

dan-zeman commented 8 years ago

+1

From the UD perspective, this proposal just reserves certain characters ("/") for specialized usage, which goes beyond the current scope of the UD project, yet I find it useful, and it has actually been deployed already. We will have to modify sentence IDs in Arabic UD, where the slash is used, but that should not be a problem.

The specification should go into version 2 of the UD guidelines. (To keep the format.md page focused on UD, I would just say that the slash has a special meaning in IDs, and put the details on a separate page linked from there. However, the validator would have to check the entire syntax.)

manning commented 8 years ago

Approve!

spyysalo commented 7 years ago

The "slash is special" constraint made it to http://universaldependencies.org/v2/conll-u.html but doesn't appear in the current v2 draft of the format page. Was this rejected or just forgotten? ( @jnivre )

jnivre commented 7 years ago

Just forgotten. Can you add it?

spyysalo commented 7 years ago

Updated to add

In sentence ids, the slash character ("/") is reserved for specialized downstream use and should be avoided in UD treebanks.

which I hope is sufficient for the initial release. If anyone is interested in documenting and linking these use cases, please do!
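
For treebank maintainers who want to check their data against this sentence, a minimal Python sketch along the following lines could list the affected IDs. It is not the official UD validator; the function name `find_slash_ids` is hypothetical, and the sketch assumes the `# sent_id = ...` comment form discussed in this issue.

```python
SENT_ID_PREFIX = "# sent_id = "


def find_slash_ids(conllu_text):
    """Return all sent_id values that contain the reserved slash."""
    found = []
    for line in conllu_text.splitlines():
        if line.startswith(SENT_ID_PREFIX):
            sent_id = line[len(SENT_ID_PREFIX):].strip()
            if "/" in sent_id:
                found.append(sent_id)
    return found
```

An empty result means that no sentence ID in the file uses the slash.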

UniversalDependencies / docs

sent_id format and parallel treebanks #321
