Closed martinpopel closed 7 years ago
+1
From the UD perspective, this proposal just reserves certain characters ("/") for specialized usage, which goes beyond the current scope of the UD project, yet I find it useful, and it has actually been deployed already. We will have to modify sentence IDs in Arabic UD, where the slash is used, but that should not be a problem.
The specification should go into version 2 of the UD guidelines. (To keep the format.md page focused on UD, I would just say that the slash has a special meaning in IDs, and put the details on a separate page linked from there. However, the validator would have to check the entire syntax.)
Approve!
The "slash is special" constraint made it to http://universaldependencies.org/v2/conll-u.html but doesn't appear in the current v2 draft of the format page. Was this rejected or just forgotten? (@jnivre)
Just forgotten. Can you add it?
Updated to add
> In sentence ids, the slash character ("/") is reserved for specialized downstream use and should be avoided in UD treebanks.

which I hope is sufficient for the initial release. If anyone is interested in documenting and linking these use cases, please do!
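For anyone who wants to check existing data against the sentence above, here is a minimal sketch of a slash check. It assumes sentence ids appear in comment lines of the form `# sent_id = ...`; the function name `sent_ids_with_slash` is made up for illustration and is not part of the official UD validator.

```python
import re

# Assumed comment syntax: "# sent_id = <id>", with flexible spacing.
SENT_ID_RE = re.compile(r"^#\s*sent_id\s*=\s*(\S+)\s*$")


def sent_ids_with_slash(lines):
    """Return (line_number, sent_id) pairs for sentence ids that contain a slash."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        match = SENT_ID_RE.match(line)
        if match and "/" in match.group(1):
            flagged.append((number, match.group(1)))
    return flagged
```

It can be fed an open file object or any list of lines, e.g. `sent_ids_with_slash(open(path, encoding="utf-8"))`, where `path` is whatever CoNLL-U file you want to inspect.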
In #273, it was suggested that each sentence in CoNLL-U should have its ID encoded in a header (comment) in a standardized way, e.g. `# sent_id = 123`. This issue is about the format of the ID itself (i.e. the `123` part) and also about a related question of storing parallel treebanks in CoNLL-U.
My motivation
My proposal
In short: `bundle_id/zone`. An example of a valid sent_id is `f123-s9/en_udpipe`.
The first part (`f123-s9`) is called bundle_id, and in parallel treebanks it is shared by all translations of the same sentence (which form a so-called bundle). The internal structure of bundle_id can reflect the original treebank numbering, e.g. here f123 is the filename and s9 is the 9th bundle in that file. I suggest the bundle_id format be restricted by the regex `[a-zA-Z0-9-]+`. We can make it less strict if needed for some legacy data, but it should contain neither whitespace nor a slash.
The second part (`en_udpipe`) is the so-called zone, and it can be omitted in treebanks where each bundle has just one zone (so the zone is an empty string). If present, it must be separated from the bundle_id by a slash and it must match the regex `^[a-z-]+(_[a-zA-Z0-9-]+)?$`. The internal structure of zone is `language_selector`, where the `_selector` part is optional.
The selector (matching `^[a-zA-Z0-9-]+$`) allows parallel sentences in the same language to be stored. E.g. `udpipe` indicates that the tree was parsed using UDPipe. Another example: the selectors `ref` and `mt` may distinguish a reference translation from a machine translation.
Notes
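To make the proposed format concrete, here is a minimal sketch of how a sent_id could be split into bundle_id, language and selector. The names `SentId` and `parse_sent_id` are hypothetical, and the sketch rests on two assumptions: that language and selector are joined by an underscore (as the `language_selector` description and the `en_udpipe` example suggest), and that a trailing slash with an empty zone is rejected rather than silently accepted.

```python
import re
from typing import NamedTuple, Optional

# Patterns taken from the proposal; the "_" separator in the zone is an assumption
# based on the "language_selector" description.
BUNDLE_ID_RE = re.compile(r"[a-zA-Z0-9-]+")
ZONE_RE = re.compile(r"([a-z-]+)(?:_([a-zA-Z0-9-]+))?")


class SentId(NamedTuple):
    bundle_id: str
    language: Optional[str]   # None when the zone is omitted
    selector: Optional[str]   # None when the zone has no selector


def parse_sent_id(sent_id: str) -> SentId:
    """Split 'bundle_id/zone' into its parts, raising ValueError if malformed."""
    bundle_id, slash, zone = sent_id.partition("/")
    if not BUNDLE_ID_RE.fullmatch(bundle_id):
        raise ValueError(f"invalid bundle_id in {sent_id!r}")
    if not slash:
        # No zone given: the treebank has one zone per bundle.
        return SentId(bundle_id, None, None)
    match = ZONE_RE.fullmatch(zone)
    if match is None:
        # Also covers an empty zone after the slash and any second slash.
        raise ValueError(f"invalid zone in {sent_id!r}")
    return SentId(bundle_id, match.group(1), match.group(2))
```

For example, `parse_sent_id("f123-s9/en_udpipe")` returns `SentId(bundle_id='f123-s9', language='en', selector='udpipe')`, while `parse_sent_id("f123-s9")` returns a `SentId` with both `language` and `selector` set to `None`.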
I know not everyone needs to work with (multi-) parallel treebanks stored in one file, so this proposal may sound too complex. However, note that