UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

sent_id format and parallel treebanks #321

Closed martinpopel closed 7 years ago

martinpopel commented 8 years ago

In #273, it was suggested that each sentence in CoNLL-U should have its ID encoded in a header (comment) line in a standardized way, e.g. # sent_id = 123. This issue is about the format of the ID itself (i.e. the 123 part) and about the related question of storing parallel treebanks in CoNLL-U.

My motivation

In short: bundle_id/zone. An example of a valid sent_id is f123-s9/en_udpipe.
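As a rough illustration of the bundle_id/zone shape, here is a minimal Python sketch that reads a sent_id comment line and splits the ID at the first slash. The function name `parse_sent_id` and the regular expression are assumptions made for this example; they are not taken from the UD validator or any existing tool, and the sketch does not check anything about the internal structure of the zone part.

```python
import re

# Assumed comment shape, following the "# sent_id = 123" example in this thread.
SENT_ID_RE = re.compile(r"^#\s*sent_id\s*=\s*(\S+)\s*$")


def parse_sent_id(comment_line):
    """Return (bundle_id, zone) from a '# sent_id = ...' line, or None.

    Hypothetical helper: zone is None when the ID contains no slash.
    """
    match = SENT_ID_RE.match(comment_line)
    if match is None:
        return None  # not a sent_id comment
    sent_id = match.group(1)
    # Split at the first slash only; everything after it is treated as the zone.
    bundle_id, _, zone = sent_id.partition("/")
    return bundle_id, (zone or None)


if __name__ == "__main__":
    print(parse_sent_id("# sent_id = f123-s9/en_udpipe"))
    print(parse_sent_id("# sent_id = 123"))
```

Splitting at the first slash keeps plain IDs such as 123 valid: they simply come back with no zone.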

I know not everyone needs to work with (multi-) parallel treebanks stored in one file, so this proposal may sound too complex. However, note that

dan-zeman commented 8 years ago

+1

From the UD perspective, this proposal just reserves certain characters ("/") for specialized usage, which goes beyond the current scope of the UD project, yet I find it useful, and it has actually been deployed already. We will have to modify sentence IDs in Arabic UD, where the slash is used, but that should not be a problem.

The specification should go into version 2 of the UD guidelines. (To keep the format.md page focused on UD, I would just say that the slash has a special meaning in IDs, and put the details on a separate page linked from there. However, the validator would have to check the entire syntax.)

manning commented 8 years ago

Approve!

spyysalo commented 7 years ago

The "slash is special" constraint made it to http://universaldependencies.org/v2/conll-u.html but doesn't appear in the current v2 draft of the format page. Was this rejected or just forgotten? (@jnivre)

jnivre commented 7 years ago

Just forgotten. Can you add it?

spyysalo commented 7 years ago

Updated to add

In sentence ids, the slash character ("/") is reserved for specialized downstream use and should be avoided in UD treebanks.

which I hope is sufficient for the initial release. If anyone is interested in documenting and linking these use cases, please do!
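For anyone who wants to check a treebank against the sentence above, a small Python sketch along these lines might do. It only looks for a slash in sent_id comment lines; the function name `find_slash_sent_ids` is made up for this example, and this is not how the official UD validator is implemented.

```python
def find_slash_sent_ids(conllu_text):
    """Return (line_number, sent_id) pairs whose sent_id contains '/'.

    Hypothetical helper; it inspects only '# sent_id = ...' comment lines.
    """
    offenders = []
    for line_number, line in enumerate(conllu_text.splitlines(), start=1):
        if not line.startswith("#"):
            continue  # token lines and blank lines are ignored
        key, sep, value = line[1:].partition("=")
        if sep and key.strip() == "sent_id" and "/" in value:
            offenders.append((line_number, value.strip()))
    return offenders


if __name__ == "__main__":
    sample = "# sent_id = 1\n1\tHello\t_\n\n# sent_id = doc/2\n"
    for line_number, sent_id in find_slash_sent_ids(sample):
        print(f"line {line_number}: sent_id '{sent_id}' contains a slash")
```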