UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/

Recommended format for alignments? #846

Open nschneid opened 2 years ago

nschneid commented 2 years ago

Has anyone developed a standard for representing crosslinguistic token-level alignments in UD?

I am familiar with https://github.com/macleginn/exploring-clmd-divergences, but it seems that those alignments are stored in a separate file. It might be nice to have them in sentence-level metadata lines. A straw man example, using self:other notation for each pair:

# sent_id = mycorpus.10
# align = :en/othercorpus.21 1:15 2-3:16 :en/othercorpus.22 4:1 4.1:2,3 5,6,8:5 7:_ _:6

(keeping in mind that the current sentence may align to multiple parallel sentences, and tokens may not align 1-to-1, and it may make sense to align multiword tokens like 2-3 or empty nodes like 4.1).
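To make the intended semantics concrete, here is a rough parsing sketch in Python. The interpretation of `_`, ranges like 2-3, comma-separated ids, and the sentence-id switch items is just my assumption about the straw-man notation above, not a spec:

```python
def parse_align(value):
    """Parse the straw-man '# align =' value into (self_ids, target_sentence, other_ids) triples.

    Assumptions (not a spec): items are space-separated self:other pairs; an item
    with an empty self part (e.g. ':en/othercorpus.21') switches the target sentence
    for the pairs that follow; '_' marks an unaligned side; ',' separates multiple ids;
    ranges like '2-3' (multiword tokens) and ids like '4.1' (empty nodes) stay as strings.
    """
    alignments, target = [], None
    for item in value.split():
        self_part, _, other_part = item.partition(":")
        if self_part == "" and "/" in other_part:
            target = other_part                      # e.g. 'en/othercorpus.22'
            continue
        self_ids = [] if self_part == "_" else self_part.split(",")
        other_ids = [] if other_part == "_" else other_part.split(",")
        alignments.append((self_ids, target, other_ids))
    return alignments

print(parse_align(":en/othercorpus.21 1:15 2-3:16 "
                  ":en/othercorpus.22 4:1 4.1:2,3 5,6,8:5 7:_ _:6"))
```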

Does this seem like an idea worth pursuing or is it best left to individual projects to define their own formats?

martinpopel commented 2 years ago

When releasing the CzEng 1.6 parallel Czech-English treebank in 2016, we decided to store the alignments in the DEPS column, which we wanted to rename to LINKS, so that word alignment could be stored there in addition to enhanced dependencies (the two could be distinguished by the relation label: all types of alignment would start with align). See a CoNLL-U sample and its visualization (generated using Udapi). However, there were several voices against generalizing the DEPS column this way, so it is surely not the recommended format. It would be acceptable in the MISC column (adding an attribute name, e.g. Align=). The alignment would thus be stored on the source nodes.
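For concreteness, a minimal sketch of the MISC variant. The Align= attribute name is the one suggested above; the value syntax (zone/sent_id:token_id) is only an illustration, not an existing convention:

```python
# One token line (10 tab-separated CoNLL-U columns); the last column is MISC
# with a hypothetical Align= attribute pointing at token 15 of an aligned sentence.
line = "1\tPes\tpes\tNOUN\t_\t_\t2\tnsubj\t_\tAlign=en/othercorpus.21:15"

misc = line.split("\t")[9]
attrs = dict(kv.split("=", 1) for kv in misc.split("|") if "=" in kv)
print(attrs.get("Align"))   # -> en/othercorpus.21:15
```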

In #321 (and elsewhere), I suggested a way to store a parallel treebank in a single file and encode the structure in sent_id. It would still be possible to split the file and store each language independently, but there are many use cases where a single file/stream is beneficial (e.g. piping CoNLL-U to stdout/stdin).
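A rough sketch of the single-file idea, assuming sent_id values of the form bundle_id/zone (the ids below are invented for illustration):

```python
from collections import defaultdict

# Invented sent_ids encoding <bundle_id>/<zone>; grouping by the bundle_id part
# reconstructs the parallel bundles from a single CoNLL-U stream.
sent_ids = ["mycorpus-10/cs", "mycorpus-10/en", "mycorpus-11/cs", "mycorpus-11/en"]

bundles = defaultdict(dict)
for sid in sent_ids:
    bundle_id, _, zone = sid.rpartition("/")
    bundles[bundle_id][zone] = sid

for bundle_id, zones in sorted(bundles.items()):
    print(bundle_id, sorted(zones))   # each bundle holds one sentence per language/zone
```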

At the moment I don't have any plans for using word alignment in CoNLL-U. Originally, I planned to store coreference links similarly to alignment (just within the same "zone"/language), but in the meantime we adopted (and adapted) the GUM style of coreference annotation. So in the end I don't have strong preferences about the UD-recommended way of storing word alignment.

nschneid commented 2 years ago

@martinpopel thanks for the historical context. To be clear, the "bundle_id/zone" strategy assumes all languages follow the same sentence segmentation, right? This is not the case in some of the data I'm working with.

martinpopel commented 2 years ago

the "bundle_id/zone" strategy assumes all languages follow the same sentence segmentation, right?

The strategy gives an implicit sentence alignment, but it can be used even for non-1-to-1 alignments. E.g. if sentence A is aligned to two sentences B1 and B2, but A-B1 has more word alignments than A-B2, we can put A into the same bundle as B1, and the second bundle will contain only B2 (so the A zone will be empty in the second bundle). The word alignments can still go across sentence (bundle) boundaries, so no information is lost.
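Schematically (zone names and ids are invented):

```python
# Bundle 1: A (zone 'en') together with B1 (zone 'cs'); bundle 2: only B2,
# i.e. its 'en' zone is empty. Word alignments from tokens of A may still
# point to tokens of B2 in the neighbouring bundle.
bundles = [
    {"en": "doc-1/en", "cs": "doc-1/cs"},   # A + B1
    {"cs": "doc-2/cs"},                     # B2 alone
]
```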

That said, users may prefer simplicity, i.e. 1-to-1 aligned sentences, at the cost of adapting the segmentation guidelines in one of the languages. In our study, most of the non-1-to-1 sentence alignments could be "solved" by additionally segmenting at semicolons and some commas, not only at full stops. Cases of non-monotonic sentence alignment were extremely rare.

arademaker commented 2 years ago

I believe the general approach for storing alignments and any other additional layer of information should be to use the CoNLL-U Plus (conllup) format, right?

dan-zeman commented 2 years ago

I believe the general approach for storing alignments and any other additional layer of information should be to use the CoNLL-U Plus (conllup) format, right?

Yes. (Or you pack it in the MISC column.)
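A hedged sketch of what the CoNLL-U Plus route might look like: a global.columns header declaring an extra column for alignment. The column name MYCORPUS:ALIGN and the value syntax are invented here for illustration, not an existing convention:

```python
# A CoNLL-U Plus fragment as a string: the extra MYCORPUS:ALIGN column carries the
# alignment, while the ten standard columns stay untouched.
conllup_fragment = (
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC MYCORPUS:ALIGN\n"
    "# sent_id = mycorpus.10\n"
    "1\tPes\tpes\tNOUN\t_\t_\t2\tnsubj\t_\t_\ten/othercorpus.21:15\n"
)
header = conllup_fragment.splitlines()[0]
columns = header.split("= ", 1)[1].split()
print(columns.index("MYCORPUS:ALIGN"))   # position of the extra alignment column
```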

amir-zeldes commented 2 years ago

I would love to use conllup for this and more, but sadly the 'pack it into MISC' option is the only one you can use if you want to keep the data as part of the main UD repo of the corpus, since conllup is not accepted by the validator...