Orange-OpenSource / conllueditor

ConllEditor is a tool to edit dependency syntax trees in CoNLL-U format.
BSD 3-Clause "New" or "Revised" License

MTW => MWT #15

Closed by martinpopel 2 years ago

martinpopel commented 2 years ago

I believe the documentation uses the wrong terminology: what is being referred to as MTW (multi-token word) is actually MWT (multi-word token). See the official UD documentation, which says:

We refer to such cases as multiword tokens because a single orthographic token corresponds to multiple (syntactic) words.

The MWT name is used in several other projects (e.g. Udapi).

It is very misleading for the users because both MWT and MTW are allowed in UD:

In exceptional cases, it may be necessary to go in the other direction, and combine several orthographic tokens into a single syntactic word. Starting from v2 of the UD guidelines, such multitoken words are allowed for a restricted class of phenomena, such as numerical expressions like 20 000 and abbreviations like e. g., as long as these phenomena are approved and clearly specified in the language-specific documentation.

So an MTW is simply a word whose form (and lemma) includes a space. I don't think ConlluEditor needs any special support for MTWs (except for running the validator and checking that MTWs are used only in the exceptional cases listed for a given language).
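To make the distinction concrete, here is a minimal sketch of how the two cases look in a CoNLL-U file. An MWT is a line whose ID is a range (e.g. `1-2`) followed by the lines for its syntactic words, whereas an MTW is an ordinary word line whose FORM contains a space. The sample fragment and the helper name `classify_lines` are made up for illustration (they are not taken from ConlluEditor), and all columns other than ID, FORM and LEMMA are left as `_`.

```python
# Illustrative CoNLL-U fragment (tab-separated, 10 columns).
# "du" is a multiword token (MWT) covering the words "de" and "le";
# "20 000" is a multitoken word (MTW): one word whose form has a space.
SAMPLE = (
    "# text = du 20 000\n"
    "1-2\tdu\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\tde\tde\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tle\tle\t_\t_\t_\t_\t_\t_\t_\n"
    "3\t20 000\t20 000\t_\t_\t_\t_\t_\t_\t_\n"
)


def classify_lines(text):
    """Return (form, kind) pairs, where kind is 'MWT', 'MTW' or 'word'."""
    result = []
    for line in text.splitlines():
        # Skip blank lines and sentence-level comments.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        token_id, form = cols[0], cols[1]
        if "-" in token_id:
            kind = "MWT"   # range ID: one token, several words
        elif " " in form:
            kind = "MTW"   # single word whose form contains a space
        else:
            kind = "word"
        result.append((form, kind))
    return result


for form, kind in classify_lines(SAMPLE):
    print(f"{form!r}: {kind}")
```

The sketch ignores empty nodes (IDs such as `8.1`) and does not check whether an MTW is permitted for the language in question; that is the validator's job.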

jheinecke commented 2 years ago

Hi! You're right. I had started to correct that a while ago but did not finish properly. I'm doing it now and will commit as soon as it's done. Thanks for the reminder :-)

arademaker commented 2 years ago

I thought MTW should be handled with the dependency relation goeswith.

martinpopel commented 2 years ago

goeswith is only for text that is not well edited, where the extra space is an error, e.g. "with out" or "never the less".

MTW is for expressions like "20 000" and abbreviations like "e. g.", where the space is not an error (although in these two examples some styles/languages allow writing it without the space as well).
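As a rough illustration of the difference, the sketch below glues `goeswith` parts back onto their head to recover the intended word, while leaving an MTW untouched because it is already a single word. The row layout `(id, form, head, deprel)`, the sample rows and the function name `merge_goeswith` are assumptions made for this example. It also assumes that every `goeswith` part attaches to the first part of the split word.

```python
def merge_goeswith(rows):
    """Join wrongly split tokens; rows are (id, form, head, deprel) tuples."""
    forms = {tok_id: form for tok_id, form, _, _ in rows}
    for tok_id, form, head, deprel in rows:
        if deprel == "goeswith":
            # Append this part to its head and drop it as a separate word.
            forms[head] += form
            del forms[tok_id]
    return [forms[tok_id] for tok_id in sorted(forms)]


# Hypothetical rows: "never the less" is an editing error (goeswith),
# "20 000" is a legitimate MTW stored as one word with a space.
ROWS = [
    (1, "never", 0, "root"),
    (2, "the", 1, "goeswith"),
    (3, "less", 1, "goeswith"),
    (4, "20 000", 1, "dep"),
]

print(merge_goeswith(ROWS))
```

The head and relation values of the non-`goeswith` rows are placeholders; only the `goeswith` attachments matter for the merge.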

This issue is about neither real MTWs nor goeswith. It is about MWTs, which were mistakenly referred to as MTWs.

jheinecke commented 2 years ago

I corrected the documentation and traces of mtw in the code, so it should be clearer now that we deal with MWT here.