Open nschneid opened 1 month ago
Thanks for the synopsis, that's helpful!
I'm not very attached to the X upos, though I do find NUM for things like "A" a bit strange. Regarding tokenization, it seems odd to split up the existing tokens only to connect them again with a trivial relation, and I don't think it makes sense to say that "1." contains a separate period token; for me "1" and "1." are basically interchangeable. I will add that the "1." cases are much more numerous in ON than the two-token cases, so presumably tokenizers trained on ON favor not splitting. By contrast, splitting brackets is indeed the preference in ON, which I don't love because you can get unmatched brackets for things like "1)"... It also doesn't gel well with the strong ON preference not to split periods IMO.
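To make the two conventions concrete, here is a minimal Python sketch. The helpers `tokenize_marker` and `has_unmatched_bracket` are hypothetical names invented for illustration; they are not taken from any treebank tooling and do not reproduce any real tokenizer's behavior. The sketch only shows how a marker like "1." or "1)" comes out when periods and brackets are split independently.

```python
import re


def tokenize_marker(marker: str, split_period: bool, split_brackets: bool) -> list[str]:
    """Hypothetical helper: tokenize a list item marker under a chosen convention."""
    tokens = [marker]
    if split_brackets:
        # Split off every "(" and ")" as its own token, dropping empty pieces.
        tokens = [t for t in re.split(r"([()])", marker) if t]
    if split_period:
        split_tokens = []
        for t in tokens:
            # Detach a trailing period, e.g. "1." -> "1", "."
            if len(t) > 1 and t.endswith("."):
                split_tokens.extend([t[:-1], "."])
            else:
                split_tokens.append(t)
        tokens = split_tokens
    return tokens


def has_unmatched_bracket(tokens: list[str]) -> bool:
    """True if the bracket tokens do not pair up, as with a lone ")"."""
    return tokens.count("(") != tokens.count(")")


# Brackets split, periods kept attached (the combination described above):
for marker in ["1.", "1)", "(a)"]:
    toks = tokenize_marker(marker, split_period=False, split_brackets=True)
    print(marker, toks, "unmatched bracket" if has_unmatched_bracket(toks) else "")
```

Under these assumptions, "1." stays a single token while "1)" becomes two tokens with a closing bracket that has no opening partner, which is the mismatch being pointed out.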
In any case, even if we don't change any of the English datasets, it might be nice to get a statement of principle from UD about what the core group recommends for new languages.
The tokenization of markers like "3." and "(a)" is not consistent across English treebanks.
I think we've agreed to leave it alone (previous discussion), but for posterity, here is some info that I collected back when we were discussing the new policy for list item markers: