UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

tokenization of list item markers #543

Open nschneid opened 1 month ago

nschneid commented 1 month ago

The tokenization of markers like "3." and "(a)" is not consistent across English treebanks.

I think we've agreed to leave it alone (previous discussion), but for posterity, here is some info that I collected back when we were discussing the new policy for list item markers:

amir-zeldes commented 1 month ago

Thanks for the synopsis, that's helpful!

I'm not very attached to the X upos, though I do find NUM for things like "A" a bit strange. Regarding tokenization, it seems odd to split up the existing tokens only to connect them again with a trivial relation, and I don't think it makes sense to say that "1." contains a separate period token: for me "1" and "1." are basically interchangeable. I will add that the single-token "1." cases are much more numerous in ON than the two-token cases, so presumably tokenizers trained on ON favor not splitting. By contrast, splitting brackets is indeed the preference in ON, which I don't love because you can get unmatched brackets for things like "1)"... It also doesn't gel well with the strong ON preference not to split periods, IMO.
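To make the two policies concrete, here is a minimal Python sketch of how a marker could be tokenized under each one. Everything in it is an assumption for illustration: the `MARKER` pattern, the `tokenize_marker` function, and its `split_period` / `split_brackets` flags are hypothetical and are not taken from any treebank's actual tokenizer. The defaults mirror the preference described above (periods attached, brackets split).

```python
import re

# Hypothetical pattern for list item markers such as "3.", "1)", "(a)":
# an optional "(", a number or single letter, an optional "." or ")".
MARKER = re.compile(r"(\()?([0-9]+|[A-Za-z])([.)])?")


def tokenize_marker(marker, split_period=False, split_brackets=True):
    """Split a list item marker into tokens under a chosen policy.

    Strings that do not look like a list marker are returned unchanged
    as a single token.
    """
    m = MARKER.fullmatch(marker)
    if m is None:
        return [marker]

    opener = m.group(1) or ""
    label = m.group(2)
    closer = m.group(3) or ""

    tokens = []

    # Opening bracket: either its own token or glued to the label.
    if opener and split_brackets:
        tokens.append(opener)
        core = label
    else:
        core = opener + label

    # Closing delimiter: a period follows split_period,
    # a bracket follows split_brackets.
    split_closer = split_period if closer == "." else split_brackets
    if closer and split_closer:
        tokens.append(core)
        tokens.append(closer)
    else:
        tokens.append(core + closer)

    return tokens


print(tokenize_marker("1."))                        # period stays attached
print(tokenize_marker("1.", split_period=True))     # period split off
print(tokenize_marker("1)"))                        # yields an unmatched ")"
print(tokenize_marker("(a)"))                       # brackets split on both sides
print(tokenize_marker("(a)", split_brackets=False)) # marker kept whole
```

With the default flags, "1)" comes out as a label plus a lone closing bracket, which is the unmatched-bracket case mentioned above, while "1." stays a single token.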

In any case, even if we don't change any of the English datasets, it might be nice to get a statement of principle from UD about what the core group's recommendation is for new languages.