tokenization of list item markers

UniversalDependencies / UD_English-EWT

English data

Creative Commons Attribution Share Alike 4.0 International


The tokenization of markers like "3." and "(a)" is not consistent across English treebanks.

I think we've agreed to leave it alone (previous discussion), but for posterity, here is some info that I collected back when we were discussing the new policy for list item markers:

OntoNotes tokenization: In newer OntoNotes documents, the LS (list item marker) token generally includes periods and hyphens after the letter/number (“1.”, “1-”). But in OntoNotes-WSJ (and PTB Revised), the LS is strictly the letter or number, with no associated punctuation characters. Parentheses are always tokenized separately.
In UD-EWT, LS tokens are treated as NUM (even if they are letters), with all punctuation characters tokenized separately. This seems to follow the original Penn EWT & WSJ tokenization. I am reluctant to mess with UD-EWT tokenization because changing it would produce misalignments with the Penn trees. (Also, “(” as the superficial head of a goeswith would be awkward.)
In UD-GUM and GENTLE, LS tokens are always tokenized as a single token, including parentheses and any other characters.
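To make the three conventions concrete, here is a rough Python sketch of how each would split a marker. It only illustrates the descriptions above and is not the tokenizer used by any of these corpora. The scheme names and the `tokenize_marker` function are hypothetical, and the regular expressions are simplifying assumptions.

```python
import re
from typing import List

# Hypothetical labels for the conventions described above; these are not
# identifiers used by any treebank or tool.
SPLIT_ALL = "split_all"                    # OntoNotes-WSJ, PTB Revised, UD-EWT
KEEP_PERIOD_HYPHEN = "keep_period_hyphen"  # newer OntoNotes documents
SINGLE_TOKEN = "single_token"              # UD-GUM, GENTLE


def tokenize_marker(marker: str, scheme: str) -> List[str]:
    """Split a list item marker such as "1." or "(a)" under one convention."""
    if scheme == SINGLE_TOKEN:
        # The whole marker, brackets included, stays one token.
        return [marker]
    if scheme == KEEP_PERIOD_HYPHEN:
        # Parentheses are split off; everything else (including a trailing
        # period or hyphen) stays attached to the letter/number.
        return re.findall(r"[()]|[^()]+", marker)
    if scheme == SPLIT_ALL:
        # Every punctuation character becomes its own token.
        return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", marker)
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    for marker in ["1.", "1-", "(a)", "1)"]:
        for scheme in (SPLIT_ALL, KEEP_PERIOD_HYPHEN, SINGLE_TOKEN):
            print(marker, scheme, tokenize_marker(marker, scheme))
```

Under these assumptions, "1." comes out as two tokens only in the split-all scheme, while "1)" yields a lone ")" token in both schemes that split parentheses.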

Thanks for the synopsis, that's helpful!

I'm not very attached to the X UPOS, though I do find NUM for things like "A" a bit strange. Regarding tokenization, it seems odd to split up the existing tokens only to connect them again with a trivial relation, and I don't think it makes sense to say that "1." contains a separate period token; for me "1" and "1." are basically interchangeable. I will add that the "1." cases are much more numerous in OntoNotes (ON) than the two-token cases, so presumably tokenizers trained on ON favor not splitting. By contrast, splitting brackets is indeed the preference in ON, which I don't love because you can get unmatched brackets for things like "1)"... It also doesn't gel well with the strong ON preference not to split periods, IMO.

In any case, even if we don't change any of the English datasets, it might be nice to get a statement of principle from UD about what is the core group's recommendation for new languages.

tokenization of list item markers #543