UniversalDependencies / UD_Arabic-NYUAD

Untokenized text with original whitespace #3

Open nikitakit opened 3 years ago

nikitakit commented 3 years ago

I noticed that this treebank does not have annotations for the original whitespace (i.e. no `SpaceAfter=No` entries in the MISC column).

It looks like the LDC distributions of ATB contain the original text that the treebank is based on, and there are a few cases (mostly related to punctuation and numbers) where the text doesn't put any whitespace between treebank tokens.

In case anyone is interested, I wrote a script to add whitespace information to the CoNLL-U files based on the original text as distributed by LDC.
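For illustration only, here is a minimal sketch of the general idea (it is not the script itself). It assumes each token's FORM appears verbatim and in order in the raw sentence text, which real ATB data will not always satisfy (clitic segmentation and normalization need extra handling). The function names `space_after_flags` and `annotate_misc` are hypothetical.

```python
def space_after_flags(forms, raw_text):
    """Return one bool per token: True if whitespace (or end of text) follows it.

    Assumption: every form occurs verbatim in raw_text, in order.
    """
    flags = []
    pos = 0
    for form in forms:
        # Skip any whitespace before the next token.
        while pos < len(raw_text) and raw_text[pos].isspace():
            pos += 1
        if not raw_text.startswith(form, pos):
            raise ValueError(f"token {form!r} not found at offset {pos}")
        pos += len(form)
        # End of text is treated like whitespace (no SpaceAfter=No needed).
        flags.append(pos >= len(raw_text) or raw_text[pos].isspace())
    return flags


def annotate_misc(conllu_lines, raw_text):
    """Add SpaceAfter=No to the MISC column (10th field) for one sentence."""
    rows = [line.split("\t") for line in conllu_lines]
    # Only plain word lines (integer IDs); comment lines, multiword-token
    # ranges such as "1-2", and empty nodes such as "1.1" pass through unchanged.
    word_rows = [r for r in rows if len(r) == 10 and r[0].isdigit()]
    flags = space_after_flags([r[1] for r in word_rows], raw_text)
    for row, has_space in zip(word_rows, flags):
        if not has_space:
            feats = [] if row[9] == "_" else row[9].split("|")
            if "SpaceAfter=No" not in feats:
                feats.append("SpaceAfter=No")
            row[9] = "|".join(feats)
    return ["\t".join(r) for r in rows]
```

Raising an error on a mismatch, rather than silently skipping, makes it easier to spot sentences where the tokenization and the raw text disagree.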

AngledLuffa commented 6 months ago

@nikitakit thank you from across the bay!