Closed rhdunn closed 10 months ago
Note: I'm using the following Python code to perform this check; it could be added to neaten.py:
def get_misc(token, attr, default):
    # Return a MISC attribute, or the default if MISC or the attribute is absent.
    if 'misc' not in token:
        return default
    misc = token['misc']
    if misc is None or attr not in misc:
        return default
    return misc[attr]


def validate_sentence_text(sent, language):
    sent_id = sent.metadata['sent_id']
    token_text = ""  # text rebuilt from MWT forms plus words outside any MWT
    word_text = ""   # text rebuilt from the individual words only
    need_space = False
    last_mwt_id = 0
    need_mwt_space = False
    for token in sent:
        if type(token['id']) is int:
            # An ordinary word, possibly inside a multiword token (MWT) range.
            if need_space:
                token_text += " "
                word_text += " "
            if last_mwt_id >= token['id']:
                if last_mwt_id == token['id']:
                    # Last word of the MWT: spacing comes from the MWT line.
                    need_space = need_mwt_space
                else:
                    need_space = False
            else:
                need_space = get_misc(token, 'SpaceAfter', 'Yes') == 'Yes'
            if last_mwt_id < token['id']:
                token_text += token['form']
            word_text += token['form']
        elif '-' in token['id']:
            # An MWT range line: its form is used for the token stream only.
            if need_space:
                token_text += " "
                word_text += " "
            need_space = False
            last_mwt_id = token['id'][2]
            need_mwt_space = get_misc(token, 'SpaceAfter', 'Yes') == 'Yes'
            token_text += token['form']
    if 'text' in sent.metadata:
        if token_text != sent.metadata['text']:
            print(f"ERROR: Sentence {sent_id} text does not match the token sequence.")
            print(f"... Expect: {sent.metadata['text']}")
            print(f"... Actual: {token_text}")
        if language == 'en' and word_text != sent.metadata['text']:
            print(f"ERROR: Sentence {sent_id} text does not match the word sequence.")
            print(f"... Expect: {sent.metadata['text']}")
            print(f"... Actual: {word_text}")
    else:
        print(f"ERROR: Sentence {sent_id} is missing sentence text.")
There are some weird characters in EWT—this is intentional to preserve the exact string from the original text: UniversalDependencies/UD_English-EWT#83
I suppose this no-break space should ideally be recorded on the preceding word token with the SpacesAfter attribute. That would enable reconstruction of the original string.
That makes sense. It should be easy to modify the above script to check for and use a SpacesAfter annotation.
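A rough sketch of what that modification might look like, assuming MISC is available as a dict (or None) as in the script above. The helper names `decode_spaces` and `space_after` are hypothetical, and the escape table is an assumed subset (including the `\uXXXX` form mentioned elsewhere in this thread) that should be checked against the UD MISC documentation before relying on it:

```python
import re

# Assumed subset of the UD SpacesAfter escapes; verify against the UD docs.
UD_ESCAPES = {"s": " ", "t": "\t", "n": "\n", "r": "\r", "p": "|", "\\": "\\"}
ESCAPE_RE = re.compile(r"\\(u[0-9A-Fa-f]{4}|.)", re.DOTALL)


def decode_spaces(value):
    """Turn an escaped SpacesAfter value into the literal whitespace string."""
    def replace(match):
        body = match.group(1)
        if len(body) == 5:  # the \uXXXX form, e.g. \u00A0
            return chr(int(body[1:], 16))
        # Unknown escapes are left untouched rather than guessed at.
        return UD_ESCAPES.get(body, match.group(0))
    return ESCAPE_RE.sub(replace, value)


def space_after(misc):
    """Return the exact string that follows a token in the sentence text."""
    misc = misc or {}
    if "SpacesAfter" in misc:
        return decode_spaces(misc["SpacesAfter"])
    return "" if misc.get("SpaceAfter") == "No" else " "
```

With something like this, the boolean `need_space` in the script could become a string holding the pending separator, which is appended before the next form instead of a hard-coded `" "`.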
TBC, this is the only sentence in EWT with the NBSP character. It is now reflected in a SpacesAfter feature. Thanks
SpacesAfter use backslash escapes, but I don't know of such an escape for NBSP. So I've put it in quotes.
Oof, sorry to hear that. Looking at the misc link you posted, it looks like the specific escaping is custom to UD treebanks anyway, considering the use of \p, so what if we add a new one? \S for example (although that makes the sequences case sensitive), \b as the unused character from nbsp...
Probably allowing the validator to accept NBSP at the end of a token is the wrong answer, as it would cause problems with any tool which expects those fields to not have whitespace (strip() being a common function call).
NBSP has the Unicode value of U+00A0, so would something like \x00A0 work?
It sounds like the UniversalDependencies/docs#982 discussion has converged on \u00A0, which is Python-compatible. I'll switch it to that.
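For reference, a small illustration of how Python itself reads the escapes mentioned in this thread. This only shows Python string semantics, not how any UD tool parses MISC values, and the variable names are made up for the example:

```python
# "\u00A0" and "\xA0" both denote the single NBSP character.
nbsp = "\u00A0"
same_as_hex = "\xA0"

# "\x" consumes exactly two hex digits, so "\x00A0" is NUL followed by "A0".
not_nbsp = "\x00A0"

# A value read from a CoNLL-U file contains the backslash literally;
# it can be decoded explicitly when needed.
raw = r"\u00A0"
decoded = raw.encode("ascii").decode("unicode_escape")

print(nbsp == same_as_hex, len(not_nbsp), decoded == nbsp)
```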
The sentence newsgroup-groups.google.com_n3td3v_e874a1e5eb995654_ENG_20060120_052200-0011 contains an NBSP (non-breaking space) character between "have" and "been" -- this means that the text metadata does not match the token/word stream text reconstruction.
It may be helpful here for neaten.py to check that the sentence text only contains single U+0020 space characters. Better yet, it should check that the reconstructed word and token streams match the sentence text (as English MWT words are split tokens).
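As a sketch of the simpler of the two checks, something along these lines could flag whitespace other than a single U+0020. The function name `find_irregular_whitespace` is hypothetical and not part of neaten.py:

```python
def find_irregular_whitespace(text):
    """Return (index, description) pairs for whitespace other than single U+0020."""
    problems = []
    for i, ch in enumerate(text):
        if ch == " ":
            # A space directly after another space is a repeated separator.
            if i > 0 and text[i - 1] == " ":
                problems.append((i, "repeated U+0020"))
        elif ch.isspace():
            # Any other whitespace character, e.g. NBSP (U+00A0) or a tab.
            problems.append((i, f"U+{ord(ch):04X}"))
    return problems
```

For a sentence whose text is stored in `sent.metadata['text']`, a non-empty result from `find_irregular_whitespace(sent.metadata['text'])` would indicate that the sentence needs a closer look.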