UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

NBSP in sentence text. #418

Closed by rhdunn 10 months ago

rhdunn commented 12 months ago

The sentence newsgroup-groups.google.com_n3td3v_e874a1e5eb995654_ENG_20060120_052200-0011 contains an NBSP (no-break space, U+00A0) character between "have" and "been". As a result, the text metadata does not match the text reconstructed from the token/word stream.

It may be helpful here for neaten.py to check that the sentence text only contains single U+0020 space characters. Better yet, it should check that the reconstructed word and token streams both match the sentence text (both streams can be checked for English because English MWT words are split tokens).
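As a rough sketch of the first check (not existing neaten.py code), something like the following could report any whitespace in the sentence text that is not a single U+0020. The function name find_unexpected_whitespace and the sample string are illustrative assumptions.

```python
import unicodedata

def find_unexpected_whitespace(text):
    """Return (index, description) pairs for whitespace that is not a single U+0020."""
    problems = []
    for i, ch in enumerate(text):
        if ch == ' ':
            # A plain space is fine unless it directly follows another plain space.
            if i > 0 and text[i - 1] == ' ':
                problems.append((i, 'repeated U+0020'))
        elif ch.isspace():
            # Any other Unicode whitespace (NBSP, tab, ...) is reported by code point.
            name = unicodedata.name(ch, 'UNNAMED')
            problems.append((i, f'U+{ord(ch):04X} {name}'))
    return problems

# Illustrative input only; not the full sentence from the treebank.
print(find_unexpected_whitespace('have\u00a0been'))
```

An empty list means the text uses only single U+0020 spaces.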

rhdunn commented 12 months ago

Note: I'm using the following Python code to perform this check; it could be added to neaten.py:

def get_misc(token, attr, default):
    """Return the MISC attribute `attr` of `token`, or `default` if it is absent."""
    misc = token.get('misc')
    if misc is None or attr not in misc:
        return default
    return misc[attr]

def validate_sentence_text(sent, language):
    sent_id = sent.metadata['sent_id']
    token_text = ""
    word_text = ""
    need_space = False
    last_mwt_id = 0
    need_mwt_space = False
    for token in sent:
        if isinstance(token['id'], int):
            if need_space:
                token_text += " "
                word_text += " "
            if last_mwt_id >= token['id']:
                if last_mwt_id == token['id']:
                    need_space = need_mwt_space
                else:
                    need_space = False
            else:
                # Ordinary word outside any MWT range: it is also a surface token.
                need_space = get_misc(token, 'SpaceAfter', 'Yes') == 'Yes'
                token_text += token['form']
            word_text += token['form']
        elif '-' in token['id']:
            if need_space:
                token_text += " "
                word_text += " "
                need_space = False
            # Assumes range ids are (start, '-', end) tuples, so index 2 is the last word id.
            last_mwt_id = token['id'][2]
            need_mwt_space = get_misc(token, 'SpaceAfter', 'Yes') == 'Yes'
            token_text += token['form']

    if 'text' in sent.metadata:
        if token_text != sent.metadata['text']:
            print(f"ERROR: Sentence {sent_id} text does not match the token sequence.")
            print(f"... Expect: {sent.metadata['text']}")
            print(f"... Actual: {token_text}")
        if language == 'en' and word_text != sent.metadata['text']:
            print(f"ERROR: Sentence {sent_id} text does not match the word sequence.")
            print(f"... Expect: {sent.metadata['text']}")
            print(f"... Actual: {word_text}")
    else:
        print(f"ERROR: Sentence {sent_id} is missing sentence text.")

nschneid commented 12 months ago

There are some weird characters in EWT—this is intentional to preserve the exact string from the original text: UniversalDependencies/UD_English-EWT#83

I suppose this no-break space should ideally be recorded on the preceding word token with the SpacesAfter attribute. That would enable reconstruction of the original string.

rhdunn commented 12 months ago

That makes sense. It should be easy to modify the above script to check for and use a SpacesAfter annotation.
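One way this could look, as a rough sketch rather than the actual neaten.py change: a helper that returns the text following a token, preferring SpacesAfter over SpaceAfter. The names decode_spaces and trailing_space are hypothetical, and the escape table (\s, \t, \n, \r, \p, \\) is an assumption based on the backslash conventions other treebanks use for SpacesAfter.

```python
import re

# Assumed escape table for SpacesAfter values; not taken from this repository.
_ESCAPES = {'s': ' ', 't': '\t', 'n': '\n', 'r': '\r', 'p': '|', '\\': '\\'}

def decode_spaces(value):
    """Decode backslash escapes in a SpacesAfter value in a single pass."""
    # Unknown escapes are left untouched rather than guessed at.
    return re.sub(r'\\(.)', lambda m: _ESCAPES.get(m.group(1), m.group(0)), value)

def trailing_space(misc):
    """Return the text that follows a token, given its MISC dict (or None)."""
    if misc:
        if 'SpacesAfter' in misc:
            return decode_spaces(misc['SpacesAfter'])
        if misc.get('SpaceAfter') == 'No':
            return ''
    # Default in CoNLL-U: a single space follows the token.
    return ' '
```

In validate_sentence_text, the need_space flags would then become a pending string that is appended before the next token. Whatever follows the final token is not part of the sentence text, so it would be dropped.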

AngledLuffa commented 10 months ago

https://github.com/UniversalDependencies/UD_English-EWT/pull/463

nschneid commented 10 months ago

To be clear, this is the only sentence in EWT with the NBSP character. It is now reflected in a SpacesAfter feature. Thanks.

nschneid commented 10 months ago

#463 didn't pass validation: a literal space at the end of a line is prohibited. Apparently the other treebanks that specify SpacesAfter use backslash escapes, but I don't know of such an escape for NBSP. So I've put it in quotes.

AngledLuffa commented 10 months ago

Oof, sorry to hear that. Looking at the misc link you posted, it looks like the specific escaping is custom to UD treebanks anyway considering the use of \p, so what if we add a new one? \S for example (although that makes the sequences case sensitive), &nbsp;, \b as the unused character from nbsp...

Allowing the validator to accept NBSP at the end of a token is probably the wrong answer, as it would cause problems for any tool that expects those fields not to contain whitespace (strip() is a common function call).
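To illustrate the concern in Python terms: NBSP counts as whitespace for the built-in string methods, so a literal NBSP at the end of a field is silently lost by a no-argument strip(). The field value below is made up for the demonstration.

```python
nbsp = '\u00a0'
misc_field = 'SpacesAfter=' + nbsp   # hypothetical MISC value ending in a literal NBSP

stripped = misc_field.strip()        # NBSP is whitespace to str.strip(), so it is removed
safe = misc_field.rstrip('\r\n')     # stripping only line terminators keeps the NBSP
words = ('have' + nbsp + 'been').split()  # no-argument split() also splits on NBSP

print(repr(stripped), repr(safe), words)
```

A reader that strips only '\r\n' would preserve the character, but that cannot be assumed of every tool that consumes the files.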

rhdunn commented 10 months ago

NBSP has the Unicode value of U+00A0, so would something like \x00A0 work?

nschneid commented 10 months ago

It sounds like the UniversalDependencies/docs#982 discussion has converged on \u00A0, which is Python-compatible. I'll switch it to that.
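For tools that read the MISC value as raw text, a \uXXXX escape could be handled along these lines. This is a minimal sketch under the assumption that the escape is always a backslash, "u", and exactly four hex digits; decode_unicode_escapes and encode_nbsp are hypothetical names, not part of any UD tool.

```python
import re

def decode_unicode_escapes(value):
    """Replace each \\uXXXX escape (four hex digits) with the character it names."""
    return re.sub(r'\\u([0-9A-Fa-f]{4})',
                  lambda m: chr(int(m.group(1), 16)),
                  value)

def encode_nbsp(value):
    """Write a literal NBSP as the six-character escape \\u00A0."""
    return value.replace('\u00a0', '\\u00A0')

raw = 'SpacesAfter=\\u00A0'            # the text as it would appear in the file
decoded = decode_unicode_escapes(raw)  # now ends with a real U+00A0 character
```

The sketch does not handle an escaped backslash that is followed by "u00A0"; a full decoder would need to scan all escapes in one pass.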