Raldir / FEVEROUS

Repository for Fact Extraction and VERification Over Unstructured and Structured information (FEVEROUS), accepted to the NeurIPS 2021 Datasets and Benchmarks track and used for the FEVER Workshop Shared Task at EMNLP 2021.
Apache License 2.0

Make sure that annotation titles do not need to be NFD normalized and cleaned by systems themselves #2

Closed. Raldir closed this issue 3 years ago

Raldir commented 3 years ago

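Some titles in the annotations currently need to be cleaned and NFD normalized before they match the titles in the wiki dump. This should be fixed in the annotations themselves. The cleanup that systems have to apply right now: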

import unicodedata
from urllib.parse import unquote  # needed by clean_title below

from cleantext import clean

def clean_title(text):
    text = unquote(text)
    text = clean(
        text.strip(),
        fix_unicode=True,               # fix various unicode errors
        to_ascii=False,                 # transliterate to closest ASCII representation
        lower=False,                    # lowercase text
        no_line_breaks=False,           # fully strip line breaks as opposed to only normalizing them
        no_urls=True,                   # replace all URLs with a special token
        no_emails=False,                # replace all email addresses with a special token
        no_phone_numbers=False,         # replace all phone numbers with a special token
        no_numbers=False,               # replace all numbers with a special token
        no_digits=False,                # replace all digits with a special token
        no_currency_symbols=False,      # replace all currency symbols with a special token
        no_punct=False,                 # remove punctuations
        replace_with_url="<URL>",
        replace_with_email="<EMAIL>",
        replace_with_phone_number="<PHONE>",
        replace_with_number="<NUMBER>",
        replace_with_digit="0",
        replace_with_currency_symbol="<CUR>",
        lang="en",                      # set to 'de' for German special handling
    )
    return text

def get_wiki_page_from_title(page, db):
    page = clean_title(page)
    page = unicodedata.normalize('NFD', page)
    lines = db.get_doc_json(page)
    wiki_page = WikiPage(page, lines)
    return pa
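
To illustrate why the NFD step matters, here is a minimal sketch using only the standard library. The title "Beyoncé" and the `toy_db` dictionary are hypothetical stand-ins, not taken from the annotations or the actual database class. The sketch assumes the wiki dump is keyed by NFD-normalized titles, as the `unicodedata.normalize('NFD', page)` call above suggests.

```python
import unicodedata

# Hypothetical title, chosen only because it contains an accented character.
title_nfc = unicodedata.normalize("NFC", "Beyonc\u00e9")   # 'é' as one code point
title_nfd = unicodedata.normalize("NFD", title_nfc)        # 'e' + combining accent

# Toy stand-in for the wiki dump, assumed to be keyed by NFD titles.
toy_db = {title_nfd: ["line 0", "line 1"]}


def lookup(title, db):
    """Normalize the title to NFD before looking it up."""
    return db.get(unicodedata.normalize("NFD", title))


print(title_nfc == title_nfd)           # the two forms are different strings
print(len(title_nfc), len(title_nfd))   # NFD has one extra code point
print(toy_db.get(title_nfc))            # raw NFC lookup misses
print(lookup(title_nfc, toy_db))        # normalized lookup hits
```

Both strings render identically, so a mismatch like this is easy to overlook when a lookup silently returns nothing.
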
Raldir commented 3 years ago

Fixed with updated annotation files on the shared task page.
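
For anyone who wants to double-check their copy of the annotation files, a rough sketch of a sanity check follows. The function name `titles_needing_normalization` and the example titles are hypothetical, and how titles are read from the annotation files is left out, since that depends on the file layout.

```python
import unicodedata


def titles_needing_normalization(titles):
    """Return the titles that change under NFD normalization."""
    return [t for t in titles if unicodedata.normalize("NFD", t) != t]


# Hypothetical titles: one plain ASCII, one in NFC form, one already in NFD form.
nfc_title = unicodedata.normalize("NFC", "M\u00e1laga")
nfd_title = unicodedata.normalize("NFD", "M\u00e1laga")
examples = ["Paris", nfc_title, nfd_title]

# Only the NFC-form title should be reported.
print(titles_needing_normalization(examples))
```

An empty result for all annotation titles would mean systems no longer need to normalize them on their own.
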