Fix tokenizer for reuters dataset

castorini / castor

PyTorch deep learning models for text processing

http://castor.ai/

Apache License 2.0


Fix tokenizer for reuters dataset #153

Open Ashutosh-Adhikari opened 5 years ago

Ashutosh-Adhikari commented 5 years ago

We need to remove a few characters (such as ? and !) from sentences during tokenization. In other words, a few more characters should be treated as delimiters.

achyudh commented 5 years ago

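Take a look at datasets/reuters.py. The first regular expression in clean_string replaces every character not listed in its character class with a space, so removing the special characters from that class should do what you want.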

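import re

def clean_string(string):
    """
    Performs tokenization and string cleaning for the Reuters dataset
    """
    # Replace every character outside the allowed set with a space
    string = re.sub(r"[^A-Za-z0-9(),!?\'`]", " ", string)
    # Collapse runs of two or more whitespace characters into one space
    string = re.sub(r"\s{2,}", " ", string)
    return string.lower().strip().split()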
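
A minimal sketch of that change, assuming only ? and ! need to be dropped (the issue names just these two as examples, so the final list is a guess). The name clean_string_strict is hypothetical and is used here only to keep the sketch separate from the existing clean_string in datasets/reuters.py.

```python
import re


def clean_string_strict(string):
    """
    Hypothetical variant of clean_string: '?' and '!' are no longer in the
    allowed set, so they are replaced by spaces and act as delimiters.
    """
    # Same character class as the original, minus '!' and '?' (assumption:
    # these are the only characters that need to go).
    string = re.sub(r"[^A-Za-z0-9(),'`]", " ", string)
    # Collapse the runs of whitespace left behind by the substitution.
    string = re.sub(r"\s{2,}", " ", string)
    return string.lower().strip().split()


print(clean_string_strict("Oil prices rise! Will OPEC respond?"))
# → ['oil', 'prices', 'rise', 'will', 'opec', 'respond']
```

Parentheses, commas, apostrophes and backticks are still kept and stay attached to neighbouring words, exactly as in the original function. Dropping any of them from the class would turn them into delimiters in the same way.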