OpenMined / SyferText

A privacy preserving NLP framework
Apache License 2.0

Bug in tokenization when specifying the tokenizer infixes #182

Closed: AlanAboudib closed this issue 3 years ago

AlanAboudib commented 4 years ago

Description

Tokenization fails with an unexpected error when a '.' is included in the tokenizer's infixes list.

How to Reproduce

# Imports
import syfertext
from syfertext.tokenizer import Tokenizer

import syft as sy
import torch
hook = sy.TorchHook(torch) 
me = hook.local_worker
me.is_client_worker = False

# Create a pipeline using a Language object
nlp_test1 = syfertext.create(pipeline_name = "pipeline_test1")

# Create a tokenizer
tokenizer = Tokenizer(suffixes=['$'],
                      prefixes = ['('],
                      infixes = ['.'],
                      exceptions = {"melo": [{"ORTH":"me"}, {"ORTH":"lo"}]}
                     )

# Add the tokenizer to the pipeline
nlp_test1.set_tokenizer(tokenizer = tokenizer)

doc = nlp_test1("'Dr.doom! is  ({token-izing a python! str$ing$' melo")

for token in doc:
    print(token)

Expected Behavior

Tokenization should complete without raising an error.

Nilanshrajput commented 4 years ago

It only happens with '.': that pattern matches every single character, so it's probably a regex special-character issue. Investigating this.
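
A quick way to see why, using the standard library's re module on its own (independent of SyferText's internals):

import re

text = "Dr.doom"

# Unescaped: '.' is a regex metacharacter that matches any character,
# so an infix pattern of '.' finds a split point at every position.
print(re.findall('.', text))    # ['D', 'r', '.', 'd', 'o', 'o', 'm']

# Escaped: r'\.' matches only the literal period.
print(re.findall(r'\.', text))  # ['.']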

Nilanshrajput commented 4 years ago

@AlanAboudib you need to use it like this:

tokenizer = Tokenizer(suffixes=['$'],
                      prefixes = ['('],
                      infixes = [r'\.'],
                      exceptions = {"melo": [{"ORTH":"me"}, {"ORTH":"lo"}]}
                     )

'.' is a special character in regular expressions, so it has to be escaped to match a literal period.
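
If you'd rather not remember which characters need escaping, the standard library's re.escape does it for you. A minimal sketch, assuming the infix strings are compiled directly as regular expressions (re.escape is plain Python, not a SyferText helper):

import re

# re.escape converts any literal string into a regex-safe pattern.
print(re.escape('.'))  # prints: \.

tokenizer = Tokenizer(suffixes = ['$'],
                      prefixes = ['('],
                      infixes = [re.escape('.')],  # equivalent to r'\.'
                      exceptions = {"melo": [{"ORTH":"me"}, {"ORTH":"lo"}]}
                     )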

Nilanshrajput commented 4 years ago

Although even with the unescaped usage this error shouldn't occur: since '.' matches every character, it should just tokenize every character separately. I am looking into it.
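
One possible direction for a fix, purely as a sketch of the idea rather than SyferText's actual implementation (the helper name and the treat_as_literal flag are hypothetical):

import re

def compile_infix_patterns(infixes, treat_as_literal=False):
    """Compile user-supplied infix strings into regex patterns.

    If treat_as_literal is True, metacharacters like '.' are escaped so
    they only match literally; otherwise the strings are compiled as
    regexes, and an invalid pattern raises a clear error up front.
    """
    patterns = []
    for infix in infixes:
        pattern = re.escape(infix) if treat_as_literal else infix
        try:
            patterns.append(re.compile(pattern))
        except re.error as exc:
            raise ValueError(f"Invalid infix pattern {infix!r}: {exc}") from exc
    return patterns

# Treated literally, '.' only matches actual periods:
compiled = compile_infix_patterns(['.'], treat_as_literal=True)
print(compiled[0].findall("Dr.doom"))  # ['.']

This would turn the silent regex surprise into either an explicit error or the behavior the user most likely intended.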