At the moment, when the "word" tokenization mode is selected in the input file, words are tokenized using the `.split()` function. It would be very useful to allow the user to specify which characters should be considered word delimiters when tokenizing: for example, "Brough-Ferry" is currently tokenized as ["Brough-Ferry"] rather than ["Brough", "Ferry"], which would be the result if "-" were also specified as a word delimiter.

Allow the user to define word token separators in the input file.
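To illustrate, `.split()` with no arguments breaks only on whitespace, so a hyphenated name stays as a single token:

```python
>>> "Brough-Ferry".split()
['Brough-Ferry']
>>> "Brough Ferry".split()
['Brough', 'Ferry']
```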
You will need to change the following code or files: the `string_split` function in `utils.py`, `data_processing.py` (lines 105-113), and the `mode` section in the input file.

The `string_split` function in `utils.py`:
```python
# ------------------- string_split --------------------
def string_split(x, tokenize=["char"], min_gram=1, max_gram=3):
    """
    Split a string using various methods.
    min_gram and max_gram are used only if "ngram" is in tokenize.
    """
    tokenized_str = []
    if "char" in tokenize:
        tokenized_str += [sub_x for sub_x in x]
    if "ngram" in tokenize:
        for ngram in range(min_gram, max_gram + 1):
            tokenized_str += [x[i:i + ngram] for i in range(len(x) - ngram + 1)]
    if "word" in tokenize:
        tokenized_str += x.split()
    return tokenized_str
```
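A minimal sketch of what the change could look like, assuming a new `word_delimiters` argument (the name, the default, and the regex-based splitting are assumptions, not existing code):

```python
import re

def string_split(x, tokenize=["char"], min_gram=1, max_gram=3,
                 word_delimiters=None):
    """
    Split a string using various methods.
    min_gram and max_gram are used only if "ngram" is in tokenize.
    word_delimiters (hypothetical) is a string of extra characters,
    e.g. "-", treated as word boundaries in addition to whitespace.
    """
    tokenized_str = []
    if "char" in tokenize:
        tokenized_str += [sub_x for sub_x in x]
    if "ngram" in tokenize:
        for ngram in range(min_gram, max_gram + 1):
            tokenized_str += [x[i:i + ngram] for i in range(len(x) - ngram + 1)]
    if "word" in tokenize:
        if word_delimiters:
            # Split on whitespace plus the user-supplied delimiter characters.
            pattern = "[\\s" + re.escape(word_delimiters) + "]+"
            tokenized_str += [tok for tok in re.split(pattern, x) if tok]
        else:
            tokenized_str += x.split()
    return tokenized_str
```

With this, `string_split("Brough-Ferry", tokenize=["word"], word_delimiters="-")` would return `["Brough", "Ferry"]`.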
The `mode` section in the input file:

```yaml
mode:  # Tokenization mode
  # choices: "char", "ngram", "word"
  # for example: tokenize: ["char", "ngram", "word"] or ["char", "word"]
  tokenize: ["char"]
  # ONLY if "ngram" is selected in tokenize, the following args will be used:
  min_gram: 2
  max_gram: 3
```
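If this feature is added, the input file could carry a matching option; a minimal sketch, assuming a new `word_delimiters` key (the name is a suggestion, not an existing option):

```yaml
mode:  # Tokenization mode
  tokenize: ["word"]
  # Hypothetical new option: extra characters treated as word boundaries
  # in addition to whitespace, e.g. "Brough-Ferry" -> ["Brough", "Ferry"]
  word_delimiters: "-"
```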