At the moment, when the "word" tokenization mode is selected in the input file, words are tokenized using the `.split()` function. It would be very useful to allow the user to specify which characters should be considered word delimiters when tokenizing: for example, "Brough-Ferry" is currently tokenized as ["Brough-Ferry"] rather than ["Brough", "Ferry"], which would be the result if "-" were also specified as a word delimiter.

Allow the user to define word token separators in the input file.
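To illustrate, `.split()` with no arguments breaks only on whitespace, so a hyphenated name stays as a single token:

```python
>>> "Brough-Ferry".split()
['Brough-Ferry']
>>> "Brough Ferry".split()
['Brough', 'Ferry']
```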
You will need to change the following code or files: the `string_split` function in `utils.py`, `data_processing.py` (lines 105-113), and the `mode` section in the input file.

The `string_split` function in `utils.py`:
```python
# ------------------- string_split --------------------
def string_split(x, tokenize=["char"], min_gram=1, max_gram=3):
    """
    Split a string using various methods.
    min_gram and max_gram are used only if "ngram" is in tokenize.
    """
    tokenized_str = []
    if "char" in tokenize:
        tokenized_str += [sub_x for sub_x in x]
    if "ngram" in tokenize:
        for ngram in range(min_gram, max_gram + 1):
            tokenized_str += [x[i:i + ngram] for i in range(len(x) - ngram + 1)]
    if "word" in tokenize:
        tokenized_str += x.split()
    return tokenized_str
```
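A minimal sketch of what the change could look like, assuming a new `word_delimiters` argument (the name, the default, and the regex-based splitting are assumptions, not existing code):

```python
import re

def string_split(x, tokenize=["char"], min_gram=1, max_gram=3,
                 word_delimiters=None):
    """
    Split a string using various methods.
    min_gram and max_gram are used only if "ngram" is in tokenize.
    word_delimiters (hypothetical) is a string of extra characters,
    e.g. "-", treated as word boundaries in addition to whitespace.
    """
    tokenized_str = []
    if "char" in tokenize:
        tokenized_str += [sub_x for sub_x in x]
    if "ngram" in tokenize:
        for ngram in range(min_gram, max_gram + 1):
            tokenized_str += [x[i:i + ngram] for i in range(len(x) - ngram + 1)]
    if "word" in tokenize:
        if word_delimiters:
            # Split on whitespace plus the user-supplied delimiter characters.
            pattern = "[\\s" + re.escape(word_delimiters) + "]+"
            tokenized_str += [tok for tok in re.split(pattern, x) if tok]
        else:
            tokenized_str += x.split()
    return tokenized_str
```

With this, `string_split("Brough-Ferry", tokenize=["word"], word_delimiters="-")` would return `["Brough", "Ferry"]`.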
The `mode` section in the input file:

```yaml
mode:  # Tokenization mode
  # choices: "char", "ngram", "word"
  # for example: tokenize: ["char", "ngram", "word"] or ["char", "word"]
  tokenize: ["char"]
  # ONLY if "ngram" is selected in tokenize, the following args will be used:
  min_gram: 2
  max_gram: 3
```
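If this feature is added, the input file could carry a matching option; a minimal sketch, assuming a new `word_delimiters` key (the name is a suggestion, not an existing option):

```yaml
mode:  # Tokenization mode
  tokenize: ["word"]
  # Hypothetical new option: extra characters treated as word boundaries
  # in addition to whitespace, e.g. "Brough-Ferry" -> ["Brough", "Ferry"]
  word_delimiters: "-"
```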