Living-with-machines / DeezyMatch

A Flexible Deep Learning Approach to Fuzzy String Matching
https://living-with-machines.github.io/DeezyMatch/

Define word token separators in the input file #78

Closed: mcollardanuy closed this issue 2 years ago

mcollardanuy commented 4 years ago

At the moment, when the "word" tokenization mode is selected in the input file, words are tokenized using Python's built-in .split() function, which splits only on whitespace. It would be very useful to allow the user to specify which additional characters should be treated as word delimiters (e.g. "Brough-Ferry" is currently tokenized as ["Brough-Ferry"] instead of ["Brough", "Ferry"], which would be the result if "-" were also specified as a word delimiter).

Allow the user to define word token separators in the input file.

You will need to change the following code or files:

  1. string_split function in utils.py (see here); a possible extension is sketched after this list:

    
      # ------------------- string_split --------------------
      def string_split(x, tokenize=["char"], min_gram=1, max_gram=3):
        """
        Split a string using various methods.
        min_gram and max_gram are used only if "ngram" is in tokenize
        """
        tokenized_str = []
        if "char" in tokenize:
          tokenized_str += [sub_x for sub_x in x]

        if "ngram" in tokenize:
          for ngram in range(min_gram, max_gram+1):
            tokenized_str += [x[i:i+ngram] for i in range(len(x)-ngram+1)]

        if "word" in tokenize:
          tokenized_str += x.split()

        return tokenized_str
  2. data_processing.py, lines 105-113 (see here); an updated call is sketched after this list:

     cprint('[INFO]', bc.dgreen, "-- create vocabulary")
     dataset_split["s1_unicode"] = dataset_split["s1_unicode"].apply(lambda x: string_split(x, tokenize=mode["tokenize"], min_gram=mode["min_gram"], max_gram=mode["max_gram"]))
     dataset_split["s2_unicode"] = dataset_split["s2_unicode"].apply(lambda x: string_split(x, tokenize=mode["tokenize"], min_gram=mode["min_gram"], max_gram=mode["max_gram"]))
  3. The mode section in the input file (see here); a possible new key is sketched after this list:

     mode:    # Tokenization mode
       # choices: "char", "ngram", "word"
       # for example: tokenize: ["char", "ngram", "word"] or ["char", "word"]
       tokenize: ["char"]
       # ONLY if "ngram" is selected in tokenize, the following args will be used:
       min_gram: 2
       max_gram: 3
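
For step 1, here is a minimal sketch of how string_split could be extended, assuming a new, hypothetically named token_sep argument (a list of extra separator characters); this is not necessarily what was merged. The "word" branch would split on whitespace plus any user-defined separators via re.split:

    import re

    # ---------------- string_split (sketch) ----------------
    def string_split(x, tokenize=["char"], min_gram=1, max_gram=3, token_sep=None):
      """
      Split a string using various methods.
      min_gram and max_gram are used only if "ngram" is in tokenize.
      token_sep (hypothetical) is an optional list of extra word separators.
      """
      tokenized_str = []
      if "char" in tokenize:
        tokenized_str += [sub_x for sub_x in x]

      if "ngram" in tokenize:
        for ngram in range(min_gram, max_gram+1):
          tokenized_str += [x[i:i+ngram] for i in range(len(x)-ngram+1)]

      if "word" in tokenize:
        if token_sep:
          # split on whitespace plus any user-defined separators,
          # dropping empty tokens produced by adjacent separators
          pattern = "|".join([re.escape(sep) for sep in token_sep] + [r"\s"])
          tokenized_str += [tok for tok in re.split(pattern, x) if tok]
        else:
          tokenized_str += x.split()

      return tokenized_str

For example, string_split("Brough-Ferry", tokenize=["word"], token_sep=["-"]) would then return ["Brough", "Ferry"].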
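
For step 2, the two apply calls in data_processing.py could then forward the new option, reading it from the mode section with a default of no extra separators (again a sketch, assuming the hypothetical token_sep key):

    cprint('[INFO]', bc.dgreen, "-- create vocabulary")
    dataset_split["s1_unicode"] = dataset_split["s1_unicode"].apply(
      lambda x: string_split(x, tokenize=mode["tokenize"],
                             min_gram=mode["min_gram"], max_gram=mode["max_gram"],
                             # mode.get returns None when token_sep is not set,
                             # so existing input files keep working unchanged
                             token_sep=mode.get("token_sep")))
    dataset_split["s2_unicode"] = dataset_split["s2_unicode"].apply(
      lambda x: string_split(x, tokenize=mode["tokenize"],
                             min_gram=mode["min_gram"], max_gram=mode["max_gram"],
                             token_sep=mode.get("token_sep")))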
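
For step 3, the mode section of the input file might then gain a new key along these lines (token_sep is a hypothetical name, and the separators shown are only examples):

    mode:    # Tokenization mode
      # choices: "char", "ngram", "word"
      tokenize: ["char", "word"]
      # ONLY if "ngram" is selected in tokenize, the following args will be used:
      min_gram: 2
      max_gram: 3
      # hypothetical new key: extra characters treated as word boundaries
      # (in addition to whitespace) when "word" is in tokenize
      token_sep: ["-", "'"]
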
mcollardanuy commented 2 years ago

Solved in https://github.com/Living-with-machines/DeezyMatch/pull/111.