Living-with-machines / DeezyMatch

A Flexible Deep Learning Approach to Fuzzy String Matching
https://living-with-machines.github.io/DeezyMatch/
Other
139 stars 34 forks source link

Test Tokenizer does not handle n-gram #109

Closed YuhengHuang42 closed 2 years ago

YuhengHuang42 commented 2 years ago

Hi, first of all, thanks for your great work!

During using this tool, I found two places that are confusing to me.

In https://github.com/Living-with-machines/DeezyMatch/blob/6515790746589095e8430970b766deaf5499437c/DeezyMatch/data_processing.py#L105 preprocess for n-gram is done. However, for https://github.com/Living-with-machines/DeezyMatch/blob/6515790746589095e8430970b766deaf5499437c/DeezyMatch/data_processing.py#L173 there is no similar preprocessing.

As a result, if you enable n-gram in the first place, there will be an error at inference time. (All samples ignored because the vocabulary is based on n-gram)


Also, the vector generation process is a little confusing. Because the information for the second column and label column should not be required. However, https://github.com/Living-with-machines/DeezyMatch/blob/6515790746589095e8430970b766deaf5499437c/DeezyMatch/rnn_networks.py#L649 the forward function seems to require two inputs.

kasra-hosseini commented 2 years ago

Hi and thanks so much @YuhengHuang42 for reporting these. We will address both issues/suggestions by this Friday. (And sorry for the delay, I just got back from holiday).

kasra-hosseini commented 2 years ago

Hi @YuhengHuang42 , thank you again for reporting the issue with n-grams and for your suggestions. It took us a bit longer than planned, but we just released v1.3.0 in which:

kasra-hosseini commented 2 years ago

@YuhengHuang42 Could you please take a look at the new version and let us know if there are any other issues? Thank you!

YuhengHuang42 commented 2 years ago

@YuhengHuang42 Could you please take a look at the new version and let us know if there are any other issues? Thank you!

Hi, thanks for following up on this issue. I have tested the code on my server, everything seems to work now. Except there are two problems:

  1. In the config file(.yaml)
gru_lstm:
  main_architecture: "gru"    # rnn, gru, lstm
  mode:    # Tokenization mode
    token_sep: "default"
    prefix_suffix: ["|", "|"]

We need to add these two lines, otherwise, there might be KeyError. But I think this is expected behavior.

  1. For one_column_inp:

It seems for now the one-column insertion is done in the DeezyMatch/data_processing.py file. This is done by:

tmp_split_row.insert(1, "tmp")

However, there might be some special cases that "t", "m", "p" are not in the vocabulary. So, in the end, there might still be some problems. But if the target NLP task is in English, I think this is also OK.

kasra-hosseini commented 2 years ago

Hi, thanks for your quick test and for your comments.

We need to add these two lines, otherwise, there might be KeyError. But I think this is expected behavior.

there might be some special cases that "t", "m", "p" are not in the vocabulary. So, in the end, there might still be some problems. But if the target NLP task is in English, I think this is also OK.

kasra-hosseini commented 2 years ago

@YuhengHuang42 What do you think about this solution: https://github.com/Living-with-machines/DeezyMatch/pull/118

YuhengHuang42 commented 2 years ago

@YuhengHuang42 What do you think about this solution: #118

Looks good to me :)

kasra-hosseini commented 2 years ago

👍 Great. We will do some more tests today and will merge the PR.

kasra-hosseini commented 2 years ago

Solved in v1.3.1. I close this, but of course, please feel free to re-open or open a new issue if needed.