Fix tokenizer - Githubissues

Living-with-machines / DeezyMatch

A Flexible Deep Learning Approach to Fuzzy String Matching

https://living-with-machines.github.io/DeezyMatch/

Other

139 stars 34 forks source link

Fix tokenizer #111

Closed mcollardanuy closed 2 years ago

mcollardanuy commented 2 years ago

Issues addressed:

Fix bug with test tokenize (https://github.com/Living-with-machines/DeezyMatch/issues/109)
Define word token separators in input file (https://github.com/Living-with-machines/DeezyMatch/issues/78)
Prefix/suffix parameter moved as part of mode, not preprocessing, as it applied to subword tokenization

review-notebook-app[bot] commented 2 years ago

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

mcollardanuy commented 2 years ago

Hi @kasra-hosseini, could you have a look at this PR? These are the main changes:

Fix bug with test tokenize (https://github.com/Living-with-machines/DeezyMatch/issues/109)
Define word token separators in input file (https://github.com/Living-with-machines/DeezyMatch/issues/78)
Prefix/suffix parameter moved as part of mode, not preprocessing, as it applied to subword tokenization

I will start looking into the issue about the second and third columns for vector generation separately.

Let me know if something is not clear. Thank you!

kasra-hosseini commented 2 years ago

@mcollardanuy Could you please take a look at the tests here: https://github.com/Living-with-machines/DeezyMatch/blob/feature/109-improve-tokenizer/DeezyMatch/tests/test_utils.py? Are these the expected outputs?

kasra-hosseini commented 2 years ago

@mcollardanuy I also added another test for https://github.com/Living-with-machines/DeezyMatch/issues/109, see: https://github.com/Living-with-machines/DeezyMatch/blob/feature/109-improve-tokenizer/DeezyMatch/tests/test_pipeline_ngram.py.

The only difference to test_pipeline.py is that we set tokenize: ["ngram"] instead of tokenize: ["char", "word"]

kasra-hosseini commented 2 years ago

If you are happy with https://github.com/Living-with-machines/DeezyMatch/commit/70d4654e004ebcf8cd0d92df32bec8c636061ea5, please go ahead and merge with the develop branch.

mcollardanuy commented 2 years ago

Hi @kasra-hosseini, thanks! There's just one case in which I think we're not getting the expected output. And I would add a test using two work token separators. Other than that, everything looks good to me!

kasra-hosseini commented 2 years ago

@mcollardanuy Great, thanks again. I tried to address your comments. Could you please take a look at them? Is there anything else that you think we should address in this PR?

mcollardanuy commented 2 years ago

Hi @kasra-hosseini, thanks, all looks good!