Living-with-machines / DeezyMatch

A Flexible Deep Learning Approach to Fuzzy String Matching
https://living-with-machines.github.io/DeezyMatch/

Feature/50 alias detection on fly #68

Closed kasra-hosseini closed 4 years ago

kasra-hosseini commented 4 years ago

The main contribution of this PR is on-the-fly alias detection.

@mcollardanuy @fedenanni Here is the very first version of on-the-fly alias detection. The design can be improved, but currently I call test_tokenize and test_model at the start to generate temporary vectors for a query string or a list of query strings (see query_vector_gen in utils_candidate_ranker). We then combine these vectors and use them in candidateRanker, as before. After combining the vectors, I remove the temporary directory.
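The temporary-directory workflow described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in this PR: `make_vectors` and `combine_and_rank` are hypothetical callables standing in for the vector-generation step and the combine-and-rank step, and the sketch cleans up only after both steps have finished.

```python
import shutil
import tempfile


def rank_with_temporary_vectors(queries, make_vectors, combine_and_rank):
    """Generate query vectors in a scratch directory, use them, then clean up.

    `make_vectors(queries, out_dir)` and `combine_and_rank(vector_dir)` are
    hypothetical callables; they are NOT part of the DeezyMatch API.
    """
    if isinstance(queries, str):
        queries = [queries]  # accept a single string as well as a list

    tmp_dir = tempfile.mkdtemp(prefix="query_vectors_")
    try:
        make_vectors(queries, tmp_dir)
        return combine_and_rank(tmp_dir)
    finally:
        # Remove the temporary directory even if one of the steps raised.
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

Wrapping the cleanup in `finally` means a failed query does not leave stale vector files behind.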

This design was chosen for compatibility with what we already had. There are still some issues to fix and improvements to make:

path1_combined = os.path.join(query_scenario, "query_fwd_0")
path2_combined = os.path.join(query_scenario, "query_bwd_0")
path_id_combined = os.path.join(query_scenario, "query_indxs_0")

We need to change this so that many temporary query files can be combined, similar to the combineVecs function.
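One possible way to generalise the hard-coded `_0` suffix is to discover every numbered part on disk first. The sketch below is an assumption-laden illustration, not the combineVecs implementation: it assumes the files follow the query_fwd_<n>, query_bwd_<n>, query_indxs_<n> naming shown above, and `list_query_parts` is a hypothetical helper. Loading and concatenating the vectors is left out, since that depends on how they are stored.

```python
import glob
import os
import re


def list_query_parts(query_scenario):
    """Return (fwd, bwd, indxs) path triples, ordered by part number.

    Hypothetical helper: assumes files are named query_fwd_<n>,
    query_bwd_<n> and query_indxs_<n> inside `query_scenario`.
    """
    parts = []
    for fwd in glob.glob(os.path.join(query_scenario, "query_fwd_*")):
        match = re.fullmatch(r"query_fwd_(\d+)", os.path.basename(fwd))
        if match is None:
            continue  # skip files that do not end in a part number
        suffix = match.group(1)
        bwd = os.path.join(query_scenario, f"query_bwd_{suffix}")
        idx = os.path.join(query_scenario, f"query_indxs_{suffix}")
        # Keep only complete triples so the three lists stay aligned.
        if os.path.exists(bwd) and os.path.exists(idx):
            parts.append((int(suffix), fwd, bwd, idx))

    parts.sort()  # numeric order, so part 10 comes after part 2
    return [(fwd, bwd, idx) for _, fwd, bwd, idx in parts]
```

Each triple could then be loaded and appended in order, which keeps the forward vectors, backward vectors, and indices consistent with one another.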

kasra-hosseini commented 4 years ago

@mcollardanuy @fedenanni For testing:

from DeezyMatch import candidate_ranker

# Find candidates
candidates_pd = \
    candidate_ranker(scenario="./combined/test/",
                     query=["mariona", "fede", "kasra"],
                     ranking_metric="conf", 
                     selection_threshold=0.8, 
                     num_candidates=10, 
                     search_size=1000, 
                     output_filename="test_candidates_deezymatch", 
                     pretrained_model_path="./models/finetuned_test001/finetuned_test001.model", 
                     pretrained_vocab_path="./models/finetuned_test001/finetuned_test001.vocab", 
                     number_test_rows=20)

kasra-hosseini commented 4 years ago

@fedenanni Thanks for the review. I have tried to answer all your comments. Could you please take a look? If you are happy with the answers, please mark them as resolved.

mcollardanuy commented 4 years ago

Hi @kasra-hosseini, I'm done with the review. Great additions! The on-the-fly alias detection will be super useful. Let me know if you want to discuss anything, especially regarding directory structures, or there's anything I can help with. Thanks again!

mcollardanuy commented 4 years ago

Hi @kasra-hosseini, all looks good! 👍

kasra-hosseini commented 4 years ago

@mcollardanuy Now, we log the function args in log.txt: https://github.com/Living-with-machines/DeezyMatch/pull/68/commits/414b9f1f13da078fdbcf5a376f16063192cddca1. Could you please take a look?
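For readers who want the general idea without opening the commit, logging a function's arguments to a text file can be sketched as below. This is an illustrative pattern only, not the code in the linked commit: `log_call_args` is a hypothetical decorator, and the log format is an assumption.

```python
import functools
import inspect
from datetime import datetime


def log_call_args(log_path):
    """Hypothetical decorator: append each call's arguments to `log_path`."""
    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Resolve positional, keyword, and default arguments by name.
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            with open(log_path, "a") as fh:
                fh.write(f"{datetime.now().isoformat()} {func.__name__}\n")
                for name, value in bound.arguments.items():
                    fh.write(f"  {name}: {value!r}\n")
            return func(*args, **kwargs)

        return wrapper
    return decorator
```

Recording defaults as well as explicitly passed values makes it possible to reproduce a run from the log alone.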

mcollardanuy commented 4 years ago

> @mcollardanuy Now, we log the function args in log.txt: 414b9f1. Could you please take a look?

That's perfect, thanks!