Bergvca / string_grouper

Super Fast String Matching in Python
MIT License
362 stars, 76 forks

Tips for working with large datasets #88

Open ryangdar opened 1 year ago

ryangdar commented 1 year ago

Hi, I'm working with a 200 MB file and calling group_similar_strings, but it takes so long that it never completes (it has been running for several days). I've tried several n-gram sizes with no luck. Do you have any tips for running it on large datasets?

ajinnah commented 1 year ago

I'm having the same issue, with no solution so far. https://github.com/louistsiattalou/tfidf_matcher can handle much larger datasets without getting stuck.