andylokandy / simsearch-rs

A simple and lightweight fuzzy search engine that works in memory, searching for similar strings (a pun here).
MIT License
167 stars 25 forks

too slow insert on large set of data #11

Closed estin closed 3 years ago

estin commented 3 years ago

Hi!

I tried to parse cities500.text (196648 cities), from the free data at http://download.geonames.org/export/dump/, and build an engine from it, and found that the process is very slow...

https://github.com/andylokandy/simsearch-rs/blob/cade2652f78957b8f587ef0d731489000dae7f12/src/lib.rs#L136

Is it really necessary to delete the key on each insert?

And delete is itself too slow: https://github.com/andylokandy/simsearch-rs/blob/cade2652f78957b8f587ef0d731489000dae7f12/src/lib.rs#L262

Why not use a HashMap to map id to id_num?

When many items are inserted into the engine, each insert searches for the id_num across the whole list of previously inserted keys.
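To illustrate the cost difference, here is a minimal sketch. It is not the crate's actual code: `VecIndex`, `MapIndex`, and their methods are hypothetical names invented for this example, and ids are assumed to be plain strings. It only contrasts a linear scan over all stored ids with a HashMap lookup from id to id_num.

```rust
use std::collections::HashMap;

/// Hypothetical index that finds an id by scanning every stored id.
struct VecIndex {
    ids: Vec<String>,
}

impl VecIndex {
    fn new() -> Self {
        VecIndex { ids: Vec::new() }
    }

    /// Returns the id_num for `id`, inserting it if missing.
    /// Each call is O(n), so inserting n items costs O(n^2) overall.
    fn insert(&mut self, id: &str) -> usize {
        if let Some(pos) = self.ids.iter().position(|existing| existing.as_str() == id) {
            return pos;
        }
        self.ids.push(id.to_string());
        self.ids.len() - 1
    }
}

/// Hypothetical index that also keeps a HashMap from id to id_num.
struct MapIndex {
    ids: Vec<String>,
    id_nums: HashMap<String, usize>,
}

impl MapIndex {
    fn new() -> Self {
        MapIndex { ids: Vec::new(), id_nums: HashMap::new() }
    }

    /// Same contract as `VecIndex::insert`, but the lookup is a hash
    /// lookup, so each call takes expected constant time.
    fn insert(&mut self, id: &str) -> usize {
        if let Some(&num) = self.id_nums.get(id) {
            return num;
        }
        let num = self.ids.len();
        self.ids.push(id.to_string());
        self.id_nums.insert(id.to_string(), num);
        num
    }
}

fn main() {
    let mut slow = VecIndex::new();
    let mut fast = MapIndex::new();
    // Both indexes assign the same id_num to the same sequence of ids.
    for id in ["paris", "berlin", "paris", "rome"] {
        assert_eq!(slow.insert(id), fast.insert(id));
    }
    println!("stored {} distinct ids", fast.ids.len());
}
```

With a data set of roughly 200,000 entries, the scan-based version performs on the order of billions of string comparisons in the worst case, while the map-based version performs one hash lookup per insert. The trade-off is the extra memory for the map and a second copy of each id.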

I can make a PR to fix this, but I'm not sure that I understand this behavior correctly.

Sorry for my poor English.

andylokandy commented 3 years ago

@estin Yes, it is indeed a silly implementation. When I created this crate, I only intended to search for words in 100 sentences, so it hadn't been a problem. I really appreciate hearing that you would help by writing up the PR! I'd be happy to see that improvement.