compenguy / ngrammatic

A rust crate providing fuzzy search/string matching using N-grams
MIT License

Serialisation of a corpus #8

Closed claudius108 closed 5 months ago

claudius108 commented 1 year ago

Hi,

Thanks for this very nice library! It recognizes "tattvacintam" very well when the corpus contains "???avaci??am".

My question is whether such a corpus could be stored on disk. I would like to index some strings and use the corpus in the browser, with the library compiled to WebAssembly.

Best regards! Claudius Teodorescu

compenguy commented 1 year ago

It would be possible. The most storage-efficient serialization of a corpus is just the list of words, from which the full corpus can be rebuilt. The most CPU-efficient serialization is the full mappings of words-to-ngrams and ngrams-to-words. This means that implementing the Serialize trait is easy, but implementing the Deserialize trait would require a key translation function.

To solve the Deserialize problem, we would need to make a breaking change that makes the key translation function optional, and add methods to set a key translation function after the fact.
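
The storage-efficient option described above (persist only the word list, rebuild the n-gram mappings on load) can be sketched with the standard library alone. This is a minimal illustration, not ngrammatic's actual `Corpus` type: `MiniCorpus`, its methods, and the one-word-per-line text format are all hypothetical, and padding and similarity scoring are left out.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical miniature n-gram index (NOT ngrammatic's real `Corpus`).
struct MiniCorpus {
    arity: usize,
    words: Vec<String>,
    /// Maps each n-gram to the indices of the words that contain it.
    ngram_to_words: HashMap<String, HashSet<usize>>,
}

impl MiniCorpus {
    fn new(arity: usize) -> Self {
        assert!(arity > 0, "arity must be at least 1");
        MiniCorpus { arity, words: Vec::new(), ngram_to_words: HashMap::new() }
    }

    /// Split a word into overlapping character n-grams (no padding, for brevity).
    fn ngrams(&self, word: &str) -> Vec<String> {
        let chars: Vec<char> = word.chars().collect();
        if chars.len() < self.arity {
            return vec![word.to_string()];
        }
        chars.windows(self.arity).map(|w| w.iter().collect()).collect()
    }

    fn add(&mut self, word: &str) {
        let idx = self.words.len();
        self.words.push(word.to_string());
        for gram in self.ngrams(word) {
            self.ngram_to_words.entry(gram).or_default().insert(idx);
        }
    }

    /// Storage-efficient form: the arity on the first line, then one word per line.
    /// Assumption: words never contain a newline.
    fn to_text(&self) -> String {
        let mut out = self.arity.to_string();
        for w in &self.words {
            out.push('\n');
            out.push_str(w);
        }
        out
    }

    /// Rebuild the full n-gram mapping from the serialized word list.
    fn from_text(text: &str) -> Option<Self> {
        let mut lines = text.lines();
        let arity: usize = lines.next()?.parse().ok()?;
        if arity == 0 {
            return None;
        }
        let mut corpus = MiniCorpus::new(arity);
        for w in lines {
            corpus.add(w);
        }
        Some(corpus)
    }

    /// Words sharing at least one n-gram with the query, in insertion order.
    fn candidates(&self, query: &str) -> Vec<&str> {
        let mut hits: HashSet<usize> = HashSet::new();
        for gram in self.ngrams(query) {
            if let Some(ids) = self.ngram_to_words.get(&gram) {
                hits.extend(ids);
            }
        }
        let mut idx: Vec<usize> = hits.into_iter().collect();
        idx.sort_unstable();
        idx.into_iter().map(|i| self.words[i].as_str()).collect()
    }
}
```

The trade-off is the one named in the comment: the file on disk stays small, but every load pays the cost of recomputing the n-grams, which may matter for a large corpus loaded in the browser.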

claudius108 commented 1 year ago

Thank you!

I have a list of words with question marks instead of diacritics. I will test the first approach, if the list is not that large, and I will let you know about the results.

All the best, Claudius


LucaCappelletti94 commented 5 months ago

I have implemented serde serialization/deserialization by swapping the boxed function for chainable key transformers. See pull request #10.
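
To illustrate why replacing a boxed function with data-like transformers helps: a boxed closure is opaque and cannot be written to disk, whereas a list of named transformation steps is plain data that can be saved and restored. The sketch below is only an assumption about the general idea, not the design used in pull request #10; `KeyTransform`, `KeyChain`, and the comma-separated encoding are hypothetical names chosen for this example, and a real implementation would presumably derive the serde traits instead of hand-rolling `encode`/`decode`.

```rust
/// Hypothetical key transformers expressed as plain data instead of a boxed closure.
#[derive(Debug, Clone, PartialEq)]
enum KeyTransform {
    Lowercase,
    Trim,
    /// Drop every non-ASCII character.
    AsciiOnly,
}

impl KeyTransform {
    fn apply(&self, key: &str) -> String {
        match self {
            KeyTransform::Lowercase => key.to_lowercase(),
            KeyTransform::Trim => key.trim().to_string(),
            KeyTransform::AsciiOnly => key.chars().filter(|c| c.is_ascii()).collect(),
        }
    }

    fn encode(&self) -> &'static str {
        match self {
            KeyTransform::Lowercase => "lowercase",
            KeyTransform::Trim => "trim",
            KeyTransform::AsciiOnly => "ascii_only",
        }
    }

    fn decode(name: &str) -> Option<Self> {
        match name {
            "lowercase" => Some(KeyTransform::Lowercase),
            "trim" => Some(KeyTransform::Trim),
            "ascii_only" => Some(KeyTransform::AsciiOnly),
            _ => None,
        }
    }
}

/// An ordered chain of transformers, applied left to right.
#[derive(Debug, Clone, PartialEq)]
struct KeyChain(Vec<KeyTransform>);

impl KeyChain {
    fn new() -> Self {
        KeyChain(Vec::new())
    }

    /// Append one more step; returns `self` so that calls can be chained.
    fn then(mut self, step: KeyTransform) -> Self {
        self.0.push(step);
        self
    }

    fn apply(&self, key: &str) -> String {
        self.0.iter().fold(key.to_string(), |acc, step| step.apply(&acc))
    }

    /// Serialize the chain as comma-separated step names.
    fn encode(&self) -> String {
        self.0.iter().map(|s| s.encode()).collect::<Vec<_>>().join(",")
    }

    /// Restore a chain; returns `None` if any step name is unknown.
    fn decode(text: &str) -> Option<Self> {
        if text.is_empty() {
            return Some(KeyChain::new());
        }
        text.split(',')
            .map(KeyTransform::decode)
            .collect::<Option<Vec<_>>>()
            .map(KeyChain)
    }
}
```

For example, `KeyChain::new().then(KeyTransform::Trim).then(KeyTransform::Lowercase)` encodes to `trim,lowercase`, and decoding that string yields an equal chain, so the key translation survives a round trip alongside the corpus data.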

claudius108 commented 5 months ago

Thank you!

LucaCappelletti94 commented 5 months ago

I just finished adding the necessary extra bells and whistles to trie-rs (https://github.com/LucaCappelletti94/trie-rs), and I will shortly be testing how much memory we can save by switching from the hashmap to that. Will keep you posted. Until @compenguy considers merging my pull request, you can use my fork here: https://github.com/LucaCappelletti94/ngrammatic/tree/master

claudius108 commented 5 months ago

Hi!

I think this could also be useful: https://crates.io/crates/fst

I am using it for browser-based search engines. It allows fast regex search, prefix search, and more.

Claudius
