compenguy / ngrammatic

A Rust crate providing fuzzy search/string matching using N-grams
MIT License

Adding limit parameter + serde + updated readme + const arity #10

Closed by LucaCappelletti94 2 months ago

LucaCappelletti94 commented 5 months ago

With this pull request:

  1. It takes care of issue #9
  2. It adds some more documentation
  3. It integrates a GitHub Actions-based clippy and test suite
  4. It integrates badges, and the README is now part of the documentation, so the README cannot contain examples that do not work. Note that the GitHub Actions badge will only start to work once you merge this pull request, as it points to an action that does not yet exist.

Cheers!
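To illustrate the kind of behavior a limit parameter typically enables, here is a hypothetical, self-contained sketch (the `top_matches` function and its signature are illustrative assumptions, not the crate's actual API): cap the number of matches returned after sorting by similarity score.

```rust
// Hypothetical sketch of a `limit` parameter on fuzzy-search results:
// sort candidates by descending score, then keep at most `limit` of them.
fn top_matches(mut scored: Vec<(String, f32)>, limit: usize) -> Vec<(String, f32)> {
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(limit);
    scored
}

fn main() {
    let scored = vec![
        ("tomato".to_string(), 0.9),
        ("tomahto".to_string(), 0.7),
        ("potato".to_string(), 0.4),
    ];
    let top = top_matches(scored, 2);
    assert_eq!(top.len(), 2);
    assert_eq!(top[0].0, "tomato"); // best match first
}
```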

LucaCappelletti94 commented 5 months ago

Now I have also added support for serde.

LucaCappelletti94 commented 5 months ago

And I have now added const generics, similar to what you are doing in this pull request. I have also updated all the tests, which all pass.
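As a rough illustration of why a const-generic arity helps, here is a hypothetical sketch (the `Gram` type and `grams_of` function are invented for this example, not the crate's real types): with the arity fixed at compile time, each gram can live in a stack-allocated array instead of a heap-allocated buffer.

```rust
// Hypothetical sketch of const-generic arity: `ARITY` is a compile-time
// constant, so each gram is a fixed-size array with no heap allocation.
#[derive(Debug, PartialEq, Eq, Hash)]
struct Gram<const ARITY: usize> {
    chars: [char; ARITY],
}

fn grams_of<const ARITY: usize>(text: &str) -> Vec<Gram<ARITY>> {
    let chars: Vec<char> = text.chars().collect();
    chars
        .windows(ARITY)
        .map(|w| {
            let mut buf = ['\0'; ARITY];
            buf.copy_from_slice(w);
            Gram { chars: buf }
        })
        .collect()
}

fn main() {
    // "rust" yields the bigrams "ru", "us", "st".
    let grams = grams_of::<2>("rust");
    assert_eq!(grams.len(), 3);
    assert_eq!(grams[0].chars, ['r', 'u']);
}
```

A nice side effect is that a 0-arity ngram (the gap mentioned below) can be rejected at compile time or via a single `const` assertion, rather than checked at every call site.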

compenguy commented 5 months ago

I'm pretty happy with the changes overall; there are just some minor fixes for outdated comments, and an attempt to close the (unlikely) gap of a 0-arity ngram.

LucaCappelletti94 commented 5 months ago

I am trying to understand which areas requiring a non-trivial amount of memory may be trimmed a bit.

For instance, I believe there is some redundancy in the hashmap strings, such as storing both the text and the padded text. A trie data structure would likely reduce the memory footprint of the hashmap significantly.

As a first step, I am setting up some benchmarks so that I better understand where the optimization is starting from. Do you have any particular insight into what may be trimmed first?

7.875 GB 100.00% ⏺: ngrammatic::Corpus<ngrammatic::key_transformer::LinkedKeyTransformer<ngrammatic::key_transformer::IdentityKeyTransformer, ngrammatic::key_transformer::LowerKeyTransformer>, 2>
   24  B   0.00% ├╴pad_left: ngrammatic::Pad
                 │ ╰╴Variant: Auto
   24  B   0.00% ├╴pad_right: ngrammatic::Pad
                 │ ╰╴Variant: Auto
4.365 GB  55.43% ├╴ngrams: std::collections::hash::map::HashMap<alloc::string::String, ngrammatic::Ngram<2>>
3.510 GB  44.57% ├╴gram_to_words: std::collections::hash::map::HashMap<alloc::string::String, alloc::vec::Vec<alloc::string::String>>
    0  B   0.00% ╰╴key_transformer: ngrammatic::key_transformer::LinkedKeyTransformer<ngrammatic::key_transformer::IdentityKeyTransformer, ngrammatic::key_transformer::LowerKeyTransformer>
    0  B   0.00%   ├╴src: ngrammatic::key_transformer::IdentityKeyTransformer
    0  B   0.00%   ╰╴dst: ngrammatic::key_transformer::LowerKeyTransformer
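The profile above shows the two string-keyed hashmaps dominating memory. One classic way to cut that duplication is string interning: store each word once in an arena and refer to it everywhere else by a small integer index. The `WordArena` type below is a hypothetical sketch of the idea, not code from the crate.

```rust
use std::collections::HashMap;

// Hypothetical sketch of string interning: every distinct word is stored
// exactly once, and all other structures hold cheap u32 ids instead of
// owning duplicate Strings.
#[derive(Default)]
struct WordArena {
    words: Vec<String>,
    index: HashMap<String, u32>,
}

impl WordArena {
    fn intern(&mut self, word: &str) -> u32 {
        if let Some(&id) = self.index.get(word) {
            return id; // already stored: reuse the existing id
        }
        let id = self.words.len() as u32;
        self.words.push(word.to_string());
        self.index.insert(word.to_string(), id);
        id
    }

    fn get(&self, id: u32) -> &str {
        &self.words[id as usize]
    }
}

fn main() {
    let mut arena = WordArena::default();
    let a = arena.intern("corpus");
    let b = arena.intern("corpus");
    assert_eq!(a, b); // duplicates collapse to one allocation
    assert_eq!(arena.get(a), "corpus");
}
```
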

LucaCappelletti94 commented 5 months ago

@compenguy if you are curious, you can take a look at the ongoing benchmarks here: https://github.com/LucaCappelletti94/ngrammatic/tree/master/benchmarks

LucaCappelletti94 commented 5 months ago

We are now down from the initial 7.875 GB to 439.6 MB! There is still a lot of room left to trim.

compenguy commented 5 months ago

> @compenguy if you are curious, you can take a look at the ongoing benchmarks here: https://github.com/LucaCappelletti94/ngrammatic/tree/master/benchmarks

I see benchmark times for building the corpus, but how does this affect lookups? I would generally expect corpus building to be a mostly one-time, startup-type task where bandwidth matters, while lookups tend to be tied to user interaction and are more latency-sensitive.
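For a quick sense of lookup latency before full benchmarks exist, a standard-library timing sketch is often enough. Everything below is hypothetical: `search` is a stand-in dummy scan, not the crate's lookup function, and `Instant`-based timing is far cruder than a proper harness like criterion.

```rust
use std::time::Instant;

// Hypothetical micro-benchmark sketch: time a single lookup with only
// the standard library. `search` is a dummy linear scan standing in for
// the real corpus lookup, so the example is self-contained.
fn search<'a>(corpus: &'a [String], query: &str) -> Option<&'a String> {
    corpus.iter().find(|w| w.contains(query))
}

fn main() {
    let corpus: Vec<String> = (0..10_000).map(|i| format!("word{i}")).collect();
    let start = Instant::now();
    let hit = search(&corpus, "word9999");
    let elapsed = start.elapsed();
    assert!(hit.is_some());
    println!("lookup took {elapsed:?}");
}
```

For real latency numbers one would repeat the lookup many times, discard warm-up iterations, and report a distribution rather than a single sample.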

LucaCappelletti94 commented 5 months ago

I am first focusing on getting the file size down to something manageable. The queries are also much faster now, but I still have to run a comprehensive benchmark for them. Right now I am experimenting with a webgraph data structure; we will see how far down we can get.

LucaCappelletti94 commented 4 months ago

Hi @compenguy! The optimization is done, I think. Do check out the search time benchmarks here and the memory benchmarks here. I only need to rewrite the README describing a bit more of the new stack, and then it is ready.

LucaCappelletti94 commented 4 months ago

README is now updated. Let me know what you think of the whole thing!

compenguy commented 4 months ago

It looks like there are merge conflicts that need resolving. I haven't yet looked at the conflicts, but the commit history shows the evolution of your thinking, with your mistakes and missteps along the way, which tends to make resolving conflicts a lot more work.

I'm going to ask that you either rebase your work to resolve the conflicts, or rework your commit history so that each commit reflects a single logical change, as you mean for it to be in the final state. I would prefer the latter; it's much kinder not just to code reviewers but to anyone who needs to read through the code history. But that's a lot of work, and I didn't advise you of this preference earlier on.

LucaCappelletti94 commented 4 months ago

Apologies for the merge conflicts; I underestimated their likelihood because GitHub did not report any conflicts on the pull request page, but there were most definitely some once I started the rebase.

The rebase is now complete. However, restructuring the commit history would be quite labour-intensive, so I've avoided it for now.

I've organized the initial phase into logical-change commits. The optimization stage, however, necessitated a more experimental approach. You can find detailed notes on my experiments in the benchmarks.

compenguy commented 2 months ago

I merged this PR to make it easier for you to work on. There are still issues with it, including test failures:

failures:
    src/corpus_par_from.rs - corpus_par_from::Corpus<KS,NG,K,WeightedBitFieldBipartiteGraph>::par_from (line 111)
    src/lib.rs - (line 103)
    src/lib.rs - (line 158)
    src/lib.rs - (line 187)
    src/lib.rs - (line 217)
    src/lib.rs - (line 246)
    src/lib.rs - (line 280)
    src/lib.rs - (line 301)
    src/lib.rs - (line 343)
    src/lib.rs - (line 375)
    src/lib.rs - (line 471)
    src/lib.rs - (line 580)
    src/lib.rs - (line 79)

test result: FAILED. 119 passed; 13 failed; 0 ignored; 0 measured; 0 filtered out; finished in 11.39s

Once things are clean and looking good, I'll publish it to crates.io. If this goes silent for like 3 months, then I'll probably revert this.

LucaCappelletti94 commented 1 month ago

Ok! I will look into the issues mentioned.