Daniel-Liu-c0deb0t / triple_accel

Rust edit distance routines accelerated using SIMD. Supports fast Hamming, Levenshtein, restricted Damerau-Levenshtein, etc. distance calculations and string search.
MIT License
103 stars 13 forks source link

Implement Jaccard index? #2

Closed luizirber closed 4 years ago

luizirber commented 4 years ago

Hi @Daniel-Liu-c0deb0t!

I didn't dig down too much into the codebase, but I wonder if there is interest in adding a Jaccard index function (as described in Faster Population Counts Using AVX2 Instructions. The implementation from the paper is at https://github.com/CountOnes/hamming_weight/blob/1dd7554c0fc39e01c9d7fa54372fd4eccf458875/src/avx_jaccard_index.c and while it is a bit heavy on the macros I think it could be translated using the infra you already have in triple_accel?

(This would be incredibly useful for me, so I can start looking deeper into it if you're also interested =])

Daniel-Liu-c0deb0t commented 4 years ago

Hey @luizirber, I think it would be a bit difficult to integrate Jaccard distance into the current codebase, since it is mainly adapted for the Levenshtein distance calculation. It would likely be easier to make a separate crate for Jaccard index. I considered this feature before, but I thought it would be better to only focus on edit distance. Additionally, I'm current too busy with other projects to implement it. Sorry. Eventually, I would likely implement SIMD popcount in Rust in a separate project, since I don't believe anyone has done it yet and its very useful. If you have time to look into it, then I think it shouldn't be too hard to get a Rust port, as the code is not too long. The dense macros are just to make it work on multiple platforms.

luizirber commented 4 years ago

No problem =]

Daniel-Liu-c0deb0t commented 4 years ago

Hey @luizirber, I am interested in implementing an SSE/AVX popcount routine in another project: https://github.com/natir/nuc2bit. Perhaps you can use the code (when I am done) to implement a Jaccard distance function?

luizirber commented 4 years ago

Sounds great, and happy to see you working with Pierre =]

Daniel-Liu-c0deb0t commented 4 years ago

Hey, I wrote up a quick popcount implementation here using the shuffle-based algorithm. For Jaccard distance, you are going to have to modify the code to AND and OR the vectors before applying internal_popcount, in the innermost loop.

I haven't thoroughly benchmarked my implementation since I just finished it, but it should be pretty fast.