BurntSushi / suffix

Fast suffix arrays for Rust (with Unicode support).
The Unlicense
263 stars 30 forks source link

Clippy #20

Closed chris-ha458 closed 1 year ago

chris-ha458 commented 1 year ago

some parts are old. I tried to do a clippy run.

all tests pass

BurntSushi commented 1 year ago

Thanks, LGTM. I would say hold off. I'm not really sure it's worth investing much of our time here. I don't think anyone is using this code.

chris-ha458 commented 1 year ago

Thanks, LGTM. I would say hold off. I'm not really sure it's worth investing much of our time here. I don't think anyone is using this code.

Are you speaking about this particular part of the code or rust based suffix array in general?

I am looking for unicode aware suffix code implementations and this is one of the most recent and well maintained ones.

currently tokenizers uses esaxx crate which is both a wrapper around cpp based esaxx and a line by line rust port of it. The rust version is much slower than the c version. I think it could be rewritten in idiomatic rust to make it better, and wanted to look into other rust implementations.

That led me here. I'm trying to understand this code, and make changes if necessary. The end goal would be to use it if possible, or learn from it to reimplement. (I don't think I need to tell you, or anybody that writing a suffix array groud up is very difficult haha)

Of course if you feel like that is not a good idea for any reason, or would like me to open a separate issue to discuss this, feel free to let me know.

BurntSushi commented 1 year ago

I see.

The reality is that very few people are using this crate at all for any reason. I myself never found a satisfying way to use it. The other reality is that it's been literal years since I've looked at this code---or even thought about it---at any depth. On top of that, I am stretched thin with other projects. What this means is that while I found a little time to review a simple Clippy PR, I'm not going to be able to find time to review anything more substantial than that. That in turn means that if you want to use this code and fix it up or adapt it or improve it, then my suggestion would be to fork it into your own crate.

If you go down that path and it is still being maintained a ~year from now, ping me and I can deprecate this crate and point folks your way.

currently tokenizers uses esaxx crate which is both a wrapper around cpp based esaxx and a line by line rust port of it. The rust version is much slower than the c version. I think it could be rewritten in idiomatic rust to make it better, and wanted to look into other rust implementations.

How does this crate compare to the C version?

chris-ha458 commented 1 year ago

How does this crate compare to the C version? I'm not sure if were asking wrt esaxx_rs ( C and rust) vs suffix (only rust) or the original esaxx (Cpp only) vs esaxx_rs, but I'll answer both.

esaxx uses the yuta mori's sais-lite to implement a suffix tree. Both are cpp based. Due to extensive code and input output format differences none are really comparable to any rust versions. Most are fast and relatively safe but have hidden memory unsafe issues. Due to the complexity of the code and implementation it would be difficult for most fuzzing or static analysis to catch them all.

esaxx was used in the sentencepiece tokenizer to implement the unigram tokenizer algorithm

esaxx_rs is Narsil's (I hesitate to ping him since he is also a very busy rust developer working on tokenizers and transformers at huggingface) attempt to reimplement esaxx into rust so as to be able to use it in the rust based tokenizers library.

There are two implementations within esaxx_rs with one being a thin wrapper around the c code and another being an almost line by line port of the rust code. It still managed to find some memory safety bugs but due to extensive compiler checks that are silently introduced due to extensive memory and array accesses it is 2~3 times slower than the c-wrapper version.

From what I can tell, suffix array is very idiomatical rust approach to suffix arrays. The underlying algorithm is similar (same?) as well.

Both are unicode aware as far as I can tell. The c wrapper version is unicode aware as well (AFAICT) because it is fed data structures from rust types.

Due to subtle differences in input and output types and datastructures, esaxx_rs and suffix are not drop in compatible with each other.

The reality is that very few people are using this crate at all for any reason. I myself never found a satisfying way to use it.

I find a way, atleast for testing and exploration purposes.

The other reality is that it's been literal years since I've looked at this code---or even thought about it---at any depth. On top of that, I am stretched thin with other projects. What this means is that while I found a little time to review a simple Clippy PR, I'm not going to be able to find time to review anything more substantial than that. That in turn means that if you want to use this code and fix it up or adapt it or improve it, then my suggestion would be to fork it into your own crate.

I understand your situation and would like to express my gratitude for you making this crate, and for working on many other foundational crates in the rust open-source ecosystem.

Exactly because I recognize the weight and responsibility of an undertaking such as this I would have to say it is unlikely that I would make something totally new, but if I ever get around to atleast address either crate to a stage that could be independently maintained by me or anybody who can, I'll let you know.

In the meantime, I might post some issues or PRs from time to time here, but do not feel any urgencies to respond or address them. If I do, it would be under the express understanding that you are very busy and have limited bandwidth and would be made accordingly.

Of course if you'd rather I not open any issues or PRs at all, even if it were for communicating information to others that might be looking in here, than I will respect that as well.