camino-rs / camino

Like Rust's std::path::Path, but UTF-8.
https://docs.rs/camino
Apache License 2.0
435 stars 16 forks source link

Gauging level of interest in using simdutf8 #17

Open mjc opened 2 years ago

mjc commented 2 years ago

I have a project that does a bunch of path manipulation and benefits a bit from using a wrapper with the simdutf8 crate.

If I find time, and there is interest, I would happily work on a PR to add optional simdutf8 support. It's not ready for production so it definitely should be optional.

I'd love to be able to simplify a bunch of this code the way camino would normally let me.

sunshowers commented 2 years ago

Thanks for the issue! I'm totally open to this as an optional feature, since almost all paths are valid UTF-8. For the PR, could you include some benchmarks + results you see on them (I like the criterion library)?

sunshowers commented 2 years ago

It's not ready for production so it definitely should be optional.

Could you elaborate on what this means? This is a bit concerning to me.

mjc commented 2 years ago

Actually, it looks like it's just the documentation being out of date! My bad.

From the README:

"This library has been thoroughly tested with sample data as well as fuzzing and there are no known bugs."

https://github.com/rusticstuff/simdutf8

sunshowers commented 2 years ago

Thanks! It also looks like it switches to std for fewer than 64 bytes. I'm really curious how much of a benefit this is in real world use because most paths (except I guess on Windows) tend to be shorter than 64 bytes.

mjc commented 2 years ago

Yeah, it's probably not a huge benefit for most cases.

In my particular use case most of the paths are for a cache server, so they are broken up like this (and usually north of 100 characters):

/tank/application/usa/east/ny/cache/c/f/2/3/cf23df2207d99a74fbe169e3eba035e633b65d94

And then there's a fair amount of other path munging in the hot path. I get about a 1% per request boost out if it, which I suspect is the best possible case.

I totally understand if that doesn't feel worth the maintenance burden, you are right that most paths aren't that long.

sunshowers commented 2 years ago

Yeah, I think the benefits are with longer paths more than anything else, and I'm worried that the branching for shorter paths will reduce perf (will have to see this in benchmarks of course.)

How about we leave this issue open for now and see if others are interested in it? (If you're working within a corporate environment, feel free to patch camino locally of course!)

sunshowers commented 2 years ago

Having thought about this for a bit, I'm quite interested in seeing what this would look like. Could you put up some benchmarks for paths of various sizes and ASCII vs non-ASCII, etc? camino 1.0.8 has a proptest1 feature which might be useful for generating random paths.

cgwalters commented 2 years ago

/tank/application/usa/east/ny/cache/c/f/2/3/cf23df2207d99a74fbe169e3eba035e633b65d94

You may benefit from using https://github.com/bytecodealliance/cap-std/ - on modern Linux then you can open a directory fd (Dir) for /tank/application/usa/east/ny/cache/, then the resulting filenames are much smaller and more amenable to stack allocation even.