BurntSushi / bstr

A string type for Rust that is not required to be valid UTF-8.
Other
802 stars 56 forks source link

RFC: 1.0 release? #40

Closed BurntSushi closed 2 years ago

BurntSushi commented 4 years ago

For those coming here that don't know what bstr is: it is a string library for &[u8]. The essential difference between the strings in bstr and the &str type in std is that bstr treats &[u8] as conventionally UTF-8 instead of requiring that it be UTF-8. Its main utility is in contexts where you believe your data is UTF-8 (but it might not be completely UTF-8) and you either don't have any information about what its actual encoding is or do not want to pay for the UTF-8 validity check. A common example of this is reading data from files. The bstr documentation says a lot more.

This issue is about releasing 1.0. Since I do not currently have any plans for a 2.0, I would like to get as many eyes on this as possible. If you have any feedback with respect to API breaking changes, I would love to hear about it.

OK, so I promise that the 1.0 release is imminent. Here is an exhaustive list of planned breaking API changes, all of which are currently present on master (some brought in via #104, others via #123):

Assuming this is an exhaustive list, and given that these are all very minor changes, I'm hopefully that migration from 0.2 to 1.0 will be very easy. Hopefully requiring no changes in most cases.

My plan is to release 1.0 on ~July 11, 2022~ ~July 18, 2022~ September 6, 2022. If you have feedback to give, please do so now. :-)

Note that I've published 1.0.0-pre.3 to crates.io. Docs: https://docs.rs/bstr/1.0.0-pre.3

(Below is the message I initial wrote a couple years ago. I'm way late to the party.)


It has been almost a year since I released bstr 0.2 with the major breaking change of moving most of the routines to extension traits. It seems like this has been a success. Namely, bstr's reverse dependency list is growing. More generally, I personally like working with the new API better than the old one. While I still bemoan the loss of a distinct type and its corresponding Debug impl as the One True Byte String, I think the benefits of the extension trait API have ended up outweighing that cost.

I've been giving some thought to bstr's API and its future evolution, and nothing immediately comes to mind in terms of breaking changes. That is, most everything I can think of are API additions, or at worst, deprecations. The only breaking change I can think of is to more carefully audit which API routines are available when unicode mode is disabled. I just want to make sure I'm not boxing myself into any corners there. (e.g., Some extant implementations might currently rely on std for its Unicode support, but it may wind up being the case that we want to re-implement some of those, which will require bringing in our own Unicode data. If that occurs, then those APIs should be gated behind the unicode feature.)

Otherwise, my feeling is that, unless I hear otherwise, I will make a 1.0 release in a few months. June 2020 is the 1 year anniversay of the 0.2 release, so that sounds like a good a time as any.

Thoughts?

cc @thomcc I know you've done some work on bstr and are actually using it, so would definitely appreciate if you have any thoughts here! Mostly what I'm looking for are things that we might want to do that will break the current API. While 1.0 doesn't necessarily mean "breaking changes must stop," I generally try to commit to a long termish period of stability for each major version in core libraries.

thomcc commented 4 years ago

My main question would be: do the BStr and BString types pull their own weight? They feel a bit redundant to me -- I essentially only ever use the former and even then, only as an argument in a fmt call, and that could just be a one-off wrapper returned by a method on the extension trait.

That said, part of the reason I don't use them much has been due to the fact that the library is pre-1.0 and these seem the most likely item to be removed to me.

In principal I like the idea of being able to specify something is a BString rather than a Vec<u8> -- in compound types like Vec<Vec<u8>> it can be a bit non-obvious what this represents, whereas Vec<BString> is much more clear (to me, anyway).

I think the main other changes I can think of would be API additions, and thus compatible.

BurntSushi commented 4 years ago

do the BStr and BString types pull their own weight? They feel a bit redundant to me -- I essentially only ever use the former and even then, only as an argument in a fmt call, and that could just be a one-off wrapper returned by a method on the extension trait.

I think so. There are two main use cases I envision for them:

  1. They provide a target for trait impls that is distinct from &[u8] and Vec<u8>. This is significant because the latter may be covered by blanket impls. (This is why serde_bytes exists. One can use bstr instead of serde_bytes since bstr has optional serde support.
  2. You can use BString and &BStr types internally instead of Vec<u8> and &[u8] to automatically benefit from their std::fmt::Debug impls. That way, you don't have to write custom debug impls.
thomcc commented 4 years ago

Yep, that all makes sense to me, and I hadn't considered point 1 at all.

lopopolo commented 4 years ago

Another place that BString and &BStr are useful is to wrap bytes before using them in assert_eq! in tests to get more readable debug output.

Byron commented 3 years ago

Now that gitoxide is thinking about stability as well it became clear that without a 1.0 bstr release this couldn't ever happen. I am glad to see this is actively explored here and will be looking forward to the eventual 1.0 release.

For reference, not using &BStr in gitoxide would be a great degradation of usability of parsed git objects which can't be forced into UTF-8 encoding. Turning these back into &[u8] would be something so regrettable that it simply has to be avoided.

BurntSushi commented 3 years ago

For reference, not using &BStr in gitoxide would be a great degradation of usability of parsed git objects which can't be forced into UTF-8 encoding.

Could you say more about this? In particular, I would like to hear more about its importance in public APIs.

Byron commented 3 years ago

A good example for its use is a freshly parsed commit whose memory is backed by a buffer somewhere else. The goal is to make clear that these fields represent user-readable strings, but in Rust there is no way to do this without enforcing an encoding or a custom type. That custom type is BStr and once I found it there was no way I would roll my own.

It also works nicely for the mutable version of such a commit.

When testing, working with BString/BStr felt natural and worked exactly as if one would expect.

Note that I put the bstr crate into public APIs knowing it goes against the prominent advice provided in the readme file, that's how indispensable it was 😅.

From there it only proliferated and it's now used in no less than 13 gitoxide crates.

Could you say more about this? In particular, I would like to hear more about its importance in public APIs.

To find a more concise answer: bstr is just too practical and there is a huge void for strings without known encoding in Rust that it seems to fill well for my use.

BurntSushi commented 3 years ago

@Byron Ah wow, that is great feedback. I agree that using BStr/BString in those APIs is probably the right thing, because otherwise the derived Debug representation of those types is going to be totally useless I imagine.

I don't think there is much work to be done to get bstr to 1.0. I'll do my best to prioritize it in the next month or so.

When do you plan to release 1.0 versions of crates that depend on bstr?

Byron commented 3 years ago

Thanks so much! Part of me was afraid I'd be told it's all wrong as it was so strongly discouraged 😅.

Also thanks for the prioritization, even though I think no matter how long it takes you, I will be slower. If it happens at the end of the year I would be served well already.

BurntSushi commented 3 years ago

Ah perfect. My time is super limited and devoting pretty much all of what little free time I have to regex. But I'll do my best to take a detour and whip bstr into a 1.0 release. I think it has had more than enough time to bake and there don't appear to be any obvious design flaws when compared to alternatives.

chrysn commented 3 years ago

Having this stable would also enable its use with the riot-wrappers crate, where for example process names are exposed (they're by all probability ASCII even, but an unsafe assume-it-is opens up for UB if a C user on the same system does something weird, and converting them to &str means pulling in UTF-8 checks, and either fallible operations or panics).

No hurry from me (I'll need to wait with this for my next API bump anyway), just a data point, and thanks for working on this!

Byron commented 2 years ago

Since there hasn't been a lot of movement in the past 9 months and since bstr is a very, very important dependency for gitoxide and probably countless other crates, I wonder what I can do to help. The reason for me getting a bit more pushy is the ongoing work to get gitoxide into cargo.

I am hereby happily offering to step in as collaborator or maintainer and serve as helping hand for whatever @BurntSushi deems right.

Thank a lot :)!

BurntSushi commented 2 years ago

Folks, I've updated the top comment of this issue to include a list of breaking changes (all of which are minor). My plan is to push out the 1.0 release on Monday, July 11, 2022. So please, if you have comments or feedback or other ideas for breaking changes, now is the time to do it.

I am hereby happily offering to step in as collaborator or maintainer and serve as helping hand for whatever @BurntSushi deems right.

For what it's worth, it would have probably been an order of magnitude more work to onboard someone else to do the 1.0 release than to just do it myself. (Which ended up taking almost a full day of work.) It's important to remember this when offering to maintain a project. It's not as simple as me just clicking a button and giving someone permissions. :-)

Byron commented 2 years ago

For what it's worth, it would have probably been an order of magnitude more work to onboard someone else to do the 1.0 release than to just do it myself.

That's certainly true, yet I find it worth clarifying that my offer wasn't limited to the 1.0 release. In any case, thanks for this wonderful crate and the continued maintenance efforts!

joshtriplett commented 2 years ago

I sincerely hope that the version of bstr after 1.0 is simply std (or, more precisely, core and alloc as appropriate).

Thank you for such a fundamental building block of the ecosystem.

BurntSushi commented 2 years ago

Note that I've published 1.0.0-pre.1 to crates.io. It's not on docs.rs yet though.

BurntSushi commented 2 years ago

dddd0 on reddit asked:

Might as well ask here: Why does for_byte_record_with_terminator in BufReadExt consume self / the underlying BufRead? It doesn't seem to be necessary to me.

Indeed, this is I believe a good point. In fact, all of the methods on BufReadExt consume self. I believe the only ones we actually want to consume self are byte_lines and byte_records, since they each return an iterator generic over Self. But the other methods all execute closures, and so I suspect &mut self would be more appropriate there. (It's likely this isn't a show-stopping issue in practice since you can probably always convert your T: BufReadExt to a &mut T: BufReadExt and get away with calling the for_ methods without actually consuming your reader, but it does look like a wart.)

omac777 commented 2 years ago

I read and understand you want bstr to allow for more possibilities around bytes: -string of bytes conforming to utf-8 -string of bytes non-conforming to utf-8, but it's encoding is yet to be determined.

Yesterday, an important and lengthy cloud backup failed to complete. Why? Because the name of either the file or directory contained MACOS emojis and the backup's destination service does not like those kinds of characters(non-conformant). It's related because the file system on MACOS/UNIX/BSD accepts those character ranges be it visible or invisible. Linux xfs/btrfs/ext4 don't have any issues with those character ranges either.

Here is short specific example of an error caused by non-conformance. It may not seem like a big thing, but it is considering it's for a company's backup of data and it delays the backup for the entire set of data. Human intervention is necessary at this point.

rsync --archive someSourceDir/ /mnt/tapedrive/someDestinationDir/

within some subdirectory however there is a file called "blah:foo" notice the colon in there. The backup continues for hours then errors out on the ':' character and does not backup the file in question. If it's a directory name with a colon, it does not backup the directory in question. In other words it complicates matters. Why is that? LTFS formatted tape file systems don't like that ':' colon character in file/directory names. So the use case scenario with an emoji in a filename or directory name brought up another error but this time not on tape backups, but on cloud backups. These are big deals on traditional UNIX-based file systems, but it's a big deal on anything that isn't. I'm guessing AWS S3 and other cloud services have issues like this.

The above brings me to suggest these in the bstr API for helping determine if filenames and directory names are going to be ok before they land on said backup storage destination: bstr::Is_LTFS_Conformant() bstr::Is_S3_Conformant() bstr::Is_Google_Cloud_BitbucketName_Conformant() bstr::Is_Optical_UDF_FS_Conformant()

Oddly enough the only way I've found to circumvent these issues is by tar'ing someSourceDir and then rsync'ing the tar to said destination backup storage, but that's not always the ideal situation. It would be better to find those non-conformant filenames/directory names beforehand and propose some auto-correcting measures to conform to the respective storage range of characters and such.

Why am I mentioning it here? It's because strings, bstr, path are the usual go-to types for holding filenames and directory names. It would be good to have such helper tools available to help everybody save time when doing their own backups without experiencing any errors.

Also to further prevent these kinds of interoperability issues, why not place in the filesystem drivers new constraints that disallow emojis and other unwanted character ranges as filenames and directory names from the beginning. i.e. the openfile/createfile would disallow ':', emojis and such on every operating system. Get all the operating system and storage providers to collaborate on this once and for all.

Thank you for listening. Cheers.

BurntSushi commented 2 years ago

@omac777 Let's please try to keep this thread focused on 1.0 concerns. This isn't an RFC for any arbitrary feature requests, but issues specific to a 1.0 release. I've moved your comment into #109 and responded there.

BurntSushi commented 2 years ago

OK, the docs for what's currently on master have been published: https://docs.rs/bstr/1.0.0-pre.1/bstr/

(I haven't made the changes to stop consuming self for some of the methods on BufReadExt yet.)

Byron commented 2 years ago

I think helpful to address would be issues that could potentially lead to breaking changes before 1.0. Not truly understanding if this is the case or not with what I am about to state, let me share one convenience issue in particular: I am having trouble sometimes to create a BStr instance, usually in situations where type inference bails out preventing "string".into() to work.

Recently I saw code that had to do [].as_bstr() to get an empty &BStr because type inference wouldn't know that a BStr was required. Maybe it's easy enough to have a constructor for empty BStr, and also a function to create new BStr instances directly, like bstr::BStr::empty() or bstr::BStr::new("hello"), the latter similar to std::path::Path::new(…).

BurntSushi commented 2 years ago

@Byron I think these are the related issues for that, right? #84 and #86. I specifically skipped over those because I do not believe there are any breaking changes required to fix those things. But maybe I'm wrong.

Basically, when I created BString and BStr, I specifically tried to leave those types as devoid of methods as I could, since their principle purpose is to act as a Deref point for Vec<u8> and &[u8], respectively, and as a target for trait impls that want "byte string" specific behavior (like std::fmt::Debug, serde::Serialize and serde::Deserialize). Adding more methods on to them makes the deref situation a little more dicey. But if one can convince me that it's okay, then I'd generally be happy to add methods or constructors to them. To be honest, I just haven't done the investigation required for that myself.

One possible "idiom" here is that I believe B("").as_bstr() will always work. But if we can convince ourselves that a BStr::new is OK (with basically the same signature as B, except it returns a &BStr), then I'd be more than happy to add that. Thankfully, it's not a breaking change though.

Byron commented 2 years ago

@Byron I think these are the related issues for that, right? https://github.com/BurntSushi/bstr/issues/84 and https://github.com/BurntSushi/bstr/issues/86. I specifically skipped over those because I do not believe there are any breaking changes required to fix those things. But maybe I'm wrong.

84 is definitely related, maybe having other ways to create a &BStr would reduce some pressure there.

But if one can convince me that it's okay, then I'd generally be happy to add methods or constructors to them. To be honest, I just haven't done the investigation required for that myself.

Thanks for your openness! It's certainly not the case here either as I am far from an actual investigation, all I have is nearly every-day usage and some returning pains. I think I was struck by fear that fixing any of these would lead to another breaking change which better happens soon then.

One possible "idiom" here is that I believe B("").as_bstr() will always work. But if we can convince ourselves that a BStr::new is OK (with basically the same signature as B, except it returns a &BStr), then I'd be more than happy to add that.

BStr::new() seems like the most intuitive, and for now it doesn't appear to conflict with the Deref targets either. I'd be willing to deal with it when it happens, and take the risk, for more convenience now. It's certainly a small thing as well, I found myself creating little helper methods for that throughout the codebase.

BurntSushi commented 2 years ago

I found myself creating little helper methods for that throughout the codebase.

Can you create a new issue with these? That is, just copy and paste your helper routines. (Or if they fit better in an existing issue, that's fine too.)

BurntSushi commented 2 years ago

I've added another breaking change: restructure serde feature flags

Now that bstr has an 'alloc' feature, we need to rethink how we setup the serde feature flags. Previously, all we had was 'std' and 'no std'. But now we have 'std', 'alloc only' and 'core only'. In particular, 'no std' is split into 'alloc only' and 'core only', since neither one bring in std. To reflect this trichotomy, we rename 'serde1' to 'serde1-std', and split 'serde1-nostd' into 'serde1-alloc' and 'serde1-core'.

BurntSushi commented 2 years ago

I've added another breaking change: several methods on BufReadExt now take &mut self instead of self.

epage commented 2 years ago

Now that bstr has an 'alloc' feature, we need to rethink how we setup the serde feature flags. Previously, all we had was 'std' and 'no std'. But now we have 'std', 'alloc only' and 'core only'. In particular, 'no std' is split into 'alloc only' and 'core only', since neither one bring in std. To reflect this trichotomy, we rename 'serde1' to 'serde1-std', and split 'serde1-nostd' into 'serde1-alloc' and 'serde1-core'.

Would it be worth bumping the MSRV to 1.60 and use weak / namespaced feature flags?

This way you have

BurntSushi commented 2 years ago

@epage Yeah I mused about this here: https://github.com/BurntSushi/bstr/pull/111#issuecomment-1180477086

Basically, I kind of feel like 1.60 is really way too new. My first approximation is to track Debian stable (which is at Rust 1.48), but in principle, I'm okay with something newer.

Unfortunately, there does seem to be a fundamental conflict here. Namely, if we move forward with the existing trichotomy of serde features, then it's going to have to stay that way since I think moving to the scheme you propose would be a breaking change. (Not the MSRV bump, that's fine, but changing the features around.) Since I don't see myself putting out a bstr 2.0 any time soon, it would effectively freeze bstr into a slightly sub-optimal feature configuration. But if I do adopt the nicer feature setup, then I'm being somewhat aggressive with the MSRV.

I personally don't mind being a bit aggressive, but I suppose it depends on my users.

@lopopolo @thomcc @Byron @TethysSvensson How do you feel about Rust 1.60 as the MSRV?

BurntSushi commented 2 years ago

With respect to MSRV, if Rust 1.60 is too new for some folks, then I suppose they could stick with bstr 0.2 until Rust 1.60 is no longer too new.

BurntSushi commented 2 years ago

Given there's a question open here, I'm going to delay the 1.0 release another week.

epage commented 2 years ago

Speaking of

ByteVec::into_os_string now returns Result<OsString, FromUtf8Error> instead of Result<OsString, Vec>.

In the docs, it says

  1. One could re-implement WTF-8 and re-encode file paths on Windows to WTF-8 by accessing their underlying 16-bit integer representation. Unfortunately, this isn’t zero cost (it introduces a second WTF-8 decoding step) and it’s not clear this is a good thing to do, since WTF-8 should ideally remain an internal implementation detail. ...

While this library may provide facilities for (1) in the future, currently, this library only provides facilities for (2) and (3).

(1) could be simplified, depending on what is done with OsStr, like https://github.com/rust-lang/rust/pull/95290

Should the existing functions have more specific names to open the door for the more general functions to be added later, if possible?

Non-blocker for 1.0: would it also be worth pointing people to os_str_bytes to help people who want (1)?

EDIT: Now split out as #115, #116

epage commented 2 years ago

As a usability note, I wrote an application with BStr in the internal APis and kept getting tripped up that I needed as_bstr() or .map(ByteSlice::as_bstr) at the end of any transformation I did on a BStr to get it back into its own type. In the ideal world, the output type would match the input type.

I'm assuming this is fully intentional and won't be changing but figured it was still worth noting.

EDIT: Split out into #117

lopopolo commented 2 years ago

@BurntSushi 1.60.0 MSRV works for me. Applications I build with bstr target the latest stable and library crates I've built with bstr have a "bump MSRV in minor versions" policy which I'd be happy to exercise to accommodate a higher bstr MSRV.

On a related note, I've been using the weak features and namespaced features in recent cargo and have found them to be quite pleasant to work with. Using serde1 = ["dep:serde", ...] means that a --features serde flag is no longer valid like it is now.

BurntSushi commented 2 years ago

As a usability note, I wrote an application with BStr in the internal APis and kept getting tripped up that I needed as_bstr() or .map(ByteSlice::as_bstr) at the end of any transformation I did on a BStr to get it back into its own type. In the ideal world, the output type would match the input type.

I'm assuming this is fully intentional and won't be changing but figured it was still worth noting.

Right yeah. That's what bstr 0.1 was: https://docs.rs/bstr/0.1.4/bstr/

TethysSvensson commented 2 years ago

How do you feel about Rust 1.60 as the MSRV?

I think it's probably the right choice in this situation, given that you are trying to stabilize.

In planus, our base policy is to only bump the MSRV to versions that are at least six months old, which would mean that we could not start using bstr until October. That being said, it's a policy we try to follow, but break when there is sufficient reason to do so.

(While writing this, I am also realizing that our MSRV simultaneously says 4 versions of rust and 6 months, which are two very different things)

BurntSushi commented 2 years ago

@lopopolo Yeah the new Cargo feature stuff is awesome. I very much like that it gives more control over what's a public feature.

@epage See https://github.com/BurntSushi/bstr/issues/5 and https://github.com/BurntSushi/bstr/pull/8 for more details on why bstr 0.1 -> 0.2 went from concrete BString/BStr types to extension traits on Vec<u8>/[u8]. It is slightly sad that if you have a BStr and call an extension trait method that returns a subslice of it that you get a &[u8] back instead. I'm not sure there is any simple machinery that would let me fix that while maintaining the existing extension trait API on [u8].

Happy to answer more questions on that but would prefer opening a new issue for it if you have more questions.

Byron commented 2 years ago

How do you feel about Rust 1.60 as the MSRV?

I'd be OK with it, thanks for asking.

lopopolo commented 2 years ago

Hi @BurntSushi I know this is late in the game for 1.0, but I wanted to float an idea of an optional dependency on simdutf8 for a SIMD-accelerated fast path to crate::utf8::validate:

https://github.com/BurntSushi/bstr/blob/0d9d22237f859d359ec676cff255b848c772e5a3/src/utf8.rs#L463-L471

I'm not sure how you'd want to structure this, but maybe an optional performance feature a la regex (that maybe does nothing for 1.0 but leaves the possibility for additional deps open in subsequent releases?).

Edit: coming back to this, I'm actually not sure whether I should expect the DFA in bstr or SIMD routines in simdutf8 to be faster. I just associate SIMD with 🏎️

BurntSushi commented 2 years ago

Yeah that's not a 1.0 concern. I'm unlikely to take a dependency for that. Instead, I would rather see the implementation ported to bstr or maybe even memchr. That said, I've never looked at it in depth, so I'm unsure of its complexity.

Feel free to open a new issue about this to avoid cluttering up the 1.0 release thread.

For this thread, I would really like to keep it scoped to 1.0 concerns.

Lokathor commented 2 years ago

September update on Rust versions in the Linux world:

1.60 seems a completely reasonable MSRV to start with for 1.0. Linux distros will catch up to 1.60 "very soon", and the 0.2 version is still usable until then.

You could also document that the MSRV will stay up to date with Debian Stable once they eventually get past 1.60, or whatever other reference point.

BurntSushi commented 2 years ago

OK, bstr 1.0.0-pre.3 is up: https://docs.rs/bstr/1.0.0-pre.3/bstr/

I've rejiggered/simplified the features (just a serde feature and no more serde1-{std,alloc,core} goop) and bumped the MSRV to Rust 1.60.

My new 1.0 release date is September 6, 2022. If there are any last comments on backward incompatible changes, now is the time to make them. :)

BurntSushi commented 2 years ago

Done!

1.0 tag: https://github.com/BurntSushi/bstr/releases/tag/1.0.0

Blog post: https://blog.burntsushi.net/bstr/

lopopolo commented 2 years ago

Congrats @BurntSushi! This is huge. Thank you so much for all of the effort put into this high quality, foundational crate. bstr makes so much possible for me and my projects. 🙇