A diff doesn't necessarily mean there is an incompatibility. What matters is whether the serialization continues to work and match semantics remain the same.
Could you say more about what gave you the impression that there would never be any diff whatsoever?
It would also help to provide a reproduction so that I can more closely look at it.
I will also take a look at the changes made to regex-syntax in that version span when I get a chance and see if I can make some guesses.
@sffc
This is troubling on a few levels
Could you please say more?
We've verified forwards and backwards compatibility between 0.6.25 and 0.6.28 and are looking for confirmation that this follows from semver.
The regex that changes serialization is o|ho|8|(11(\.?\d\d\d)*(,\d*)?([^\.,\d]|$)) (whether a Spanish word starts with an 'o' sound). Serialization before, after.
The regex i|hi([^ae]|$) remains unchanged, if that helps.
use regex_automata::{dfa::dense::{Builder, Config}, SyntaxConfig};
let mut builder = Builder::new();
builder
.syntax(SyntaxConfig::new().case_insensitive(true))
.configure(Config::new().anchored(true).minimize(true))
.build(r"o|ho|8|(11(\.?\d\d\d)*(,\d*)?([^\.,\d]|$))")
.unwrap()
.to_sparse()
.unwrap()
.to_bytes_little_endian()
regex-automata = { version = "0.2", default-features = false, features = ["alloc"] }
From what I can tell, your regex uses the Unicode definition of \d, and between 0.6.25 and 0.6.28 there were updates to Unicode 14 and Unicode 15. That seems like the most likely explanation to me. Nothing else in the commit log stands out to me. I didn't 100% confirm it though.
We've verified forwards and backwards compatibility between 0.6.25 and 0.6.28 and are looking for confirmation that this follows from semver.
From the docs for DFA::from_bytes:
The bytes given must be generated by one of the serialization APIs of a DFA using a semver compatible release of this crate.
In other words, yes, the compatibility of the serialization format is indeed considered part of the API and is a semver guarantee. The part that confused me here is that you reported an issue without any visible breaking change, but rather, because the size of the serialized bytes changed. Size changes are not incompatible changes. Size changes may happen in patch releases because of a bug fix, or because of an optimization, or a pessimization, or functionality additions, or things like Unicode updates. That's what I'm confused about here.
The serialization format of a DFA even includes a version string. I'd be totally within my rights to completely change the format within a semver compatible release so long as the library continues to deserialize the old format correctly. (I do consider such things a last resort, but that freedom does exist, although I hope I never use it.)
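As a rough illustration of why a changed byte layout need not break compatibility, here is a minimal sketch of a version-tagged format. Everything in it is hypothetical (the 4-byte header, the `serialize_v1`/`serialize_v2`/`deserialize` names, the payload layouts); it is a toy and not regex-automata's actual format. It only shows that a reader which dispatches on the version tag can keep accepting old bytes after the writer changes.

```rust
// Hypothetical toy format, NOT regex-automata's real layout:
// a 4-byte little-endian version tag followed by a payload.
// Version 1 stores each state ID as one byte, version 2 as a u16 (LE).

fn serialize_v1(states: &[u8]) -> Vec<u8> {
    let mut out = 1u32.to_le_bytes().to_vec();
    out.extend_from_slice(states);
    out
}

fn serialize_v2(states: &[u16]) -> Vec<u8> {
    let mut out = 2u32.to_le_bytes().to_vec();
    for s in states {
        out.extend_from_slice(&s.to_le_bytes());
    }
    out
}

/// Reads either version by dispatching on the version tag.
fn deserialize(bytes: &[u8]) -> Result<Vec<u16>, String> {
    if bytes.len() < 4 {
        return Err("truncated header".to_string());
    }
    let (header, payload) = bytes.split_at(4);
    let version = u32::from_le_bytes([header[0], header[1], header[2], header[3]]);
    match version {
        // Old layout: widen each byte to a u16.
        1 => Ok(payload.iter().map(|&b| u16::from(b)).collect()),
        // New layout: two bytes per state.
        2 => {
            if payload.len() % 2 != 0 {
                return Err("truncated payload".to_string());
            }
            Ok(payload
                .chunks_exact(2)
                .map(|c| u16::from_le_bytes([c[0], c[1]]))
                .collect())
        }
        v => Err(format!("unsupported format version {}", v)),
    }
}

fn main() {
    let old = serialize_v1(&[1, 2, 3]);
    let new = serialize_v2(&[1, 2, 3]);
    // The bytes (and their sizes) differ...
    assert_ne!(old, new);
    // ...but both deserialize to the same states.
    assert_eq!(deserialize(&old), deserialize(&new));
}
```

In this sketch a byte-level diff between `old` and `new` says nothing about compatibility; what matters is that `deserialize` still accepts both and yields the same result.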
Ah that explains it, thanks.
Size changes are not incompatible changes.
No, but they're an indicator of behaviour changes, which, while allowed under semver, we want to investigate and avoid. We should have used the library in a way so that we don't observe behaviour changes across Unicode versions, that's on us.
I'd be totally within my rights to completely change the format within a semver compatible release so long as the library continues to deserialize the old format correctly.
Uhm I don't think so. The phrasing you use on the from_bytes and the to_bytes_... docs covers both directions. Data generated with 0.3 needs to work with 0.2, so you cannot do this.
Data generated with 0.3 needs to work with 0.2, so you cannot do this.
0.3 is semver incompatible with 0.2. Anything goes. Consider the implication of your interpretation. It would make any evolution of the serialization format effectively impossible as far as I can tell.
And to put a pin in it: DFAs generated by 0.3 (not yet released) will absolutely not be compatible with 0.2.
The phrasing you use
Which phrasing? It's always qualified with "semver compatible." I already quoted the relevant section for deserialization above. And for serialization:
The written bytes are guaranteed to be deserialized correctly and without errors in a semver compatible release
So you have zero guarantees about the stability of the serialization format across semver incompatible releases.
I have to ask that if you want to continue this discussion, then please start being more specific instead of just saying "the phrasing."
0.3 is semver incompatible with 0.2
Huh? What is semver compatible to 0.2?
0.2.1 is semver compatible with 0.2.0. 0.3.0 is not semver compatible with 0.2.0. But I'm not sure where you're confused. This might help: https://doc.rust-lang.org/cargo/reference/semver.html
In particular:
This guide uses the terms "major" and "minor" assuming this relates to a "1.0.0" release or later. Initial development releases starting with "0.y.z" can treat changes in "y" as a major release, and "z" as a minor release. "0.0.z" releases are always major changes. This is because Cargo uses the convention that only changes in the left-most non-zero component are considered incompatible.
Sorry I don't seem to understand semver and that's where the misunderstanding comes from.
Is 1.0 semver compatible with 1.1? So is everything shifted for pre-releases?
Again, see the bolded section from what I quoted:
This is because Cargo uses the convention that only changes in the left-most non-zero component are considered incompatible.
1.0.0 is semver compatible with 1.0.1, 1.0.9248, 1.1.0, 1.93850930.309359. It is semver incompatible with any version x.y.z such that x != 1.
In the Cargo ecosystem, semver compatibility is determined by the leftmost non-zero digit. So since the leftmost non-zero digit in 0.1.0 is 1, that makes it incompatible with 0.2.0, where the leftmost non-zero digit is 2. And it keeps going. 0.0.1 is, for example, semver incompatible with 0.0.2. Of course, any versions in which the leftmost non-zero digit appears in different places are also incompatible. So 1.0.0 and 0.1.0 both have the same leftmost non-zero digit, but they occur in different positions and are thus incompatible.
This is how every crate I'm aware of on crates.io functions. It's also why many folks don't mind staying at pre-1.0, because you can still get the benefits of semver. (The semver spec famously says that "anything goes" pre-1.0, but Cargo uses a stricter convention.)
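The rule described above can be sketched as a small function. This is a simplified illustration, not Cargo's implementation: the `cargo_compatible` name and the tuple representation are made up for the example, and pre-release tags and build metadata are ignored.

```rust
/// Returns true if two `(major, minor, patch)` versions are compatible under
/// the "leftmost non-zero component" convention: that component, and every
/// component to its left, must be identical in both versions.
/// Hypothetical helper for illustration; ignores pre-release tags.
fn cargo_compatible(a: (u64, u64, u64), b: (u64, u64, u64)) -> bool {
    if a.0 != 0 || b.0 != 0 {
        // At least one version is 1.0.0 or later: the majors must match.
        a.0 == b.0
    } else if a.1 != 0 || b.1 != 0 {
        // Both are 0.y.z: the minor component acts as the major.
        a.1 == b.1
    } else {
        // Both are 0.0.z: every change in z is a breaking change.
        a.2 == b.2
    }
}

fn main() {
    // Versions discussed in this thread.
    assert!(cargo_compatible((0, 6, 25), (0, 6, 28)));
    assert!(cargo_compatible((0, 2, 0), (0, 2, 1)));
    assert!(!cargo_compatible((0, 2, 0), (0, 3, 0)));
    assert!(cargo_compatible((1, 0, 0), (1, 93850930, 309359)));
    assert!(!cargo_compatible((0, 0, 1), (0, 0, 2)));
    // Same leftmost non-zero digit, different position.
    assert!(!cargo_compatible((1, 0, 0), (0, 1, 0)));
}
```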
We have observed a diff in a serialized regex that was caused by updating the regex-syntax crate from 0.6.25 to 0.6.28. We had been under the assumption that serialization is stable under Rust semver (i.e. major version), which does not seem to be the case.
Is serialization forwards and backwards compatible under the same major version?