Open dan-da opened 7 months ago
I wanted to see if it affects more than just [255; 40], so I modified the test to loop over 40 input values, starting with [254; 40] and setting one additional byte to 255 each iteration.
#[test]
pub fn bytes_in_matches_bytes_out() {
for x in 0..40 {
let mut bytes = vec![];
for y in 0..40 {
match y < x {
true => bytes.push(255),
false => bytes.push(254),
}
}
let bytes1 = <[u8; 40]>::try_from(bytes).unwrap();
let d1 = Digest::from(bytes1);
let bytes2: [u8; 40] = d1.into();
let d2 = Digest::from(bytes2);
if bytes1 != bytes2 {
println!("bytes1: {:?}", bytes1);
println!("bytes2: {:?}", bytes2);
}
assert_eq!(d1, d2);
assert_eq!(bytes1, bytes2);
}
}
Failing output is:
---- math::digest::digest_tests::bytes_in_matches_bytes_out stdout ----
bytes1: [255, 255, 255, 255, 255, 255, 255, 255, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254]
bytes2: [254, 255, 255, 255, 0, 0, 0, 0, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254]
thread 'math::digest::digest_tests::bytes_in_matches_bytes_out' panicked at twenty-first/src/math/digest.rs:487:13:
assertion `left == right` failed
left: [255, 255, 255, 255, 255, 255, 255, 255, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254]
right: [254, 255, 255, 255, 0, 0, 0, 0, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254]
So if the first 8 bytes (sizeof u64) are all 255, we start to see the behavior.
I don't understand Digest/BFieldElement internals, so just treating it as a black-box and documenting what I'm seeing.
I'm not sure what the best course of action is. To explain the behavior:
A finite field of prime order $p$ can be seen as $\mathbb{Z}/p\mathbb{Z}$, “the integers modulo $p$.” Any integer maps to some field element; (infinitely) many integers map to the same field element.
In the case of BFieldElement
, we have $p = \mathtt{0xffffffff00000001}$ giving rise to the following passing test:
#[test]
fn bytes_to_bfe() {
let bytes_0 = [0; BFieldElement::BYTES];
let bytes_1 = [0x01, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff];
let bfe_0 = BFieldElement::from(bytes_0);
let bfe_1 = BFieldElement::from(bytes_1);
assert_eq!(bfe_0, bfe_1);
}
Mathematically, this is expected and desired behavior: both the integer $0$ and the integer $p$ correspond to the finite field element $0$. In other words, there are u64::MAX - BFieldElement::P
$\approx 2^{32}$ BFieldElements
that have a non-unique 8-byte-representation. Converting a BFieldElement
to bytes gives the smallest possible number; the result is never 0xffffffffffffffff
because 0x00000000fffffffe
is the same field element.
One possible solution for the failing test you describe is to do a round-trip test the other way around: Digest to bytes to Digest. That might not be what you're after, though.
Yeah I wasn't sure if it's a real problem or just "surprising edge-case behavior". But I figured its best to raise, just in case.
Ok, so just thinking about this from perspective of Digest API user:
A Digest is a hash of some number. digest = hash(x).
So it is expected/required that x can never be recovered from digest. We can never do hash(x) --> digest --> x.
However in this case we are never calling hash(). We are directly instantiating a digest from an array of bytes, which may represent text. So it is surprising (seems wrong) that we sometimes do not get the same value out as we put in, especially since often/usually we do.
It sounds though that this is not really a supported usage. So ideally the API would be modified to make this usage impossible. That could mean that if the incoming bytes do not match the resulting BFieldElements that an error or panic is thrown. Alternatively it could mean that we don't provide a public API for directly instantiating a digest. Rather the only way to instantiate is via Digest::hash(). A final (easiest?) option would be to document this behavior clearly and perhaps make a test case around it demonstrating the behavior as correct/expected.
I guess a final option is to do nothing, as it shouldn't be affecting neptune-core.
ps: I was a little surprised that my hex proptest which does bytes --> digest --> hex --> digest --> hex didn't find this. I thought proptests were supposed to try edge cases like min/max, but maybe it is just doing random values, or maybe it's just not running enough iters.
I was a little surprised that my hex proptest which does bytes --> digest --> hex --> digest --> hex didn't find this. I thought proptests were supposed to try edge cases like min/max, but maybe it is just doing random values, or maybe it's just not running enough iters.
I'm surprised by this too; to the best of my knowledge, edge cases are preferred by proptest
. However, even with #[proptest(cases = 1_000_000)]
, I don't get a failing test case.
Yeah I wasn't sure if it's a real problem or just "surprising edge-case behavior". But I figured its best to raise, just in case.
Yeah, definitely worth raising. :+1:
So it is surprising (seems wrong) that we sometimes do not get the same value out as we put in, especially since often/usually we do.
I concur.
It sounds though that [instantiating a
Digest
directly] is not really a supported usage.
I concur. As far as I can tell, we only have that API for “internal” (albeit cross-crate) functionality.
So ideally the API would be modified to make this usage impossible. That could mean that if the incoming bytes do not match the resulting BFieldElements that an error or panic is thrown.
I like this approach in general, but it's a little tricky. Coming from the mathematical angle, I'd expect base field element 0xffffffffffffffff
to be identical to 0x00000000fffffffe
. However, to the average user that does not know or care about prime fields – and they shouldn't have to – this behavior is indeed odd, potentially seeming erroneous.
Since Digest::from
uses BFieldElement::from
under the hood – and it should keep doing that in my opinion – the oddities are inherited.
As a random, not fleshed-out idea: how about denying conversion of a hex &str
if it was not in canonical representation, but leaving From<[u8; Digest::BYTES]> for Digest
as-is? That way, if you throw a random assortment of bytes into the from()
, you might get surprising behavior, but the try_from_hex
does not propagate the surprise.
Alternatively it could mean that we don't provide a public API for directly instantiating a digest. Rather the only way to instantiate is via Digest::hash().
I think this a fine solution, too. Is it possible to #[doc(hidden)]
some impl From<X> for Y
? If so, we might want to consider that.
[Another] option would be to document this behavior clearly and perhaps make a test case around it demonstrating the behavior as correct/expected.
I like this solution less because it requires users to understand what is happening. This is similar to how I dragged you into understanding what's going on instead of things being obvious to you. (I think in this case it's arguabley less bad because you're a dev on the team, but even so.)
I guess a final option is to do nothing, as it shouldn't be affecting neptune-core.
I think there are hidden dangers lurking in this approach, and I like it the least. Someone somewhere down the line might write a tool that uses Digest::to_hex
assuming that this representation is unique, and because that assumption is incorrect, random stuff breaks.
I concur. As far as I can tell, we only have that API for “internal” (albeit cross-crate) functionality.
I made a little test where I changed:
impl From<[u8; Digest::BYTES]> for Digest {
fn from(item: [u8; Digest::BYTES]) -> Self {
to
impl Digest {
fn from(item: [u8; Digest::BYTES]) -> Self {
This changes the visibility of Digest::from
from public to private. I then ran cargo check --tests
in both twenty-first and neptune-core, and did not encounter any errors with either. It seems that nobody outside Digest module is directly using impl From<[u8; Digest::BYTES]> for Digest
, so there's no breakage if we decide to make it private, or (I think better) change it into a TryFrom.
afaik, there's no way (yet) to make an impl Trait
private in rust. Though I did find this article which is interesting, but doesn't really help here since we aren't defining the From
trait.
Anyway, my takeway is that we could rename from (for bytes) to a private from_bytes
or impl TryFrom instead without breaking upstream crates.
So ideally the API would be modified to make this usage impossible. That could mean that if the incoming bytes do not match the resulting BFieldElements that an error or panic is thrown.
I like this approach in general, but it's a little tricky. Coming from the mathematical angle, I'd expect base field element
0xffffffffffffffff
to be identical to0x00000000fffffffe
. However, to the average user that does not know or care about prime fields – and they shouldn't have to – this behavior is indeed odd, potentially seeming erroneous. SinceDigest::from
usesBFieldElement::from
under the hood – and it should keep doing that in my opinion – the oddities are inherited.As a random, not fleshed-out idea: how about denying conversion of a hex
&str
if it was not in canonical representation, but leavingFrom<[u8; Digest::BYTES]> for Digest
as-is? That way, if you throw a random assortment of bytes into thefrom()
, you might get surprising behavior, but thetry_from_hex
does not propagate the surprise.
I think this could work well in combination with the idea above to impl TryFrom instead of From. The is_canonical() check could exist there. try_from_hex
would then call try_from(bytes)
instead. note: I'm unsure how to do the is_canonical() check, so would leave that to you.
I'm presently leaning towards this solution as it seems clean and is non-breaking for our crates.
Note that serde serialization also works in a Try fashion expecting possible errors, so serialization would not be affected by the change. Caveat: unless a non-canonical value somehow becomes input to deserialize, in which case an error would result, but that should not happen in normal operation, as serialize() only produces canonical values.
Alternatively it could mean that we don't provide a public API for directly instantiating a digest. Rather the only way to instantiate is via Digest::hash().
I think this a fine solution, too. Is it possible to
#[doc(hidden)]
someimpl From<X> for Y
? If so, we might want to consider that.
I like this solution also in theory, but I believe it would be a significant breaking change. ::hash()
presently operates on self
. It doesn't accept any input bytes or BFieldElements. Given how pervasively Digest is used, I'd be nervous about making such a change.
[Another] option would be to document this behavior clearly and perhaps make a test case around it demonstrating the behavior as correct/expected.
I like this solution less because it requires users to understand what is happening. This is similar to how I dragged you into understanding what's going on instead of things being obvious to you. (I think in this case it's arguabley less bad because you're a dev on the team, but even so.)
agree.
I guess a final option is to do nothing, as it shouldn't be affecting neptune-core.
I think there are hidden dangers lurking in this approach, and I like it the least. Someone somewhere down the line might write a tool that uses
Digest::to_hex
assuming that this representation is unique, and because that assumption is incorrect, random stuff breaks.
I'm glad we agree. ;-)
my takeway is that we could rename from (for bytes) to a private
from_bytes
or impl TryFrom instead without breaking upstream crates.
Thanks for running that experiment! In that case, I vote for that approach.
I think this could work well in combination with the idea above to impl TryFrom instead of From. The is_canonical() check could exist there.
try_from_hex
would then calltry_from(bytes)
instead.
That sounds like a good plan. :ok_hand:
note: I'm unsure how to do the is_canonical() check, so would leave that to you.
The following could be included in b_field_element.rs
:
// feel free to change the signature as you see fit – `pub`? worth a `fn`?
pub const fn is_canonical(x: u64) -> bool {
x < Self::P
}
I'm happy to include this myself if you prefer me doing it, but that probably clears up any confusion. :slightly_smiling_face:
::hash()
presently operates onself
. It doesn't accept any input bytes or BFieldElements. Given how pervasively Digest is used, I'd be nervous about making such a change.
I think the method Digest::hash
is used very little if at all. The hash
that is being used pervasively, and produces most of our Digest
s that actually result from hashing stuff, comes from the AlgebraicHasher
trait and in particular Tip5
's implementation of it.
While the documentation of Digest::hash
suggests that it exists to serve Triton VM, we have to manually re-implement it there to provide additional functionality. As far as I can see, Digest::hash
might be a canditate for deprecation & eventual deletion.
I'm glad we agree. ;-)
:partying_face:
Ok, cool. I will prepare another PR with these changes.
I'm happy to include this myself if you prefer me doing it, but that probably clears up any confusion. 🙂
yep, thanks!
As far as I can see, Digest::hash might be a canditate for deprecation & eventual deletion.
ok, wow, ok. I didn't realize it isn't really used. I feel the Digest API is a little weird, or maybe the name. Without ::hash(), it's only a digest because 3rd parties choose to put hash results in it. It could be called BfeArray or something. (I'm not suggesting a name change, just musing).
It's only a
Digest
because 3rd parties choose to put hash results in it. It could be called BfeArray or something.
Well, Digest
is the return type of AlgebraicHasher::hash
, which arguably is the most canonical way to construct a Digest
, and is also 1st party. I think we're currently seeing the type from an angle that's neither the originally intended nor its primary use case, namely as keys for an index.
To give a little more perspective, the only times Triton VM directly constructs a Digest
through either new
, default
, from
/into
, or by executing the tuple constructor, is in tests. All other Digest
s come into being by actually hashing stuff[^sim].
[^sim]: When manually re-implementing the deprecation-candidate Digest::hash
we only need the internal array of BFieldElement
s and never even wrap it in a Digest
. But even there, the resulting “Digest
innards” are the result of hashing something.
I have deprecated Digest::hash()
in e3726c519dcfd4719e6cccc5cdc544ea0d8cbc27.
@dan-da, with a4daa23 and the more recent 7dafa32e both merged, is this still an issue?
In the course of testing hex encode/decode, I discovered that it is possible to create a
Digest
from input bytes that subsequently produces different output bytes.Here is a failing test case, suitable for use in digest.rs.
This test fails with:
note that:
bytes2
is quite different frombytes1
.d1
andd2
are equal, although d2 was created from bytes2.Taken together, this suggests that the input value is being normalized into a smaller internal max value.
Importantly, if the first line is changed to:
Then the test passes. So there is something special about [255; 40]. (and possibly other values?)
if bytes1 is somehow an invalid (too large) input, then it seems that
Digest::from()
should panic, and/or we should only supporttry_from()
. At present, we are silently losing/changing information, but only for certain input values.