Closed douglaz closed 1 year ago
Interesting! Thanks for looking at this. Unless I'm mistaken this is the first use case where we've needed the values retrieved from the database to be in a specific order.
Looking at the options you've presented, it doesn't look like there's a backwards compatible way of solving this (either a db change or a consensus change seems to be needed). So it might be good to change something soon before there are too many fedimint deployments where this becomes more of a headache.
but it would still require changing the encoding because it's just inconsistent to write data in little endian using the usual ordering of fields. To work you would need to start with the last field, then go backwards
I didn't quite understand what you meant by this. Would the Rocksdb custom comparator be doable if we just switched around the order of the serialization of the Note's fields?
More generally we might want to think about supporting a "sort key" or something within a struct. We might run into this problem in the future with other structs.
So it might be good to change something soon before there are too many fedimint deployments where this becomes more of a headache.
As more context -- backwards incompatible change is fine now, but will be much harder in a month or two.
but it would still require changing the encoding because it's just inconsistent to write data in little endian using the usual ordering of fields. To work you would need to start with the last field, then go backwards
I didn't quite understand what you meant by this. Would the Rocksdb custom comparator be doable if we just switched around the order of the serialization of the Note's fields?
I think it's trickier than that; AFAIK the change would need to be for the whole database: you would need to change the way prefixes are written (it should be the last thing) and also create a custom prefix extractor.
So today we have:
`(DbKeyPrefix::Note, NoteKey.amount, NoteKey.nonce)`
It should be written as:
`(NoteKey.nonce, NoteKey.amount, DbKeyPrefix::Note)`
And also every other type in the same DB should be something like:
`(key fields..., DbKeyPrefix::type)`
At least for a naive comparator and a simple encoding. Alternatively, you could encode it in a more complex way and make the comparator know how to sort every type in your database. This of course would require a smart prefix extractor too.
More generally we might want to think about supporting a "sort key" or something within a struct. We might run into this problem in the future with other structs.
You could do this at encoding time (easier/faster) or at comparison time (trickier/slower).
But note you can't just change the sorting after you've inserted the data without some sort of data migration.
Hmm, sorry, I'm still not understanding. The prefixes we use today are all one byte, so big endian vs little endian shouldn't matter for the prefix, so I don't see why it's necessary to move the prefix to the end. I see in your PR you are creating a RocksDb iterator in the `Reverse` direction. Are you doing this because you need the notes in descending order? Is this why the prefix needs to move to the end? What am I missing?
Hmm, sorry, I'm still not understanding. The prefixes we use today are all one byte, so big endian vs little endian shouldn't matter for the prefix, so I don't see why it's necessary to move the prefix to the end. I see in your PR you are creating a RocksDb iterator in the `Reverse` direction. Are you doing this because you need the notes in descending order? Is this why the prefix needs to move to the end? What am I missing?
If you write everything in big endian you can use the current field order and the default byte-wise comparator, which should be something like:

```rust
fn be_comparator(a: &[u8], b: &[u8]) -> std::cmp::Ordering {
    a.cmp(b)
}
```
Because `a` will be something like:

```
[32, // 0x20 = DbKeyPrefix::Note
 0, 0, 0, 0, 0, 0, 1, 0, // amount = 256 in big endian
 <64 bytes of nonce in usual order>
]
```
And `b` may be something like:

```
[33, // 0x21 = DbKeyPrefix::OutputFinalizationData
 <32 bytes of transaction id in usual order>
 <8 bytes of out_idx in big endian>
]
```
If you want to use little endian, then the straightforward choice would be:
A) Reverse the order of all fields during encoding, including the prefix. Besides encoding big integers in little endian, you also need to reverse the order of vectors as a general rule. So in this example, `a`:

```
[<64 bytes of nonce in reverse order>
 0, 1, 0, 0, 0, 0, 0, 0, // amount = 256 in little endian (reverse order)
 32, // 0x20 = DbKeyPrefix::Note
]
```

And `b`:

```
[<8 bytes of out_idx in little endian (reverse order)>,
 <32 bytes of transaction id in reverse order>,
 33, // 0x21 = DbKeyPrefix::OutputFinalizationData
]
```
Then use as a comparator something like:

```rust
fn le_comparator(a: &[u8], b: &[u8]) -> std::cmp::Ordering {
    a.iter().rev().cmp(b.iter().rev())
}
```
B) Alternatively, if for some reason you want to keep the `DbKeyPrefix` as is and just change the fields of the keys, you would have something like `a`:

```
[32, // 0x20 = DbKeyPrefix::Note
 <64 bytes of nonce in reverse order>
 0, 1, 0, 0, 0, 0, 0, 0, // amount = 256 in little endian (reverse order)
]
```

And `b`:

```
[33, // 0x21 = DbKeyPrefix::OutputFinalizationData
 <8 bytes of out_idx in little endian (reverse order)>,
 <32 bytes of transaction id in reverse order>,
]
```
Then the comparator needs to understand this mess:

```rust
fn le_comparator_that_understands_key_prefix(a: &[u8], b: &[u8]) -> std::cmp::Ordering {
    if a[0] != b[0] {
        // the prefix byte is out of (reverse) order, so check it first
        return a[0].cmp(&b[0]);
    }
    // now continue in the usual reverse order, ignoring the prefix byte
    a[1..].iter().rev().cmp(b[1..].iter().rev())
}
```
Regarding the `Forward`/`Reverse` value of the RocksDB iterator: don't get confused, it is unrelated to this particular discussion.
Ok, I see. Thank you for the explanation! It looks like it comes down to the cleanliness of the comparison operator, which wasn't initially clear to me. I don't actually think your last example is that terrible; we already have code that assumes the presence of a prefix (see `from_bytes` in `fedimint-core/src/db/mod.rs`). IMO it is better to write a somewhat uglier comparator and change one database structure than change the entire layout of the database.
Also, we actually additionally append a fixed module byte and a 2 byte "module id" for database entries that are modularized to disambiguate between different modules (example above is using client data structures which don't have these yet). I think this is fine though since I don't think it makes sense to sort db structures between different modules.
Ok, I see. Thank you for the explanation! It looks like it comes down to the cleanliness of the comparison operator, which wasn't initially clear to me. I don't actually think your last example is that terrible; we already have code that assumes the presence of a prefix (see `from_bytes` in `fedimint-core/src/db/mod.rs`). IMO it is better to write a somewhat uglier comparator and change one database structure than change the entire layout of the database.
Note that any solution requires a backwards incompatible change. The simplest one IMHO is just to swap `le` functions for `be` functions, like I've already done here: https://github.com/fedimint/fedimint/pull/2088/files#diff-33eea2a08ddca078501b323f2038f6f0408f934c95406e533d474e4615c62136R132 (plus a few other places I didn't touch). But going the little endian way requires changing most macros and implementations of `Encodable`.
The little/big endian discussion confuses me a lot; endianness typically only applies to single fields/integers, not whole structs.
When it comes to how our DB orders keys I always assumed lexicographic ordering of the serialized bytes, e.g.:
1. `01 00`
2. `01 00 01`
3. `01 04`
4. `02 03`
5. `03 02`
6. `03 07`
If that isn't the case we'd need to review a lot of our code for this assumption.
With that in mind, having a key structure of `<prefix> <amount> <remainder>` would be sufficient. Notes would be selected as follows:
1. Get an overview of the number of notes by iterating over all elements with prefix `<prefix>` and count notes per denomination (only one key is held in memory at a time; the actual notes don't need to be parsed for that either, though this might need a special key struct)
2. Select note denominations using the notes-per-denomination info
3. For each selected denomination, query DB entries with prefix `<prefix> <amount>` and collect the appropriate number of notes from the stream. Only the selected ones are deserialized.

This way we don't need to fetch+deserialize the whole wallet every time. Step 1 can be further optimized by caching the number of available notes per denomination.
So the root issue is that Bitcoin consensus encoding is needlessly little endian and this breaks lexicographical sorting when we are using it for keys in the database? Or am I confused about anything?
The little/big endian discussion confuses me a lot; endianness typically only applies to single fields/integers, not whole structs.
But if your struct has integers wider than 1 byte, then it matters: the way you serialize the integer may or may not preserve the lexicographic ordering of the serialized bytes.
When it comes to how our DB orders keys I always assumed lexicographic ordering of the serialized bytes, e.g.:
1. `01 00` 2. `01 00 01` 3. `01 04` 4. `02 03` 5. `03 02` 6. `03 07`
If that isn't the case we'd need to review a lot of our code for this assumption.
Yes, I also had this assumption, because sometimes the code looks like it has this assumption, but after some testing I realized this isn't the case, due to integers being serialized as little endian.
With that in mind having a key structure of `<prefix> <amount> <remainder>` would be sufficient. Notes would be selected as follows:
1. Get overview of number of notes by iterating over all elements with prefix `<prefix>` and count notes per denomination (only one key is held in memory at a time, the actual notes don't need to be parsed for that either, this might need a special key struct)
2. Select note denominations using the notes-per-denomination info
3. For each selected denomination query DB entries with prefix `<prefix> <amount>` and collect the appropriate number of notes from the stream. Only the selected ones are deserialized.
This way we don't need to fetch+deserialize the whole wallet every time. Step 1 can be further optimized by caching the number of available notes per denomination.
It would be an interesting optimization to avoid deserializing the full note in some cases, but what you are describing isn't necessary for operations like `select_notes` if the keys have lexicographic ordering. See the PR.
There are some corner cases where the algorithm you are describing uses less memory than the one I've implemented. Without caching it will certainly be slower, but with caching it may be faster.
So the root issue is that Bitcoin consensus encoding is needlessly little endian and this breaks lexicographical sorting when we are using it for keys in the database? Or am I confused about anything?
Yeah, using little endian on big integers breaks lexicographical sorting.
It is related because Fedimint is using little endian for its big integers, and this was probably inspired by Bitcoin.
But note that, for instance, `Amount` is a Fedimint data structure, not a Bitcoin one, so we could serialize it in any suitable way.
In fact, AFAIK even Bitcoin uses big endian when storing keys in LevelDB, so you can't just say this is Bitcoin's fault.
This little test may help visualize the issue. It's similar to the serialization code used in Fedimint. Note that the sort will be done automatically by RocksDB when inserting/retrieving the serialized struct as a key; `Amount` may be a key or part of a key.
```rust
#[cfg(test)]
mod tests {
    #[derive(Debug, PartialEq, PartialOrd)]
    struct Amount(u64);

    impl Amount {
        fn serialize_be(&self) -> Vec<u8> {
            let mut data: Vec<u8> = Vec::new();
            data.extend(self.0.to_be_bytes());
            data
        }

        fn serialize_le(&self) -> Vec<u8> {
            let mut data: Vec<u8> = Vec::new();
            data.extend(self.0.to_le_bytes());
            data
        }

        fn deserialize_be(data: &Vec<u8>) -> Self {
            Self(u64::from_be_bytes(data.as_slice().try_into().unwrap()))
        }

        fn deserialize_le(data: &Vec<u8>) -> Self {
            Self(u64::from_le_bytes(data.as_slice().try_into().unwrap()))
        }
    }

    #[test]
    fn test() {
        let list = vec![Amount(255), Amount(256), Amount(1024)];

        let mut serialized_be = list
            .iter()
            .map(Amount::serialize_be)
            .collect::<Vec<Vec<u8>>>();
        // lexicographical sorting on serialized integers using big endian
        serialized_be.sort();
        println!("be: {:?}", serialized_be.iter().map(hex::encode).collect::<Vec<String>>());
        let list_from_be = serialized_be
            .iter()
            .map(Amount::deserialize_be)
            .collect::<Vec<Amount>>();
        // keeps ordering when deserializing
        assert_eq!(list_from_be, list);

        let mut serialized_le = list
            .iter()
            .map(Amount::serialize_le)
            .collect::<Vec<Vec<u8>>>();
        // lexicographical sorting on serialized integers using little endian
        serialized_le.sort();
        println!("le {:?}", serialized_le.iter().map(hex::encode).collect::<Vec<String>>());
        let list_from_le = serialized_le
            .iter()
            .map(Amount::deserialize_le)
            .collect::<Vec<Amount>>();
        // will be unable to keep ordering when deserializing
        assert_eq!(list_from_le, list); // fails :(
    }
}
```
It outputs:

```
be: ["00000000000000ff", "0000000000000100", "0000000000000400"]
le ["0001000000000000", "0004000000000000", "ff00000000000000"]
thread 'tests::test' panicked at 'assertion failed: `(left == right)`
  left: `[Amount(256), Amount(1024), Amount(255)]`,
 right: `[Amount(255), Amount(256), Amount(1024)]`', src/lib.rs:51:9
```
Note how `le` serialization sorts the list in the wrong order: `[256, 1024, 255]`.
Yeah, so a lot of people complain about Bitcoin consensus encoding using little endian. Fedimint copied `rust-bitcoin`'s encoding code, AFAICT, but then we diverged completely due to run-time extensible module support.
It seems to me that we should just migrate to big endian. I don't think anything is holding us back: we are never going to merge with Bitcoin consensus encoding anyway, due to module support, and it's kind of now or never. Soon we'll have backward compatibility to deal with.
Thx for the great explanation @douglaz, I get it now. I was under the faulty assumption that LE encoding would merely reverse the order, but as you point out it totally breaks it.
It seems to me that we should just migrate to big-endian.
I agree, just switching endianness for our own structs should be pretty painless and avoids completely changing the field order of all DB keys.
It seems to me that we should just migrate to big-endian.
I agree, just switching endianness for our own structs should be pretty painless and avoids completely changing the field order of all DB keys.
I will open a new PR to address this change. I think it's a good idea to add tests checking the lexicographical sorting properties of the serialized data structures so we can rely on it in future.
I agree, just switching endianness for our own structs should be pretty painless and avoids completely changing the field order of all DB keys.
It does unfortunately break backwards compatibility, as @douglaz has mentioned before. I assume we're ok with that since doing this later will be a real pain. If we want, we can add a database migration to read using LE and write it back using BE.
At the very least, the database backups will need to be updated.
If we want, we can add a database migration to read using LE and write it back using BE.
This would be an extreme PITA, as I guess we would need a whole new set of `Encodable` traits, and then a copy of every data structure to use one or the other, so we can write a migration that can read the old one and write the new one.
At the very least, the database backups will need to be updated.
I would just delete and regenerate from scratch instead. :shrug:
Replaced by https://github.com/fedimint/fedimint/pull/2183
Will fix https://github.com/fedimint/fedimint/issues/82 by streaming data from the DB in descending denomination order, then only using the required notes.
Why draft: only after implementing almost everything did I realize the key encoding using little endian won't lead to a sane ordering.
There are not many comments on the Internet, but the usual recommendation seems to be to always use big endian for keys in LevelDB/RocksDB to preserve ordering.
As changing the endianness may be impactful, I'm opening this PR earlier as a draft to discuss the possibilities. There are some options:
1) Change `NoteKey` to use big endian
2) Change the `Nonce` order
There is a variation of 1) and 2) where you could implement some custom RocksDB comparator, but it would still require changing the encoding, because it's just inconsistent to write data in little endian using the usual ordering of fields; to work you would need to start with the last field, then go backwards.
Note the code right now implements 1).