fedimint / fedimint

Federated E-Cash Mint
https://fedimint.org/
MIT License

feat: Avoid loading all notes in memory #2088

Closed. douglaz closed this 1 year ago

douglaz commented 1 year ago

Will fix https://github.com/fedimint/fedimint/issues/82 by streaming data from the DB in descending denomination order, then using only the required notes.

Why draft: only after implementing almost everything did I realize that the little-endian key encoding won't lead to a sane ordering.

There isn't much discussion of this on the internet, but the usual recommendation seems to be to always use big endian for keys in LevelDB/RocksDB to preserve ordering.

As changing the endianness may be impactful, I'm opening this PR early as a draft to discuss the possibilities. There are some options:

  1. Change the consensus encoding to use big endian for primitives
  2. Change only NoteKey to use big endian
  3. Figure out what the denominations are, then make one query per denomination to load the notes in random Nonce order
  4. Change database
  5. Forget about it: if you are rich enough to have a big wallet, you are rich enough to have a powerful computer/smartphone

There is a variation of 1) and 2) where you could implement a custom RocksDB comparator, but it would still require changing the encoding, because it's simply inconsistent to write data in little endian while keeping the usual ordering of fields. To make it work you would need to start with the last field, then go backwards.

Note that the code right now implements 1).

m1sterc001guy commented 1 year ago

Interesting! Thanks for looking at this. Unless I'm mistaken this is the first use case where we've needed the values retrieved from the database to be in a specific order.

Looking at the options you've presented, it doesn't look like there's a backwards compatible way of solving this (either a db change or a consensus change seems to be needed). So it might be good to change something soon before there are too many fedimint deployments where this becomes more of a headache.

but it would still require changing the encoding, because it's simply inconsistent to write data in little endian while keeping the usual ordering of fields. To make it work you would need to start with the last field, then go backwards

I didn't quite understand what you meant by this. Would the RocksDB custom comparator be doable if we just switched around the order of serialization of the Note's fields?

More generally we might want to think about supporting a "sort key" or something within a struct. We might run into this problem in the future with other structs.

justinmoon commented 1 year ago

So it might be good to change something soon before there are too many fedimint deployments where this becomes more of a headache.

As more context: a backwards-incompatible change is fine now, but will be much harder in a month or two.

douglaz commented 1 year ago

but it would still require changing the encoding, because it's simply inconsistent to write data in little endian while keeping the usual ordering of fields. To make it work you would need to start with the last field, then go backwards

I didn't quite understand what you meant by this. Would the RocksDB custom comparator be doable if we just switched around the order of serialization of the Note's fields?

I think it's trickier than that. AFAIK the change would need to apply to the whole database: you would need to change the way prefixes are written (the prefix should come last) and also create a custom prefix extractor.

So today we have:

(DbKeyPrefix::Note, NoteKey.amount, NoteKey.nonce)

It should be written as:

(NoteKey.nonce, NoteKey.amount, DbKeyPrefix::Note)

And every other type in the same DB should be something like

(key fields..., DbKeyPrefix::type)

At least for a naive comparator and a simple encoding. Alternatively, you could encode it in a more complex way and make the comparator know how to sort every key type in your database. This of course would require a smart prefix extractor too.
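To illustrate the point, here is a small self-contained test (not Fedimint code; the prefix byte, nonce size, and amounts are made up) showing that the "reversed fields, prefix last" little-endian layout combined with a reversed-bytes comparator yields the same ordering as the big-endian layout with the default bytewise comparator:

#[test]
fn reversed_le_layout_matches_be_layout_ordering() {
    let prefix: u8 = 0x20; // hypothetical DbKeyPrefix byte
    let nonce = [0xABu8; 4]; // stand-in for the real nonce bytes
    let amounts = [255u64, 256, 1024];

    // Layout (prefix, amount BE, nonce) with the default bytewise comparator.
    let mut be_keys: Vec<Vec<u8>> = amounts
        .iter()
        .map(|a| {
            let mut key = vec![prefix];
            key.extend(a.to_be_bytes());
            key.extend(nonce);
            key
        })
        .collect();
    be_keys.sort();

    // Layout (nonce reversed, amount LE, prefix) with a reversed-bytes comparator.
    let mut le_keys: Vec<Vec<u8>> = amounts
        .iter()
        .map(|a| {
            let mut key: Vec<u8> = nonce.iter().rev().copied().collect();
            key.extend(a.to_le_bytes());
            key.push(prefix);
            key
        })
        .collect();
    le_keys.sort_by(|a, b| a.iter().rev().cmp(b.iter().rev()));

    // Both layouts order the keys by ascending amount: 255, 256, 1024.
    let be_amounts: Vec<u64> = be_keys
        .iter()
        .map(|k| u64::from_be_bytes(k[1..9].try_into().unwrap()))
        .collect();
    let le_amounts: Vec<u64> = le_keys
        .iter()
        .map(|k| u64::from_le_bytes(k[4..12].try_into().unwrap()))
        .collect();
    assert_eq!(be_amounts, vec![255, 256, 1024]);
    assert_eq!(le_amounts, vec![255, 256, 1024]);
}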

douglaz commented 1 year ago

More generally we might want to think about supporting a "sort key" or something within a struct. We might run into this problem in the future with other structs.

You could do this at encoding time (easier/faster) or at comparison time (trickier/slower).

But note you can't just change the sorting after you've inserted the data without some sort of data migration.

m1sterc001guy commented 1 year ago

Hmm sorry I'm still not understanding. The prefixes we use today are all one byte, so big endian vs little endian shouldn't matter for the prefix, so I don't see why it's necessary to move the prefix to the end. I see in your PR you are creating a RocksDB iterator in the Reverse direction. Are you doing this because you need the notes in descending order? Is this why the prefix needs to move to the end? What am I missing?

douglaz commented 1 year ago

Hmm sorry I'm still not understanding. The prefixes we use today are all one byte, so big endian vs little endian shouldn't matter for the prefix, so I don't see why it's necessary to move the prefix to the end. I see in your PR you are creating a RocksDB iterator in the Reverse direction. Are you doing this because you need the notes in descending order? Is this why the prefix needs to move to the end? What am I missing?

If you write everything in big endian you can use the current field order and the default bytewise comparator, which should be something like:

fn be_comparator(a: &[u8], b: &[u8]) -> std::cmp::Ordering {
    a.cmp(b)
}

Because a will be something like:

[32, // 0x20 = DbKeyPrefix::Note
0, 0, 0, 0, 0, 0, 1, 0, // amount = 256 in big endian
<64 bytes of nonce in usual order>
]

And b may be something like:

[ 33, // 0x21 = DbKeyPrefix::OutputFinalizationData
<32 bytes of transaction id in usual order>
<8 bytes of out_idx in big endian>
]

If you want to use little endian, then the straightforward choice would be: A) Reverse the order of all fields during encoding, including the prefix. Besides encoding big integers in little endian, you also need to reverse the order of vectors as a general rule. So in this example, a:

[
<64 bytes of nonce in reverse order>
0, 1, 0, 0, 0, 0, 0, 0, // amount = 256 in little endian (reverse order)
32, // 0x20 = DbKeyPrefix::Note
]

b:

[
<8 bytes of out_idx in little endian (reverse order)>,
<32 bytes of transaction id in reverse order>,
33 // 0x21 = DbKeyPrefix::OutputFinalizationData
]

Then use as a comparator something like:

fn le_comparator(a: &[u8], b: &[u8]) -> std::cmp::Ordering {
    a.iter().rev().cmp(b.iter().rev())
}

B) Alternatively, if for some reason you want to, you can keep the DbKeyPrefix as is and only change the fields of the keys. You would have something like: a:

[
32, // 0x20 = DbKeyPrefix::Note
<64 bytes of nonce in reverse order>
0, 1, 0, 0, 0, 0, 0, 0, // amount = 256 in little endian (reverse order)
]

b:

[
33 // 0x21 = DbKeyPrefix::OutputFinalizationData
<8 bytes of out_idx in little endian (reverse order)>,
<32 bytes of transaction id in reverse order>,
]

Then the comparator needs to understand this mess:

fn le_comparator_that_understands_key_prefix(a: &[u8], b: &[u8]) -> std::cmp::Ordering {
    if a[0] != b[0] { // prefix byte type is out of order, check it first
        return a[0].cmp(&b[0]);
    }
    // now continue in the usual reverse order ignoring the prefix byte
    a[1..].iter().rev().cmp(b[1..].iter().rev())
}

Regarding the Forward/Reverse direction of the RocksDB iterator: don't get confused, it's unrelated to this particular discussion.

m1sterc001guy commented 1 year ago

Ok I see. Thank you for the explanation! It looks like it comes down to the cleanliness of the comparison operator, which wasn't initially clear to me. I don't actually think your last example is that terrible; we already have code that assumes the presence of a prefix (see from_bytes in fedimint-core/src/db/mod.rs). IMO it is better to write a somewhat uglier comparator and change one database structure than to change the entire layout of the database.

Also, we actually additionally append a fixed module byte and a 2-byte "module id" for database entries that are modularized, to disambiguate between different modules (the example above uses client data structures, which don't have these yet). I think this is fine though, since I don't think it makes sense to sort db structures across different modules.

douglaz commented 1 year ago

Ok I see. Thank you for the explanation! It looks like it comes down to the cleanliness of the comparison operator, which wasn't initially clear to me. I don't actually think your last example is that terrible; we already have code that assumes the presence of a prefix (see from_bytes in fedimint-core/src/db/mod.rs). IMO it is better to write a somewhat uglier comparator and change one database structure than to change the entire layout of the database.

Note that any solution requires a backwards-incompatible change. The simplest one IMHO is just to swap le functions for be functions, like I've already done here: https://github.com/fedimint/fedimint/pull/2088/files#diff-33eea2a08ddca078501b323f2038f6f0408f934c95406e533d474e4615c62136R132 (plus a few other places I haven't touched yet)

But going the little-endian way requires changing most macros and implementations of Encodable.

elsirion commented 1 year ago

The little/big endian discussion confuses me a lot; endianness typically only applies to single fields/integers, not whole structs.

When it comes to how our DB orders keys I always assumed lexicographic ordering of the serialized bytes, e.g.:

  1. 01 00
  2. 01 00 01
  3. 01 04
  4. 02 03
  5. 03 02
  6. 03 07

If that isn't the case we'd need to review a lot of our code for this assumption.

With that in mind having a key structure of <prefix> <amount> <remainder> would be sufficient. Notes would be selected as follows:

  1. Get an overview of the number of notes by iterating over all elements with prefix <prefix> and counting notes per denomination (only one key is held in memory at a time; the actual notes don't need to be parsed for this either, though it might need a special key struct)
  2. Select note denominations using the notes-per-denomination info
  3. For each selected denomination query DB entries with prefix <prefix> <amount> and collect the appropriate number of notes from the stream. Only the selected ones are deserialized.

This way we don't need to fetch+deserialize the whole wallet every time. Step 1 can be further optimized by caching the number of available notes per denomination.
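A rough sketch of how steps 1 and 2 could look, using hypothetical types and a plain key iterator in place of the real DB API (illustration only, not actual Fedimint code):

use std::collections::BTreeMap;

// Hypothetical aliases for illustration.
type Amount = u64;
type RawKey = Vec<u8>;

// Step 1: count notes per denomination while holding only one key at a time.
fn count_per_denomination(
    keys: impl Iterator<Item = RawKey>,
    amount_from_key: impl Fn(&RawKey) -> Amount,
) -> BTreeMap<Amount, u64> {
    let mut counts = BTreeMap::new();
    for key in keys {
        *counts.entry(amount_from_key(&key)).or_insert(0) += 1;
    }
    counts
}

// Step 2: decide how many notes of each denomination to use, largest first.
// Simplified greedy selection; handling a non-zero remainder (overshooting
// with an extra smaller note) is omitted for brevity.
fn select_denominations(
    counts: &BTreeMap<Amount, u64>,
    mut target: Amount,
) -> BTreeMap<Amount, u64> {
    let mut selection = BTreeMap::new();
    for (&denomination, &available) in counts.iter().rev() {
        let wanted = (target / denomination).min(available);
        if wanted > 0 {
            selection.insert(denomination, wanted);
            target -= wanted * denomination;
        }
    }
    selection
}

// Step 3 would then run one prefix query per selected denomination and
// deserialize only the first `wanted` notes from each stream.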

dpc commented 1 year ago

So the root issue is that Bitcoin consensus encoding is needlessly little endian and this breaks lexicographical sorting when we are using it for keys in the database? Or am I confused about anything?

douglaz commented 1 year ago

The little/big endian discussion confuses me a lot; endianness typically only applies to single fields/integers, not whole structs.

But if your struct has integers wider than 1 byte, then it matters: the way you serialize the integer may or may not preserve the lexicographic ordering of the serialized bytes.

When it comes to how our DB orders keys I always assumed lexicographic ordering of the serialized bytes, e.g.:

1. `01 00`

2. `01 00 01`

3. `01 04`

4. `02 03`

5. `03 02`

6. `03 07`

If that isn't the case we'd need to review a lot of our code for this assumption.

Yes, I also had this assumption, because sometimes the code looks like it relies on it, but after some testing I realized this isn't the case, due to integers being serialized as little endian.

With that in mind having a key structure of <prefix> <amount> <remainder> would be sufficient. Notes would be selected as follows:

1. Get an overview of the number of notes by iterating over all elements with prefix `<prefix>` and counting notes per denomination (only one key is held in memory at a time; the actual notes don't need to be parsed for this either, though it might need a special key struct)

2. Select note denominations using the notes-per-denomination info

3. For each selected denomination query DB entries with prefix `<prefix> <amount>` and collect the appropriate number of notes from the stream. Only the selected ones are deserialized.

This way we don't need to fetch+deserialize the whole wallet every time. Step 1 can be further optimized by caching the number of available notes per denomination.

It would be an interesting optimization to avoid deserializing the full note in some cases, but what you are describing isn't necessary for operations like select_notes if the keys have lexicographic ordering. See the PR.

There are some corner cases where the algorithm you are describing uses less memory than the one I've implemented. Without caching it will certainly be slower, but with caching it may be faster.
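For reference, the streaming selection in this PR boils down to roughly the following greedy loop over a stream of notes in descending denomination order (types simplified and hypothetical; the actual diff handles more details, such as overshooting when exact change isn't possible):

// Hypothetical simplified types for illustration.
type Amount = u64;
#[derive(Clone, Debug)]
struct Note;

// Greedily take notes from a stream ordered by descending denomination
// until the requested amount is covered. Returns None if it can't be covered.
fn select_notes(
    notes_desc: impl Iterator<Item = (Amount, Note)>,
    target: Amount,
) -> Option<Vec<(Amount, Note)>> {
    let mut selected = Vec::new();
    let mut remaining = target;
    for (denomination, note) in notes_desc {
        if remaining == 0 {
            break;
        }
        if denomination <= remaining {
            remaining -= denomination;
            selected.push((denomination, note));
        }
        // Denominations larger than the remaining target are skipped here;
        // only the notes that are actually needed get deserialized and kept.
    }
    (remaining == 0).then_some(selected)
}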

douglaz commented 1 year ago

So the root issue is that Bitcoin consensus encoding is needlessly little endian and this breaks lexicographical sorting when we are using it for keys in the database? Or am I confused about anything?

Yeah, using little endian on big integers breaks lexicographical sorting.

It is related because Fedimint is using little endian for its big integers, and this was probably inspired by Bitcoin. But note that, for instance, Amount is a Fedimint data structure, not a Bitcoin one. So we could serialize it in any suitable way.

In fact, AFAIK even Bitcoin uses big endian when storing keys in LevelDB. So you can't just say this is Bitcoin's fault.

douglaz commented 1 year ago

This little test may help visualize the issue. It's similar to the serialization code used in Fedimint. Note that the sorting will be done automatically by RocksDB when inserting/retrieving the serialized struct as a key; Amount may be a key or part of a key.

#[cfg(test)]
mod tests {
    #[derive(Debug, PartialEq, PartialOrd)]
    struct Amount(u64);
    impl Amount {
        fn serialize_be(&self) -> Vec<u8> {
            let mut data: Vec<u8> = Vec::new();
            data.extend(self.0.to_be_bytes());
            data
        }
        fn serialize_le(&self) -> Vec<u8> {
            let mut data: Vec<u8> = Vec::new();
            data.extend(self.0.to_le_bytes());
            data
        }
        fn deserialize_be(data: &Vec<u8>) -> Self {
            Self(u64::from_be_bytes(data.as_slice().try_into().unwrap()))
        }
        fn deserialize_le(data: &Vec<u8>) -> Self {
            Self(u64::from_le_bytes(data.as_slice().try_into().unwrap()))
        }
    }
    #[test]
    fn test() {
        let list = vec![Amount(255), Amount(256), Amount(1024)];
        let mut serialized_be = list
            .iter()
            .map(Amount::serialize_be)
            .collect::<Vec<Vec<u8>>>();
        // lexicographical sorting on serialized integers using big endian
        serialized_be.sort();
        println!("be: {:?}", serialized_be.iter().map(hex::encode).collect::<Vec<String>>());
        let list_from_be = serialized_be
            .iter()
            .map(Amount::deserialize_be)
            .collect::<Vec<Amount>>();
        // keeps ordering when deserializing
        assert_eq!(list_from_be, list);
        let mut serialized_le = list
            .iter()
            .map(Amount::serialize_le)
            .collect::<Vec<Vec<u8>>>();
        // lexicographical sorting on serialized integers using little endian
        serialized_le.sort();
        println!("le {:?}", serialized_le.iter().map(hex::encode).collect::<Vec<String>>());
        let list_from_le = serialized_le
            .iter()
            .map(Amount::deserialize_le)
            .collect::<Vec<Amount>>();
        // will be unable to keep ordering when deserializing
        assert_eq!(list_from_le, list); // fails :(
    }
}

It outputs:

be: ["00000000000000ff", "0000000000000100", "0000000000000400"]
le ["0001000000000000", "0004000000000000", "ff00000000000000"]
thread 'tests::test' panicked at 'assertion failed: `(left == right)`
  left: `[Amount(256), Amount(1024), Amount(255)]`,
 right: `[Amount(255), Amount(256), Amount(1024)]`', src/lib.rs:51:9

Note how the LE serialization sorts the list in the wrong order: [256, 1024, 255]

dpc commented 1 year ago

Yeah, so a lot of people complain about Bitcoin consensus encoding using little endian. Fedimint code copied rust-bitcoin's encoding code, AFAICT, but then we diverged completely due to run-time extensible module support.

It seems to me that we should just migrate to big-endian. I don't think anything is holding us back: we are never going to merge with Bitcoin consensus encoding anyway, due to module support, and it's kind of now or never. Soon we'll have backwards compatibility to deal with.

elsirion commented 1 year ago

Thx for the great explanation @douglaz, I get it now. I was under the faulty assumption that LE encoding would merely reverse the order, but as you point out it totally breaks it.

It seems to me that we should just migrate to big-endian.

I agree, just switching endianness for our own structs should be pretty painless and avoids completely changing the field order of all DB keys.

https://github.com/fedimint/fedimint/blob/5ef8158163c8cfa120b235e24c07999724bc33a0/fedimint-core/src/encoding/mod.rs#L128-L154
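For illustration, the big-endian variant of such an integer impl could look roughly like this (the trait shown here is a simplified stand-in, not the exact Fedimint Encodable signature):

use std::io::{self, Write};

// Simplified stand-in for the real Encodable trait.
trait Encodable {
    fn consensus_encode<W: Write>(&self, writer: &mut W) -> Result<usize, io::Error>;
}

impl Encodable for u64 {
    fn consensus_encode<W: Write>(&self, writer: &mut W) -> Result<usize, io::Error> {
        // Big endian keeps the lexicographic byte order consistent with numeric order.
        let bytes = self.to_be_bytes();
        writer.write_all(&bytes)?;
        Ok(bytes.len())
    }
}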

douglaz commented 1 year ago

It seems to me that we should just migrate to big-endian.

I agree, just switching endianness for our own structs should be pretty painless and avoids completely changing the field order of all DB keys.

I will open a new PR to address this change. I think it's a good idea to add tests checking the lexicographic sorting properties of the serialized data structures so we can rely on them in the future.
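One form such a test could take is a property test asserting that encoding preserves ordering, e.g. with proptest (just a sketch of the idea, not necessarily what the follow-up PR does):

use proptest::prelude::*;

proptest! {
    // For any two u64 values, comparing the big-endian encodings bytewise
    // must give the same result as comparing the numbers themselves.
    #[test]
    fn be_encoding_preserves_ordering(a in any::<u64>(), b in any::<u64>()) {
        prop_assert_eq!(a.cmp(&b), a.to_be_bytes().cmp(&b.to_be_bytes()));
    }
}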

m1sterc001guy commented 1 year ago

I agree, just switching endianness for our own structs should be pretty painless and avoids completely changing the field order of all DB keys.

It does unfortunately break backwards compatibility, as @douglaz has mentioned before. I assume we're ok with that, since doing this later will be a real pain. If we want, we can add a database migration to read using LE and write it back using BE.

At the very least, the database backups will need to be updated.

dpc commented 1 year ago

If we want, we can add a database migration to read using LE and write it back using BE.

This would be an extreme PITA, as I guess we would need a whole new set of Encodable traits, and then a copy of every data structure using one or the other, so we could write a migration that reads the old format and writes the new one.

At the very least, the database backups will need to be updated.

I would just delete and regenerate from scratch instead. :shrug:

douglaz commented 1 year ago

Replaced by https://github.com/fedimint/fedimint/pull/2183