Handling scalability of Output Data with directory structure

keithrfung commented 4 years ago

If we have a million or more encrypted ballots and we want "random access" to them (e.g., a web server that takes the ID of a ballot and returns the ciphertext), then we cannot simply write out a single, enormous JSON file with a list of the ballots. However, we could definitely write out one file per ballot.

If you have a million files in a single directory, things get chunky. Just getting a directory listing can be painful. The usual workaround is to use subdirectories. So, if the files were named with six base-10 digits (e.g., 123456.json), you could store them as 12/34/123456.json. That gives 100 top-level directories, each with 100 subdirectories, each holding at most 100 files.
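As a minimal sketch of that layout (the function name `sharded_path` and its `levels`/`width` parameters are hypothetical, not part of electionguard-python), the mapping from a ballot ID to its path could look like this:

```python
from pathlib import Path


def sharded_path(ballot_id: str, root: Path, levels: int = 2, width: int = 2) -> Path:
    """Map an ID such as "123456" to root/12/34/123456.json.

    Hypothetical helper: the leading `levels * width` characters of the ID
    become nested subdirectory names.
    """
    if len(ballot_id) < levels * width:
        raise ValueError(f"ballot id {ballot_id!r} is too short to shard")
    # Slice the leading characters into fixed-width directory names.
    parts = [ballot_id[i * width:(i + 1) * width] for i in range(levels)]
    return root.joinpath(*parts, f"{ballot_id}.json")
```

This assumes fixed-width, zero-padded numeric IDs. If ballot IDs are arbitrary strings, sharding on a prefix of a hash of the ID would probably spread files more evenly.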
If we stick with JSON for the on-disk representation, which is perfectly reasonable, we probably want to compress it. Python has support for several different compression algorithms (https://docs.python.org/3/library/archiving.html). Security issues could crop up here if the compression code is written in C, so some advance auditing would be relevant. On the other hand, if we go with a compact binary format (msgpack, protobufs, etc.), then the need for a separate compression step mostly goes away.
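For illustration only, here is one way to write and read a gzip-compressed JSON file per ballot using the standard library. The function names and the choice of gzip are assumptions, not a decision made in this thread:

```python
import gzip
import json
from pathlib import Path
from typing import Any, Dict


def write_ballot(path: Path, ballot: Dict[str, Any]) -> None:
    """Write one ballot as gzip-compressed JSON, creating parent directories."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as f:
        # Compact separators avoid spending bytes on whitespace.
        json.dump(ballot, f, separators=(",", ":"))


def read_ballot(path: Path) -> Dict[str, Any]:
    """Read a ballot previously written by write_ballot."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

Note that in CPython the `gzip` module sits on top of the C zlib library, so the auditing concern above applies to this sketch as well.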
So those JSON files have individual encrypted ballots. We still need all the metadata. That probably goes in a "main" JSON file of some sort, which then includes SHA-256 hashes of the individual encrypted ballot files. The main JSON file could itself then be digitally signed with conventional tools, or maybe the hash of the main file is published by the election officials and we're done, in which case there may be no need for digital signatures at all. As described, this is a cheesy two-level Merkle tree. If we were worried about the main file getting too big, then we could add another level. It's at best unclear whether we want/need a general-purpose Merkle tree implementation.
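A rough sketch of that two-level scheme follows. The manifest layout (`metadata` and `ballot_hashes` keys) and all function names are hypothetical, and it assumes the main file is stored outside the ballot directory so it does not hash itself:

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Dict


def sha256_file(path: Path) -> str:
    """Hex SHA-256 of a file, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path, metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Level 1: hash every ballot file under root, keyed by relative path."""
    hashes = {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*.json"))
    }
    return {"metadata": metadata, "ballot_hashes": hashes}


def manifest_digest(manifest: Dict[str, Any]) -> str:
    """Level 2: hash a canonical serialization of the main file.

    This is the single value election officials could publish.
    """
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_ballot(root: Path, manifest: Dict[str, Any], relative_path: str) -> bool:
    """Check one ballot file against the hash recorded in the manifest."""
    expected = manifest["ballot_hashes"].get(relative_path)
    return expected is not None and sha256_file(root / relative_path) == expected
```

A verifier who trusts the published digest can recompute it from the main file, then check any single ballot with `verify_ballot` without reading the other ballot files.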

Resolution of this issue should be paired with #96 to provide a lookup recordset.

keithrfung commented 4 years ago

Migrated from comment: https://github.com/microsoft/electionguard-python/issues/14#issuecomment-648372129

keithrfung commented 3 years ago

https://github.com/microsoft/electionguard-python/discussions/289

Election-Tech-Initiative / electionguard-python

Handling scalability of Output Data with directory structure #73