citp / BlockSci

A high-performance tool for blockchain science and exploration
https://citp.github.io/BlockSci/
GNU General Public License v3.0

Platform Independent Distribution of Data #2

Open hkalodner opened 7 years ago

hkalodner commented 7 years ago

Currently, BlockSci data files consist of the direct serialization of BlockSci's data structures from memory onto disk, in the exact layout the compiler assigns. This allows data to be loaded with near-zero processing, but it also makes distributing BlockSci's data files difficult, since the files depend on the many factors that can affect the memory layout of C++ classes.
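To make the layout dependence concrete, here is a minimal sketch. `PaddedRecord` and the helper names are hypothetical and are not BlockSci types; the point is only that a raw memory dump includes compiler-chosen padding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical record for illustration; not an actual BlockSci type.
struct PaddedRecord {
    std::uint8_t  type;    // 1 byte
    std::uint64_t value;   // 8 bytes, usually aligned to 8
    std::uint32_t height;  // 4 bytes
};

static_assert(std::is_trivially_copyable<PaddedRecord>::value,
              "raw byte copies are only valid for trivially copyable types");

// Bytes that carry information, independent of how the compiler lays them out.
constexpr std::size_t kPayloadBytes =
    sizeof(std::uint8_t) + sizeof(std::uint64_t) + sizeof(std::uint32_t);

// Padding the compiler inserted; the amount is implementation-defined.
inline std::size_t paddingBytes() {
    return sizeof(PaddedRecord) - kPayloadBytes;
}

// Copy the in-memory representation verbatim, padding included.
inline std::vector<unsigned char> dumpRaw(const PaddedRecord& record) {
    std::vector<unsigned char> bytes(sizeof(PaddedRecord));
    std::memcpy(bytes.data(), &record, sizeof(PaddedRecord));
    return bytes;
}

// Properties that hold everywhere; the exact sizes and offsets do not.
inline void exampleUsage() {
    const std::vector<unsigned char> raw = dumpRaw(PaddedRecord{1, 2, 3});
    assert(raw.size() == sizeof(PaddedRecord));
    assert(raw.size() == kPayloadBytes + paddingBytes());
}
```

On a typical x86-64 Linux build this struct occupies 24 bytes, of which only 13 are payload. Other ABIs may choose different offsets and a different total size, so a file written this way cannot be assumed to load correctly elsewhere.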

Creating a platform-independent intermediate data format would allow us to distribute our processed blockchain data, so that others could download it rather than having to run BlockSci's parser themselves.

Furthermore, incremental data updates could be posted, which would allow people to maintain fairly up-to-date copies of BlockSci blockchain data without running a full node.

mplattner commented 4 years ago

Solving this issue would significantly lower the barrier to entry for deploying and using BlockSci.

Thus, I tried to brainstorm what the possible issues could be, specifically regarding @hkalodner's "many factors that can affect memory layout of C++ classes". However, I can't come up with many factors, so any help is appreciated here.

Possible issues: struct alignment and padding, endianness, pointers, and the data structures of RocksDB and Google DenseHashMap

If supporting only common platforms is enough, most of the above can probably be solved rather easily; see the suggestion in the first bullet.
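One way to restrict distribution to compatible platforms (an assumption for illustration, not necessarily the suggestion referred to above) is to store a small layout fingerprint alongside the data and refuse to load files whose fingerprint differs. `ExampleRecord`, `LayoutFingerprint`, and the function names below are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical record standing in for a serialized struct; not a BlockSci type.
struct ExampleRecord {
    std::uint8_t  type;
    std::uint64_t value;
    std::uint32_t height;
};

// Properties that decide whether a raw dump of ExampleRecord can be reused.
struct LayoutFingerprint {
    std::uint8_t  littleEndian;
    std::uint8_t  pointerSize;
    std::uint32_t recordAlign;
    std::uint32_t recordSize;
    std::uint32_t valueOffset;
    std::uint32_t heightOffset;
};

// Detect byte order by inspecting the first byte of a known integer.
inline bool isLittleEndian() {
    const std::uint32_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Describe the layout produced by the compiler that built this binary.
inline LayoutFingerprint currentFingerprint() {
    LayoutFingerprint fp{};
    fp.littleEndian = isLittleEndian() ? 1 : 0;
    fp.pointerSize  = static_cast<std::uint8_t>(sizeof(void*));
    fp.recordAlign  = static_cast<std::uint32_t>(alignof(ExampleRecord));
    fp.recordSize   = static_cast<std::uint32_t>(sizeof(ExampleRecord));
    fp.valueOffset  = static_cast<std::uint32_t>(offsetof(ExampleRecord, value));
    fp.heightOffset = static_cast<std::uint32_t>(offsetof(ExampleRecord, height));
    return fp;
}

// Compare field by field so padding inside LayoutFingerprint never matters.
inline bool compatible(const LayoutFingerprint& a, const LayoutFingerprint& b) {
    return a.littleEndian == b.littleEndian && a.pointerSize == b.pointerSize &&
           a.recordAlign == b.recordAlign && a.recordSize == b.recordSize &&
           a.valueOffset == b.valueOffset && a.heightOffset == b.heightOffset;
}

// A file written by this build is loadable by this build; a file from a
// platform with the opposite byte order is rejected.
inline void exampleUsage() {
    const LayoutFingerprint local = currentFingerprint();
    LayoutFingerprint foreign = local;
    foreign.littleEndian = local.littleEndian ? 0 : 1;
    assert(compatible(local, local));
    assert(!compatible(local, foreign));
}
```

This only detects incompatibility; it does not remove it. It also covers a single struct, so a real check would need one entry per serialized type, and it says nothing about the RocksDB or DenseHashMap files.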

Another possible (maybe more elegant, but also more time-consuming) solution is to convert the files to a platform-neutral format such as Google's Protocol Buffers (protobuf) before distributing them. Something like blocksci_parser config.json export <path> to export data to protobuf format and blocksci_parser config.json import <path> to import distributed data.
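As a rough sketch of what such an export/import pair has to do, the following writes records field by field in a fixed little-endian byte order behind a small versioned header. It uses a hand-rolled encoding instead of protobuf to stay self-contained; `PortableRecord`, `kMagic`, `kVersion`, and the function names are hypothetical and not part of BlockSci.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <vector>

// Hypothetical record; not an actual BlockSci type.
struct PortableRecord {
    std::uint8_t  type;
    std::uint64_t value;
    std::uint32_t height;
};

constexpr std::uint32_t kMagic   = 0x42534349;  // hypothetical file identifier
constexpr std::uint32_t kVersion = 1;           // hypothetical format version

// Write an unsigned integer as sizeof(T) bytes, least significant byte first.
template <typename T>
void writeLE(std::ostream& out, T v) {
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        out.put(static_cast<char>((static_cast<std::uint64_t>(v) >> (8 * i)) & 0xFF));
    }
}

// Read sizeof(T) bytes, least significant byte first, regardless of host order.
template <typename T>
T readLE(std::istream& in) {
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        const std::istream::int_type byte = in.get();
        if (byte == std::istream::traits_type::eof()) {
            throw std::runtime_error("truncated input");
        }
        v |= static_cast<std::uint64_t>(byte & 0xFF) << (8 * i);
    }
    return static_cast<T>(v);
}

// Export: header (magic, version, count) followed by 13 bytes per record.
inline void exportRecords(std::ostream& out, const std::vector<PortableRecord>& records) {
    writeLE<std::uint32_t>(out, kMagic);
    writeLE<std::uint32_t>(out, kVersion);
    writeLE<std::uint64_t>(out, records.size());
    for (const PortableRecord& r : records) {
        writeLE<std::uint8_t>(out, r.type);
        writeLE<std::uint64_t>(out, r.value);
        writeLE<std::uint32_t>(out, r.height);
    }
}

// Import: validate the header, then rebuild native structs field by field.
inline std::vector<PortableRecord> importRecords(std::istream& in) {
    if (readLE<std::uint32_t>(in) != kMagic) throw std::runtime_error("bad magic");
    if (readLE<std::uint32_t>(in) != kVersion) throw std::runtime_error("bad version");
    const std::uint64_t count = readLE<std::uint64_t>(in);
    std::vector<PortableRecord> records;
    for (std::uint64_t i = 0; i < count; ++i) {
        PortableRecord r{};
        r.type   = readLE<std::uint8_t>(in);
        r.value  = readLE<std::uint64_t>(in);
        r.height = readLE<std::uint32_t>(in);
        records.push_back(r);
    }
    return records;
}

// Round trip through an in-memory stream.
inline void exampleUsage() {
    std::stringstream buffer;
    exportRecords(buffer, {{1, 5000000000ULL, 0}, {2, 42, 170}});
    const std::vector<PortableRecord> loaded = importRecords(buffer);
    assert(loaded.size() == 2);
    assert(loaded[0].value == 5000000000ULL);
}
```

The trade-off matches the one described above: the file no longer depends on padding or host byte order, but importing requires a conversion pass instead of loading the data with near-zero processing.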

This is just a first step toward solving this issue, and there are several open questions. E.g., should parsing locally still work for distributed (pre-parsed) BlockSci data? (This determines whether we need to ship the parser state.)