Platform Independent Distribution of Data

Currently, BlockSci data files consist of a direct serialization of BlockSci's in-memory data structures to disk, in the exact layout the compiler assigns. This allows data to be loaded with almost no processing, but it also makes BlockSci's data files difficult to distribute, since that layout depends on the many factors that can affect the memory layout of C++ classes.

Creating a platform-independent intermediate data format would allow us to distribute our processed blockchain data, so that others could download it rather than having to run BlockSci's parser themselves.

Furthermore, incremental data updates could be posted, which would allow people to maintain fairly up-to-date copies of BlockSci blockchain data without running a full node.

Solving this issue would significantly lower the barrier to entry for deploying and using BlockSci.

Thus, I tried to brainstorm what the possible issues could be, specifically regarding @hkalodner's "many factors that can affect memory layout of C++ classes". However, I can't come up with many factors, so any help is appreciated here.

Possible issues: struct alignment and padding, endianness, pointers, and the data structures of RocksDB and Google's DenseHashMap.

Struct alignment and padding (of memory-mapped files): I think this might be resolved by using struct packing, e.g. __attribute__((__packed__)). Packing alone causes a performance penalty on many platforms because fields can end up unaligned; this could be avoided by manually padding the structures (e.g. by inserting dummy variables) so that every field is naturally aligned on common 64-bit platforms. The data would then be portable while still being optimized (padded and aligned correctly) for common platforms.
Endianness: Most common architectures seem to use little-endian byte order, so this shouldn't be a problem. A check that warns the user about the "wrong" (big) endianness might be helpful on uncommon architectures.
Pointers: The memory-mapped files do not contain any (raw) pointers, so this should not be a problem.
Facebook's RocksDB: We need to check if the persistent files of RocksDB are platform-independent.
Google's DenseHashMap (parser only): We need to check if the serialized data using the built-in serializer is platform-independent.
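
The manual-padding idea from the struct alignment item could look roughly like the following. This is only a sketch: ExampleRecord and its fields are hypothetical and not an actual BlockSci type. The padding is spelled out as a named field, and static_assert checks make the build fail on any platform where the compiler would choose a different layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical on-disk record (not a real BlockSci structure).
// Fields are ordered from largest to smallest and use fixed-width types,
// so no implicit padding is needed on common 64-bit platforms.
struct ExampleRecord {
    std::uint64_t value;     // offset 0
    std::uint32_t height;    // offset 8
    std::uint16_t type;      // offset 12
    std::uint8_t  flags;     // offset 14
    std::uint8_t  reserved;  // offset 15: explicit padding, always written as 0
};

// The record must be safe to copy byte-for-byte to and from a mapped file.
static_assert(std::is_trivially_copyable<ExampleRecord>::value,
              "record must be trivially copyable");
static_assert(std::is_standard_layout<ExampleRecord>::value,
              "record must have standard layout");

// Compile-time layout checks: a platform with different padding rules
// fails to build instead of silently misreading the data files.
static_assert(sizeof(ExampleRecord) == 16, "unexpected record size");
static_assert(offsetof(ExampleRecord, value) == 0, "unexpected offset");
static_assert(offsetof(ExampleRecord, height) == 8, "unexpected offset");
static_assert(offsetof(ExampleRecord, type) == 12, "unexpected offset");
static_assert(offsetof(ExampleRecord, flags) == 14, "unexpected offset");
static_assert(offsetof(ExampleRecord, reserved) == 15, "unexpected offset");
```

With checks like these, a packing attribute becomes a safety net rather than the thing that defines the layout, so field accesses stay aligned.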

If supporting only common platforms is enough, most of the above can probably be solved rather easily; see the struct-packing suggestion above.
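
If only little-endian platforms are supported, the endianness warning mentioned above could be a small runtime check like this. It is a minimal sketch; the function names isLittleEndian and byteOrderWarning are made up for illustration and do not exist in BlockSci.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Returns true if the least significant byte of a multi-byte integer
// is stored at the lowest address, i.e. the host is little-endian.
bool isLittleEndian() {
    const std::uint32_t probe = 1;
    unsigned char lowestAddressByte = 0;
    std::memcpy(&lowestAddressByte, &probe, 1);
    return lowestAddressByte == 1;
}

// Returns an empty string on little-endian hosts, and a warning message
// on any other host, where distributed data files could not be read as-is.
std::string byteOrderWarning() {
    if (isLittleEndian()) {
        return std::string();
    }
    return "warning: this host is not little-endian; "
           "distributed data files assume little-endian byte order";
}
```

A loader could call byteOrderWarning() once before mapping any file and print or abort on a non-empty result.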

Another possible (maybe more elegant, but also more time-consuming) solution is to convert the files to a platform-neutral format such as Google's Protocol Buffers (protobuf) before distributing them. This could look something like blocksci_parser config.json export <path> to export data to protobuf format and blocksci_parser config.json import <path> to import distributed data.
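
To illustrate what such an export/import step has to do, here is a sketch that writes a record field by field in a fixed little-endian byte order, so the result does not depend on the host's struct layout or endianness. Note that this is a hand-rolled stand-in used for illustration, not protobuf; ExampleOutput, exportOutput and importOutput are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical record to be distributed (not a real BlockSci structure).
struct ExampleOutput {
    std::uint64_t value;
    std::uint32_t scriptNum;
};

// Append an unsigned integer to the buffer, least significant byte first.
template <typename T>
void putLE(std::vector<unsigned char>& out, T v) {
    static_assert(std::is_unsigned<T>::value, "unsigned integers only");
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        out.push_back(static_cast<unsigned char>((v >> (8 * i)) & 0xFF));
    }
}

// Read an unsigned integer stored least significant byte first,
// advancing pos past the bytes that were consumed.
template <typename T>
T getLE(const std::vector<unsigned char>& in, std::size_t& pos) {
    static_assert(std::is_unsigned<T>::value, "unsigned integers only");
    assert(pos + sizeof(T) <= in.size());
    T v = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        v |= static_cast<T>(static_cast<T>(in[pos + i]) << (8 * i));
    }
    pos += sizeof(T);
    return v;
}

// Export: in-memory record -> platform-neutral bytes (12 bytes, no padding).
std::vector<unsigned char> exportOutput(const ExampleOutput& o) {
    std::vector<unsigned char> buf;
    putLE(buf, o.value);
    putLE(buf, o.scriptNum);
    return buf;
}

// Import: platform-neutral bytes -> in-memory record in the host's layout.
ExampleOutput importOutput(const std::vector<unsigned char>& buf) {
    std::size_t pos = 0;
    ExampleOutput o;
    o.value = getLE<std::uint64_t>(buf, pos);
    o.scriptNum = getLE<std::uint32_t>(buf, pos);
    return o;
}
```

The trade-off compared with the memory-mapped approach is that import has to touch every field once, after which the data can be written back out in the host's native layout and loaded as before.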

This is just a first step towards solving this issue, and there are several open questions. For example, should parsing locally still work for distributed (pre-parsed) BlockSci data? (This determines whether we need to ship the parser state.)
