Open boshmaf opened 5 years ago
Not an answer to your question, but have you considered implementing the analysis in C++? I'd assume that to be a lot faster than doing the analysis using the Python interface.
Hello @maltemoeser. First of all, thanks for the wonderful system. BlockSci is useful in so many ways for what we're doing here at CIBR. We're building an open-source stack on top of BlockSci to enable off/on-chain analytics in a more usable and systematic way. Some of the requirements for this stack are tagging, linking, and searching addresses, txes, blocks, and chains through both a search-engine-style interface and a lower-level query language that uses the BlockSci library directly.
As a proof-of-concept, we used the Python interface to meet some of these requirements. The PoC met the functional requirements but not the performance-related ones. We're migrating our design to C++ and Go, but for the sake of the PoC, I was wondering if there's a design pattern that is generally recommended for data-heavy workloads like the one in the example on a single machine.
As for this question, I'd like to keep it focused on the Python interface, because I know many people use it for rapid prototyping or running experiments as part of their academic research (which we do too). I'd love to chat with the BlockSci team to share more details about what we're doing, as it might be interesting to you or be part of your roadmap anyway. Thanks!
System Information
Using AMI: No
BlockSci version: 0.5.0
Blockchain: Bitcoin
Parser: Disk
Total memory: 256 GB
CPU count: 96
I have a general question about the best way to run a function that consumes BlockSci objects in parallel. We use BlockSci in our research, which is mostly off/on-chain analytics for security and privacy applications. I'm aware of the Map/Reduce interface, but it can be restrictive when you need to pass along other data or use a different computation model (e.g., a graph).
Example
In other words, we want to find all txes in X such that each tx either (1) has a block time earlier than that of every tx in Y, or, failing that, (2) shares no input or output address in A with any tx in Y whose block time is earlier than its own.
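To make the condition above concrete, here is a minimal sequential sketch. It does not use the BlockSci API: `Tx` is a hypothetical stand-in that carries only an id, a block time, and the set of input/output addresses, and the sample data is invented for illustration. The second variant is not something from the original post; it precomputes, for each address in A, the earliest block time at which a tx in Y touches it, so each tx in X is checked against its own addresses rather than against all of Y.

```python
from collections import namedtuple

# Hypothetical stand-in for a BlockSci transaction: only the fields the
# condition needs (an id, a block time, the input/output address set).
Tx = namedtuple("Tx", ["txid", "block_time", "addresses"])


def keep_naive(tx, ys, a):
    """Direct reading of the condition: scans all of ys for every tx."""
    tagged = tx.addresses & a
    for y in ys:
        if y.block_time < tx.block_time and tagged & y.addresses:
            return False
    return True


def earliest_seen(ys, a):
    """Map each address in `a` to the earliest block time a tx in ys uses it."""
    earliest = {}
    for y in ys:
        for addr in y.addresses & a:
            if addr not in earliest or y.block_time < earliest[addr]:
                earliest[addr] = y.block_time
    return earliest


def keep_indexed(tx, earliest):
    """Same answer as keep_naive, using the precomputed index."""
    return all(
        earliest.get(addr, tx.block_time) >= tx.block_time
        for addr in tx.addresses
    )


# Invented sample data.
A = {"a1", "a2"}
Y = [Tx("y1", 10, frozenset({"a1", "z"})), Tx("y2", 20, frozenset({"a2"}))]
X = [
    Tx("x1", 5, frozenset({"a1"})),   # earlier than every tx in Y
    Tx("x2", 15, frozenset({"a1"})),  # y1 is earlier and shares a1
    Tx("x3", 15, frozenset({"a2"})),  # y1 is earlier but shares nothing in A
    Tx("x4", 30, frozenset({"z"})),   # shares z with y1, but z is not in A
]

naive = [t.txid for t in X if keep_naive(t, Y, A)]
index = earliest_seen(Y, A)
indexed = [t.txid for t in X if keep_indexed(t, index)]
```

If this reading of the condition is right, the indexed variant needs one pass over Y and one over X instead of comparing every pair, which may matter more than the degree of parallelism.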
Here's one way to do this in parallel:
On the system described above, this took about 1h30m to run on 92 cores, with len(A)=15K addresses, len(X)=200K txes, and len(Y)=350K txes.
You will notice that I'm unpickling all BlockSci objects locally for each process to avoid sharing them. This of course consumes more memory, which is becoming a bottleneck in this case (usually, our computations are CPU-bound).
Is this the right way to do this? Is there a more efficient way to do this (in terms of both space and time)? Any general guidelines, especially BlockSci-specific, would be great. Thanks!
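One pattern that may reduce the per-task unpickling cost described above is to load the data once per worker process through a pool initializer and to send only integer ranges as tasks. The sketch below is an assumption-laden illustration using only the standard library: `load_dataset` is a hypothetical placeholder for whatever opens the chain and resolves objects inside the worker, and its return values are invented sample data, not BlockSci objects.

```python
import multiprocessing as mp

# Per-process state, filled once by the initializer instead of once per task.
_STATE = {}


def load_dataset(path):
    """Hypothetical loader; a real one would open the chain inside the worker.

    Returns (xs, earliest): xs is a list of (block_time, addresses) pairs and
    earliest maps a tagged address to the earliest block time it appears in Y.
    """
    xs = [(5, {"a1"}), (15, {"a1"}), (15, {"a2"}), (30, {"z"})]
    earliest = {"a1": 10, "a2": 20}
    return xs, earliest


def init_worker(path):
    """Runs once in every worker process; heavy objects never cross the pipe."""
    _STATE["xs"], _STATE["earliest"] = load_dataset(path)


def check_range(bounds):
    """Task payload is two integers; the result is a list of kept positions."""
    start, stop = bounds
    xs, earliest = _STATE["xs"], _STATE["earliest"]
    kept = []
    for i in range(start, stop):
        block_time, addresses = xs[i]
        if all(earliest.get(a, block_time) >= block_time for a in addresses):
            kept.append(i)
    return kept


def make_ranges(n, size):
    """Split range(n) into consecutive (start, stop) chunks of at most `size`."""
    return [(i, min(i + size, n)) for i in range(0, n, size)]


def parallel_filter(path, n, workers, chunk=1000):
    """Fan index ranges out to a pool whose workers loaded the data themselves."""
    with mp.Pool(workers, initializer=init_worker, initargs=(path,)) as pool:
        parts = pool.map(check_range, make_ranges(n, chunk))
    return [i for part in parts for i in part]
```

Called from under an `if __name__ == "__main__":` guard, `parallel_filter("unused", 4, workers=2, chunk=2)` returns the positions of the kept transactions. Each worker still holds its own copy of the data, so this trims serialization overhead rather than total memory.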