citp / BlockSci

A high-performance tool for blockchain science and exploration
https://citp.github.io/BlockSci/
GNU General Public License v3.0
1.34k stars 260 forks source link

Heuristics questions #433

Open barakshani opened 4 years ago

barakshani commented 4 years ago

I've started exploring BlockSci, first with V0.5.1, and now upgraded to V0.7.0. I have several questions about the (change) heuristics. My focus is using the ClusterManager on the entire blockchain (create clustering), not running the heuristics on some given transaction.

  1. Is it possible, using the Python interface, to apply more complex logic on the different heuristics? For example, if the legacy heuristic returns a single change address use it, otherwise if there was no fresh address apply the power of ten heuristic, and if there was no optimal change, then apply the address type heuristic?

  2. In particular, I'd like to run the legacy heuristic but exclude address reuse, so I'd like the heuristic to first check if there is address reuse - if so, return it as change (actually put it in the same cluster as the input, which is already the case) - and if not, then to run the legacy heuristic. At the moment I'm running it using the "and" and "or" operators. Is there a reason it is not already part of the legacy (Client Behavior) heuristic?

  3. It seems that the Client Change Address Behavior only looks if the address was fresh at the time of the transaction, but ignores whether the address was reused at some later point. As this heuristic is classified as "dynamic", I thought that this is how it works. Is there a way to make it only return addresses that appear only once throughout the entire history?

  4. More generally, it is not very clear what is the actual implication of a heuristic being dynamic. The documentations says that "If outputs of a transaction are spent" -- is this the only thing that it checks (as indicated under locktime and peeling change)? It is also not very clear if change.spent can only be combined with dynamic heuristics. Lastly, if the locktime heuristic only returns unspent outputs ("This heuristic will return unspent outputs as potential candidates."), then the example given at the very end: blocksci.heuristics.change.locktime & blocksci.heuristics.change.spent will always be empty.

  5. I couldn't find the exact definition of the peeling change heuristic. What would that be?

BTW, if not specifying change.unique, would running the ClusterManager on the entire blockchain might classify several potential change addresses as a single cluster?

maltemoeser commented 4 years ago

To apply more complex logic I'd suggest to implement it in C++ and then expose in the Python interface (you could even just modify an existing function), the Python interface is somewhat limited in its ability to allow you to specify such complex logic.

You're correct about the mislabeling of the "fresh address" heuristic, it indeed does not depend on whether the output is being spent, just whether the address appeared for the first time. I'll fix that in the docs. Dynamic was meant to imply that it depends on the output being spent by another transaction. The locktime heuristic would also (not only) return any unspent outputs.

Yes, if you don't use unique change then it can return multiple candidates. In general, these functions often encode simple heuristics and I would caution against simply applying them to a full blockchain clustering.