citp / BlockSci

A high-performance tool for blockchain science and exploration
https://citp.github.io/BlockSci/
GNU General Public License v3.0

Filtering addresses using 'where' #375

Closed bhemen closed 4 years ago

bhemen commented 4 years ago

I'm getting inconsistent results when filtering addresses with 'where'.

For example, suppose I want to get the list of addresses that have exactly one input and one output. I tried

addresses = chain.addresses(blocksci.address_type.pubkey).where( lambda a: a.in_txes_count() == 1 and a.out_txes_count() == 1 )

The code above runs, but gives very different results from

all_addresses = chain.addresses(blocksci.address_type.pubkey)
LC_addresses = [a for a in all_addresses if a.in_txes_count() == 1 and a.out_txes_count() == 1]

I would have expected them to give essentially the same results.

The where query seems to ignore my filter: the range it returns has the same size no matter which conditions I pass (see below).

Reproduction Steps

import blocksci
from timeit import default_timer as timer

#V0.6
parser_cfg="./blocksci.cfg"
chain = blocksci.Blockchain(parser_cfg)

all_addresses = chain.addresses(blocksci.address_type.pubkey)

start_time = timer()
LC_addresses = [a for a in all_addresses if a.in_txes_count() == 0 and a.out_txes_count() == 1]
end_time = timer()
print( f"{len(list(LC_addresses))} addresses with 1 input and 0 outputs (list comprehension)" )
print( f"Elapsed time: {end_time - start_time} seconds")
print( f"=======================" )

start_time = timer()
addresses = chain.addresses(blocksci.address_type.pubkey).where( lambda a: a.in_txes_count() == 0 and a.out_txes_count() == 1 )
print( f"{len(list(addresses))} addresses with 1 input and 0 outputs" )
end_time = timer()
print( f"Elapsed time: {end_time - start_time} seconds")
print( f"=======================" )

start_time = timer()
LC_addresses = [a for a in all_addresses if a.in_txes_count() == 1 and a.out_txes_count() == 1]
end_time = timer()
print( f"{len(list(LC_addresses))} addresses with 1 input and 1 output (list comprehension)" )
print( f"Elapsed time: {end_time - start_time} seconds")
print( f"=======================" )

start_time = timer()
addresses = chain.addresses(blocksci.address_type.pubkey).where( lambda a: a.in_txes_count() == 1 and a.out_txes_count() == 1 )
print( f"{len(list(addresses))} addresses with 1 input and 1 output" )
end_time = timer()
print( f"Elapsed time: {end_time - start_time} seconds")
print( f"=======================" )

start_time = timer()
LC_addresses = [a for a in all_addresses if a.in_txes_count() <= 1 and a.out_txes_count() == 1]
end_time = timer()
print( f"{len(list(LC_addresses))} addresses with 1 input and at most 1 output (list comprehension)" )
print( f"Elapsed time: {end_time - start_time} seconds")
print( f"=======================" )

start_time = timer()
addresses = chain.addresses(blocksci.address_type.pubkey).where( lambda a: a.in_txes_count() <= 1 and a.out_txes_count() == 1 )
print( f"{len(list(addresses))} addresses with 1 input and at most 1 output" )
end_time = timer()
print( f"Elapsed time: {end_time - start_time} seconds")
print( f"=======================" )

The outputs are

38087 addresses with 1 input and 0 outputs (list comprehension)
Elapsed time: 14754.32744676806 seconds
=======================
215429 addresses with 1 input and 0 outputs
Elapsed time: 1826.0210773628205 seconds
=======================
177321 addresses with 1 input and 1 output (list comprehension)
Elapsed time: 6544.447323760018 seconds
=======================
215429 addresses with 1 input and 1 output
Elapsed time: 1834.473372163251 seconds
=======================
215408 addresses with 1 input and at most 1 output (list comprehension)
Elapsed time: 12175.197346912697 seconds
=======================
215429 addresses with 1 input and at most 1 output
Elapsed time: 1828.7590126655996 seconds
=======================

Notice that all the where queries seem to return the same number of results. Is this the correct way to use 'where'?

Thanks for your help.

System Information

Using AMI: No
BlockSci version: 0.6 (fc34ac3)
Blockchain: Bitcoin
Parser: Disk
Total memory: 128 GB

maltemoeser commented 4 years ago

Unfortunately, the logical operators and, or and not don't work with the proxy interface. You need to use the bitwise operators &, | and ~ instead, with each comparison wrapped in parentheses.

So, instead of

chain.addresses(blocksci.address_type.pubkey).where(lambda a: a.in_txes_count() == 0 and a.out_txes_count() == 1)

use

chain.addresses(blocksci.address_type.pubkey).where(lambda a: (a.in_txes_count() == 0) & (a.out_txes_count() == 1))
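
The difference between the two forms can be illustrated without BlockSci. The sketch below uses a hypothetical `Proxy` class (an assumption for illustration, not BlockSci's actual implementation) that records expressions instead of evaluating them. Python cannot overload `and`: it tests the truthiness of the left operand and, if that is truthy, returns the right operand. An object that builds expressions is truthy by default, so the first condition is silently dropped. `&` can be overloaded through `__and__`, so both conditions survive.

```python
# Hypothetical stand-in for a proxy expression; NOT BlockSci's real class.
class Proxy:
    """Records an expression as a string instead of evaluating it."""

    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        # The comparison builds a larger expression rather than returning a bool.
        return Proxy(f"({self.expr} == {other!r})")

    def __and__(self, other):
        # `&` is overloadable, so both operands end up in the result.
        return Proxy(f"({self.expr} & {other.expr})")


ins = Proxy("in_txes_count")
outs = Proxy("out_txes_count")

# `and` checks the truthiness of the left Proxy (truthy by default)
# and returns the right operand, discarding the left condition.
with_and = (ins == 0) and (outs == 1)

# `&` calls Proxy.__and__, which combines both conditions.
with_amp = (ins == 0) & (outs == 1)

print(with_and.expr)  # (out_txes_count == 1)
print(with_amp.expr)  # ((in_txes_count == 0) & (out_txes_count == 1))
```

The parentheses matter as well: `&` binds more tightly than `==`, so `x == 0 & y == 1` is parsed as the chained comparison `x == (0 & y) == 1` rather than as two separate conditions.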

maltemoeser commented 4 years ago

import blocksci
chain = blocksci.Blockchain("/blocksci/testchain.json")
len(chain)

250000

chain.addresses(blocksci.address_type.pubkey).where(lambda a: a.in_txes_count() == 0 and a.out_txes_count() == 1).size

194169

chain.addresses(blocksci.address_type.pubkey).where(lambda a: (a.in_txes_count() == 0) & (a.out_txes_count() == 1)).size

38247

all_addresses = chain.addresses(blocksci.address_type.pubkey)
filtered_addresses = [a for a in all_addresses if a.in_txes_count() == 0 and a.out_txes_count() == 1]
len(filtered_addresses)

38247
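
The counts in the original report fit this explanation. As a quick arithmetic check, assuming each `where(... and ...)` query effectively applied only its last condition (`out_txes_count() == 1`), the list-comprehension results should sum as follows. All numbers are copied from the report above; the interpretation in the comments is an assumption.

```python
# Counts copied from the list-comprehension results in the original report.
in0_out1 = 38087       # in_txes_count() == 0 and out_txes_count() == 1
in1_out1 = 177321      # in_txes_count() == 1 and out_txes_count() == 1
in_le1_out1 = 215408   # in_txes_count() <= 1 and out_txes_count() == 1

# Count returned by every `where(... and ...)` query in the report.
where_result = 215429

# The first two cases partition the third, so their counts must add up.
assert in0_out1 + in1_out1 == in_le1_out1

# Assumption: `where` kept only `out_txes_count() == 1`. The remainder
# would then be addresses with out_txes_count() == 1 and in_txes_count() > 1.
remaining = where_result - in_le1_out1
print(remaining)  # 21
```

A constant result of 215429 across all three `and` queries is what one would expect if only the shared second condition was ever evaluated.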