ZuInnoTe / hadoopcryptoledger

Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive
Apache License 2.0

Timestamp based BTC Transaction Graph #63

Closed ghost closed 5 years ago

ghost commented 5 years ago

Hello,

I am using the scala-spark-graphx-bitcointransaction example to construct the graph of the whole blockchain. I need to split the graph per year and then per month, so I assigned the timestamp to each transaction (on extraction) and converted it to a date. The problem is that when the example code creates the edges, I observe something that looks weird.

The format is: ((ByteArray, Idx), ((srcIdx,srcAddress), (destIdx,destAddress))), like in the example, just with Year and Month added.

((org.zuinnote.spark.bitcoin.ByteArray@a744f541,0,2011,1),
(((148451,2011,4),bitcoinaddress_A40BF9DE1B13837CFF147D7A9D12DDD6354F08D2),
((197663,2011,1),bitcoinaddress_F2BB12CEBD45FD0E1FCF0335B29D77BCEE480D6C)))

((org.zuinnote.spark.bitcoin.ByteArray@a744f541,0,2011,1),
(((31211,2010,11),bitcoinaddress_A40BF9DE1B13837CFF147D7A9D12DDD6354F08D2),
((197663,2011,1),bitcoinaddress_F2BB12CEBD45FD0E1FCF0335B29D77BCEE480D6C)))

As you can see, the source and destination year and month do not match: in the first case the source is 4/2011 and the destination is 1/2011. Is that normal? Apart from that, the first parenthesis (ByteArray, Idx) matches exactly the (destIdx, destAddress). Am I doing something wrong?

jornfranke commented 5 years ago

Do you also join on the date? Maybe you can share some source code?

ghost commented 5 years ago

Sure, actually it's your code with some modifications. But why do I have to join on the date? I mean, the code is the same, so the result should be the same as with your code, even with the timestamp.

val bitcoinBlocksRDD = sc.newAPIHadoopFile(inputFile, classOf[BitcoinBlockFileInputFormat], classOf[BytesWritable], classOf[BitcoinBlock],hadoopConf)

// Extract a tuple per transaction containing the Bitcoin destination address, the input transaction hash, the input transaction output index, the current transaction hash, the current transaction output index, and the block timestamp
val btcTuples = bitcoinBlocksRDD.flatMap(hadoopKeyValueTuple => extractTransactionData(hadoopKeyValueTuple._2))
val rowRDD = btcTuples.map(p => Row(p._1, p._2, p._3, p._4, p._5, p._6))

val transactionSchema = StructType(
  Array(
    StructField("dest_address", StringType, true),
    StructField("curr_trans_input_hash", BinaryType, false),
    StructField("curr_trans_input_output_idx", LongType, false),
    StructField("curr_trans_hash", BinaryType, false),
    StructField("curr_trans_output_idx", LongType, false),
    StructField("timestamp", IntegerType, false)
  )
)

val btcDF = sqlContext.createDataFrame(rowRDD, transactionSchema)
val btcDF_Date_Timestamp = btcDF.select($"dest_address", $"curr_trans_input_hash", $"curr_trans_input_output_idx", $"curr_trans_hash", $"curr_trans_output_idx", $"timestamp").
  withColumn("realTime", to_date(from_unixtime($"timestamp"))).
  withColumn("year", year($"realTime")).
  withColumn("month", month($"realTime"))

val btcDF_Date = btcDF_Date_Timestamp.drop("timestamp").drop("realTime")

// Create the graph
// Each row contains the Bitcoin destination address, the input transaction hash, the input transaction output index, the current transaction hash, the current transaction output index, and the year and month derived from the block timestamp
val bitcoinTransactionTuples = btcDF_Date.rdd

// ((Bitcoin destination address, Year, Month), vertexId)
val bitcoinAddressInd = bitcoinTransactionTuples.map(bitcoinTransactions =>(bitcoinTransactions(0), bitcoinTransactions(5),bitcoinTransactions(6))).distinct().zipWithIndex()

// create the vertices (Bitcoin destination address, (vertexId, Year, Month)); keep in mind that the flat table contains the same bitcoin address several times
val bitcoinAddressIndexed = bitcoinAddressInd.map(btcTrans => (btcTrans._1._1, (btcTrans._2, btcTrans._1._2, btcTrans._1._3)))

// Create edges
// This is basically a self join, where ((currentTransactionHash,currentOutputIndex), identifier) is joined with ((inputTransactionHash,currentInputIndex), identifier)

// (bitcoinAddress,(byteArrayTransaction, TransactionIndex, Year, Month))       
val inputTransactionTuple =  bitcoinTransactionTuples.map(bitcoinTransactions => (bitcoinTransactions(0),(new ByteArray(serialise(bitcoinTransactions(1))),bitcoinTransactions(2),bitcoinTransactions(5),bitcoinTransactions(6))))

// (bitcoinAddress,((byteArrayTransaction, TransactionIndex, Year, Month),(vertexId, Year, Month)))
val inputTransactionTupleWithIndex = inputTransactionTuple.join(bitcoinAddressIndexed)

// ((byteArrayTransaction, TransactionIndex, Year, Month), ((vertexId, Year, Month), bitcoinAddress))
val inputTransactionTupleByHashIdx = inputTransactionTupleWithIndex.map(iTTuple => (iTTuple._2._1,(iTTuple._2._2,iTTuple._1)))

val currentTransactionTuple =  bitcoinTransactionTuples.map(bitcoinTransactions => (bitcoinTransactions(0),(new ByteArray(serialise(bitcoinTransactions(3))),bitcoinTransactions(4),bitcoinTransactions(5),bitcoinTransactions(6))))
val currentTransactionTupleWithIndex = currentTransactionTuple.join(bitcoinAddressIndexed)
val currentTransactionTupleByHashIdx = currentTransactionTupleWithIndex.map{cTTuple => (cTTuple._2._1,(cTTuple._2._2,cTTuple._1))}

// Edge(joinTuple._2._1._1, joinTuple._2._2._1) [srcIdx, destIdx]
// the join creates ((ByteArray, Idx), ((srcIdx,srcAddress), (destIdx,destAddress)))
val joinedTransactions = inputTransactionTupleByHashIdx.join(currentTransactionTupleByHashIdx)  

jornfranke commented 5 years ago

Sorry, yes, you are right, you only convert it. Let me look a little bit more at the code; I hope to find an answer tomorrow.

ghost commented 5 years ago

Sure, yes, no problem! By the way, I have printed only 5 edges (the other 3 look fine), and I have used only blk00000.dat.

jornfranke commented 5 years ago

Some analysis: I checked the following on blockchain.info (maybe it helps with the further explanation):

((org.zuinnote.spark.bitcoin.ByteArray@a744f541,0,2011,1), (((148451,2011,4),bitcoinaddress_A40BF9DE1B13837CFF147D7A9D12DDD6354F08D2), ((197663,2011,1),bitcoinaddress_F2BB12CEBD45FD0E1FCF0335B29D77BCEE480D6C)))

The first address, A40BF9DE1B13837CFF147D7A9D12DDD6354F08D2, has several transactions around 20-04-2011: https://www.blockchain.com/btc/address/1FxQAa5qtxFNuMSUe3yMsDqu6LeVyWyviN

The second address has two transactions in a block from 30-01-2011: https://www.blockchain.com/btc/address/1P8SdWtZUh77JzH6G8YA7mHx6PfbxR3vXx

Hence, your goal is to identify 30-01-2011.

I am a little bit confused about the first part of your printed output (highlighted in bold): ((org.zuinnote.spark.bitcoin.ByteArray@a744f541,0,2011,1), (((148451,2011,4),bitcoinaddress_A40BF9DE1B13837CFF147D7A9D12DDD6354F08D2), ((197663,2011,1),bitcoinaddress_F2BB12CEBD45FD0E1FCF0335B29D77BCEE480D6C))) - do you know what this is about? It looks correct.

That topic is a bit complicated, due to the Spark GraphX Scala code (not yours in particular), which is not easily readable, and due to the complexity of the Bitcoin blockchain. The original example creates a graph where the nodes are Bitcoin addresses and the edges represent transfers between addresses.

Some potential issue: what could happen in your case is that the join is mixed up due to the additional data: https://spark.apache.org/docs/2.1.1/api/java/org/apache/spark/rdd/PairRDDFunctions.html#join(org.apache.spark.rdd.RDD)

As far as I understood from your code comments:

// ((byteArrayTransaction, TransactionIndex, Year, Month), ((vertexId, Year, Month), bitcoinAddress))
val inputTransactionTupleByHashIdx = inputTransactionTupleWithIndex.map(iTTuple => (iTTuple._2._1, (iTTuple._2._2, iTTuple._1)))

That means the data is joined on (byteArrayTransaction, TransactionIndex, Year, Month), but it should only be joined on (byteArrayTransaction, TransactionIndex). I am not sure why it spills out exactly your output, which roughly makes sense (maybe your other code provides some explanation, or I need to check in more detail what is going wrong with your code).

Some potential solution: I assume you want to have the date later in the graph on the edges. So instead of

// ((byteArrayTransaction, TransactionIndex, Year, Month), ((vertexId, Year, Month), bitcoinAddress))

you should have

// ((byteArrayTransaction, TransactionIndex), (vertexId, (Year, Month, bitcoinAddress)))

for both inputTransaction and currentTransaction.
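
Roughly, and untested, reusing the variable names from your snippet above, the re-keying could look like this:

// untested sketch: key only on (byteArrayTransaction, TransactionIndex),
// everything else (vertexId, Year, Month, address) moves into the value
val inputTransactionTupleByHashIdx = inputTransactionTupleWithIndex.map {
  case (address, ((byteArray, txIdx, year, month), (vertexId, _, _))) =>
    ((byteArray, txIdx), (vertexId, (year, month, address)))
}

val currentTransactionTupleByHashIdx = currentTransactionTupleWithIndex.map {
  case (address, ((byteArray, outIdx, year, month), (vertexId, _, _))) =>
    ((byteArray, outIdx), (vertexId, (year, month, address)))
}

// the self-join then matches only on (byteArrayTransaction, TransactionIndex)
val joinedTransactions = inputTransactionTupleByHashIdx.join(currentTransactionTupleByHashIdx)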

Then create the edges. In the original example this is

val bitcoinTransactionEdges = joinedTransactions.map(joinTuple => Edge(joinTuple._2._1._1, joinTuple._2._2._1, "input"))

i.e. each edge consists of the SrcVertexId, the DestVertexId, and some data (here "input"). Assuming you use the proposed tuples above, you could for example put

val bitcoinTransactionEdges = joinedTransactions.map(joinTuple => Edge(joinTuple._2._1._1, joinTuple._2._2._1, ("input", joinTuple._2._1._2._1, joinTuple._2._1._2._2)))

In this case the edge data would contain "input", Year, Month.

This is written without testing it exactly, so I am not yet one hundred percent sure I got the indices right (it is also late), but I hope it gives you a start to understand how this could work. I will also see about creating a wiki page or blog post to explain the Spark GraphX code, because I agree it is complicated given the complexity of Bitcoin. Please ask if something was unclear on my side!

ghost commented 5 years ago

First of all, I would like to thank you a lot for your detailed answer! I am going to try everything out now; I just have one question before starting. You correctly assume that I want the date later in the graph on the edges (basically, right before constructing the edges). Maybe I should not include the date when creating the vertices (btc_address, index)? Apart from that, I don't need the vertices in the final graph. I just store the edges so I can construct the graph later for processing.

After doing everything that you said and removing the date from the vertices (I kept it only on inputTransaction and outputTransaction), I have some lines after the final join that look like the one below. The format is ((ByteArray, Idx), ((srcIdx, (date, srcAddress)), (destIdx, (date, destAddress)))).

((org.zuinnote.spark.bitcoin.ByteArray@680edcee,0),
((912,(2011-04-07,bitcoinaddress_2DDC866B59587C7EFEF375E2A8D05870A9FA4059)),
(352930,(2011-02-27,bitcoinaddress_9EEF20C26C8D146058D1BC7DCEAD68C73CB24EA9))))

Again, some of the dates look weird. Since this entry comes from a single transaction, shouldn't the dates be the same? Or, at least, shouldn't the one for 352930 be a later date?

jornfranke commented 5 years ago

Sorry for the late reply. I will try to simulate the date case in the original example code tomorrow, because I am a little bit more familiar with it.

If all the dates are in the wrong order, you could of course simply reverse the join, e.g. val joinedTransactions = currentTransactionTupleByHashIdx.join(inputTransactionTupleByHashIdx)

I have to rethink the original use case, but I think there was a reason (maybe it does not apply to your case) why I point back to the address the money came from; I guess it was to be able to go from the current address back to where it came from using a graph algorithm.

Nothing, of course, prevents you from doing it the other way around, or from having both at the same time.

ghost commented 5 years ago

I will do it the other way around just to see if the results are better, but aren't those joined transactions from the same block? I mean, you take a block, you check its list of transactions, and you create from-to pairs. Shouldn't they then have the same timestamp? If you can have a look at it, it would help me a lot. I am confused and completely stuck.

jornfranke commented 5 years ago

In fact, a transaction input referencing another transaction's output is never from the same block. Check also this Wiki entry, referenced by the example program: https://en.bitcoin.it/wiki/From_address

It has a graph which may make it easier to understand the nature of the transaction graph. Each transaction input of a transaction t (in block b2) refers to a previous transaction p (in block b1). The date only exists at the block level (meaning all transactions in a block happen at the same time). However, the Bitcoin transaction graph does not reveal which outputs of p have been used for a given transaction input of t, meaning that potentially all outputs of p have to be linked to the input of t.
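
As a toy illustration only (these are made-up types, not the hadoopcryptoledger classes), the structure is roughly:

// hypothetical, simplified types just to illustrate the linkage
case class TxInput(prevTxHash: Array[Byte], prevOutputIndex: Long)  // references a transaction in an earlier block
case class TxOutput(value: Long, destinationAddress: String)
case class Transaction(hash: Array[Byte], inputs: Seq[TxInput], outputs: Seq[TxOutput])
case class Block(time: Int, transactions: Seq[Transaction])         // the date exists only here, per block

// p sits in block b1, t sits in a later block b2; t's inputs reference p only by (hash, output index).
// The only dates available for the edge p -> t are b1.time and b2.time, which differ,
// so the two sides of an edge naturally carry different dates.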

The more I explain in text, the more difficult it sounds =) What I usually do is take a piece of paper and draw it. The Bitcoin wiki, but also other sources (e.g. Slideshare), have some nice diagrams of how the Bitcoin transaction graph is structured. It is a complicated topic to explain via text.

ghost commented 5 years ago

I think I got it now, thank you for your explanation. Even if it is plain text, I got the basic idea and looked into it further. So, from my understanding, the dates of the input transactions should lie in the past compared to the date of the current transaction. Also, in order to create the graph for certain time snapshots, I have to consider the current transaction's timestamp, right?

If yes, there is something going really wrong, because after the join I have entries where the inputTransaction date is greater than the currentTransaction date. Even if I swap the join, the problem in general remains.

The code where I insert the timestamp is the same as yours; I just added bitcoinBlock.getTime() at the end, where you do result(counter) = (BitcoinScriptPatternParser..........).

I inspected the data further and printed the input and current transactions that are going to be joined later. The three tuples below will be joined later, and they will match since they have the same ByteArray and transaction index.

(bitcoinaddress_EC21AA0262C7A9B82C4252B2BA087ECE00152FDA,(org.zuinnote.spark.bitcoin.ByteArray@957477bc,0,2010,11))

(bitcoinaddress_2412E48C9DB6847AC4F94906963CC4744C4F618D,(org.zuinnote.spark.bitcoin.ByteArray@957477bc,0,2010,11))

(bitcoinaddress_E2CCD6EC7C6E2E581349C77E067385FA8236BF8A,(org.zuinnote.spark.bitcoin.ByteArray@957477bc,0,2010,10))

And this is what happens after the join (inputTransaction, currentTransaction):

((org.zuinnote.spark.bitcoin.ByteArray@957477bc,0),
((195811,(2010,11,bitcoinaddress_EC21AA0262C7A9B82C4252B2BA087ECE00152FDA)),
(231281,(2010,10,bitcoinaddress_E2CCD6EC7C6E2E581349C77E067385FA8236BF8A))))

So, what I am thinking is that I use the same timestamp for the inputTransactions, and that is what is wrong. Is that true?

jornfranke commented 5 years ago

Can you repost the latest source code that you use?

I hope to have some time later today to also integrate the transaction date into the example in this repository. I believe you have to select only one date when extracting, and that is for the currentTransaction; the inputTransaction should not contain a date.
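
Roughly, and untested, with the variable names from your earlier snippet, that would mean something like:

// sketch only: the date (year/month derived from the block time) stays with the current transaction,
// the input side carries no date at all
val inputTransactionTuple = bitcoinTransactionTuples.map(row =>
  (row(0), (new ByteArray(serialise(row(1))), row(2))))                   // (address, (inputTxHash, inputOutputIdx))

val currentTransactionTuple = bitcoinTransactionTuples.map(row =>
  (row(0), (new ByteArray(serialise(row(3))), row(4), row(5), row(6))))   // (address, (currTxHash, outputIdx, year, month))

// after joining both sides with the vertex index, key them on (hash, idx) only and
// take year/month for the edge data from the current-transaction side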

ghost commented 5 years ago

Yes, you are right. I finally fixed it! The date should only be on the currentTransaction and not on the input, since the inputs come from previous blocks with different timestamps.

After I finish the project, I will create a pull request with the final solution. I can also write something to explain your code (including my additions)!

jornfranke commented 5 years ago

Sure, but I meant the current transaction in the extractTransactionData function.

If you create a pull request, feel free to create it as a new example, i.e. as another subfolder of the examples directory.

Thanks!
