Thanks for reporting. I will check blk00979.dat again. However, can you please check that you do not have any rev*.dat files in the input folder? These files are not needed for processing and cannot be processed. By default, they sit in the Bitcoin folder together with the blk*.dat files.
I investigated the issue. It breaks because of the new segwit blocks introduced in BIP-144. More info on the new transaction format can be found here: https://github.com/bitcoin/bips/blob/master/bip-0144.mediawiki
BTW, thanks for such a great library!
I see. blk00979.dat is a recent file; maybe it was not yet fully downloaded by Bitcoin Core. Normally, I would not expect it to fail to parse, only to skip the segwit data. For example, if it contains segwit data, the current parser sees 0 transactions and skips the block.
According to the BIP, the format is not backward compatible: old parsers will read the first segwit transaction in a block as a transaction with no inputs and one output. Two new byte fields were introduced before the input counter: the first one is 0x00 (an old parser interprets it as an input counter of 0) and the second one is 0x01. If you see these two bytes in a transaction, you are looking at a segwit transaction that contains a witness script after the outputs.
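For illustration, here is a minimal Python sketch of how a parser could tell the two formats apart based on those two bytes (the 0x00 marker and 0x01 flag come from the BIP; the function name and fixed byte offsets are simplified assumptions, not the library's actual code):

def is_segwit_transaction(tx_bytes):
    # Bytes 0-3 hold the 4-byte version field; the next bytes are either the
    # start of the input counter (legacy format) or the segwit marker and flag.
    marker, flag = tx_bytes[4], tx_bytes[5]
    # A legacy transaction can never have 0 inputs, so 0x00 here can only be
    # the segwit marker; the flag byte is currently required to be 0x01.
    return marker == 0x00 and flag == 0x01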
Yes, this is clear, but the current parser should not have thrown an exception. Anyway, since segwit is live now and this is blocking, I will look into an emergency fix to read both transaction types and introduce proper support (Utils etc.) later.
BTW, there is a better description of the new format: https://github.com/bitcoin/bips/blob/master/bip-0141.mediawiki#Transaction_ID
I cannot yet estimate when this will be available, but it should be rather easy to implement. I originally planned some more weeks for a full release with Ethereum support, but I will delay that and fix this one first.
Thanks a lot!
Uploaded test data for segwit; the issue is verified/detected in unit testing. To do: add basic support for segwit, add further unit tests, test in the virtual cluster with MapReduce, Spark, Flink, Hive.
Most likely only the core library (hadoopcryptoledger) is affected; the MapReduce, Spark, Flink, and Hive modules do not need additional changes and can simply use the new library.
Just an update: it is fixed in the input format. The fix was designed to keep the input format compatible with the old version. I still need to update the Spark data source, implement more unit tests, and run tests of all components (Flink, Spark, Spark2, Hive, ...) on the virtual cluster. It will then be published as version 1.0.5, probably (no guarantee!) in the course of next week. This means you will be able to read blocks containing segwit data, including the segwit data itself. Extended segwit support (as described in https://github.com/ZuInnoTe/hadoopcryptoledger/issues/16) will probably be provided towards the end of September.
We published 1.0.5 to Maven Central. With this version, the issue should no longer appear for Bitcoin blocks containing segwit data. I did several internal tests. Can you please test as well and provide feedback?
I tested it on a recent copy of the blockchain and got the following two exceptions:
(I passed the following setting to spark-submit in order to invoke the new build: --packages com.github.zuinnote:spark-hadoopcryptoledger-ds_2.11:1.0.5)
java.nio.BufferUnderflowException
at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:151)
at org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlockReader.parseTransactions(BitcoinBlockReader.java:208)
at org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlockReader.readBlock(BitcoinBlockReader.java:134)
at org.zuinnote.hadoop.bitcoin.format.mapreduce.BitcoinBlockRecordReader.nextKeyValue(BitcoinBlockRecordReader.java:83)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:120)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:112)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
and:
java.lang.IllegalArgumentException: Illegal Capacity: -2028274993
at java.util.ArrayList.<init>(ArrayList.java:156)
at org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlockReader.parseTransactions(BitcoinBlockReader.java:194)
at org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlockReader.readBlock(BitcoinBlockReader.java:134)
at org.zuinnote.hadoop.bitcoin.format.mapreduce.BitcoinBlockRecordReader.nextKeyValue(BitcoinBlockRecordReader.java:83)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:120)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:112)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Can you please provide your parameters and some source code?
Do you know at which block it happens, i.e. the blk*.dat file?
Do you have rev*.blk files? If so then please delete them.
I meant rev*.dat files
I gave it blk00979.dat and onwards, and it produced the exceptions.
Here's the pyspark script:
from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext()
sqlContext = SQLContext(sc)

df = sqlContext.read \
    .format('org.zuinnote.spark.bitcoin.block') \
    .options(magic='F9BEB4D9') \
    .load('/data/btc/blocks/blk00979.dat')

def emit_big_payouts(block):
    # just do nothing
    try:
        pass
    except:
        yield (fn,)

exc = df.rdd \
    .flatMap(emit_big_payouts) \
    .collect()
Here's the spark-submit command:
SPARK_MAJOR_VERSION=2 \
SPARK_HOME=/usr/hdp/current/spark2-client \
spark-submit \
--packages com.github.zuinnote:spark-hadoopcryptoledger-ds_2.11:1.0.5 \
--master yarn \
--num-executors 1 \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
script-file.py
Just in case, here are the exceptions:
from blk00979.dat:
java.nio.BufferUnderflowException
at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:151)
at org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlockReader.parseTransactions(BitcoinBlockReader.java:208)
at org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlockReader.readBlock(BitcoinBlockReader.java:134)
at org.zuinnote.hadoop.bitcoin.format.mapreduce.BitcoinBlockRecordReader.nextKeyValue(BitcoinBlockRecordReader.java:83)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:120)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:112)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
from blk00981.dat:
java.lang.NegativeArraySizeException
at org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlockReader.parseTransactions(BitcoinBlockReader.java:207)
at org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlockReader.readBlock(BitcoinBlockReader.java:134)
at org.zuinnote.hadoop.bitcoin.format.mapreduce.BitcoinBlockRecordReader.nextKeyValue(BitcoinBlockRecordReader.java:83)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Also, I get out of memory exceptions on blk00982.dat even when I give the executor 2g of memory:
java.lang.OutOfMemoryError: Java heap space
at org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlockReader.parseTransactions(BitcoinBlockReader.java:227)
at org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlockReader.readBlock(BitcoinBlockReader.java:134)
at org.zuinnote.hadoop.bitcoin.format.mapreduce.BitcoinBlockRecordReader.nextKeyValue(BitcoinBlockRecordReader.java:83)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Here's the command:
SPARK_MAJOR_VERSION=2 \
SPARK_HOME=/usr/hdp/current/spark2-client \
spark-submit \
--packages com.github.zuinnote:spark-hadoopcryptoledger-ds_2.11:1.0.5 \
--master yarn \
--num-executors 1 \
--driver-memory 2g \
--executor-memory 2g \
--conf spark.yarn.executor.memoryOverhead=1g \
--executor-cores 1 \
script-file.py
OK, I used a block from this file for tests and it worked. However, I will check the whole file; maybe there is still a strange block inside. I will report back in 1-2 days.
Hmm, thanks, this is also strange, because the library limits the default block size to 2 MB - so an out of memory exception is likely to come from the application.
blk00979.dat is one of the first files with segwit blocks. Only a few of its blocks contain segwit transaction records. Most likely, you have picked a block without the new kind of transactions.
No, this block for sure contained a segwit transaction. I verified that it was actually read, and I manually inspected it in a hex editor.
I just saw that your Python program does a collect, which for a lot of data can indeed run into out of memory on the driver, especially because you have a 128 MB file which is deserialized into something much bigger. Try a count, or a count on exploded transactions (see the example application for the data source in https://github.com/zuinnote/hadoopcryptoledger); a small sketch follows below.
Nevertheless, for the other issues I will check the library internally and run it on several blk*.dat files.
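For reference, a minimal PySpark sketch of the count-based approach (it assumes the data source exposes a "transactions" array column as in the wiki example; adjust the column name if it differs):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("btc-count").getOrCreate()

df = spark.read \
    .format('org.zuinnote.spark.bitcoin.block') \
    .options(magic='F9BEB4D9') \
    .load('/data/btc/blocks/blk00979.dat')

# Count blocks without pulling any deserialized data to the driver.
print("blocks:", df.count())

# Count individual transactions by exploding the (assumed) "transactions"
# array column instead of collecting whole blocks.
tx_count = df.select(explode("transactions").alias("tx")).count()
print("transactions:", tx_count)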
OK, I changed it to use count():
from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext()
sqlContext = SQLContext(sc)

df = sqlContext.read \
    .format('org.zuinnote.spark.bitcoin.block') \
    .options(magic='F9BEB4D9') \
    .load('/data/btc/blocks/blk00969.dat')  # also tried blk00979.dat and blk00982.dat

def emit_big_payouts(block):
    # just do nothing
    try:
        pass
    except:
        yield 1

exc = df.rdd \
    .flatMap(emit_big_payouts) \
    .count()
When given blk00969.dat, it completes without errors.
When given blk00979.dat, it crashes with a BufferUnderflowException.
When given blk00982.dat, it crashes with an OutOfMemoryError.
OK, thank you for the detailed information. I will test it later tonight.
Here are my test results. I use HDP 2.5 with Spark 1.6.2. As the application I use the following Scala example (it sums the outputs of all transactions in the file): https://github.com/ZuInnoTe/hadoopcryptoledger/wiki/Use-HadoopCrytoLedger-library-as-Spark-DataSource
Output for blk00969.dat - [217402878136804] (no errors)
Output for blk00979.dat - NegativeArraySizeException
Output for blk00982.dat - BufferUnderflowException
Output for blk00990.dat - OutOfMemoryError
The same happens even if I avoid the data source API. I confirm that there is still an issue, but it is strange. Sorry about this, but the specification in the BIP is not very detailed (e.g. the witness data structure consists of the number of witness items (as varint), then the witness size (as varint), and then the witness script (as binary)). I will have to dig a little into the Bitcoin Core code to determine what has changed. The unit test data for sure contains a block with witness data.
OK, unit tests now confirm that it already happens with the second block of file blk00979.dat. The first block already contains witness data and was the one used in all unit tests. I missed implementing a test containing several blocks with witness data - I was too fast with publishing. It should not be a big issue to fix, but I want to spend more time on it now to make sure it really works. Thank you again for your detailed testing, it is highly appreciated!
After some exploration I found out that the error in blk00979.dat happens in the second block, in a non-segwit transaction.
this is the block that causes trouble: https://blockchain.info/de/block/000000000000000000a114f77a4e02373cebaaa6aef547625f3706b81ce95964
It is transaction number 2493 (non-segwit, but I do not trust it to be non-segwit) that causes trouble; transaction number 2492 is segwit. I will investigate more later. Maybe 2492 has some witness entries with 0 items that are not correctly interpreted (although the number of witness items is normally given beforehand).
I found the issue. The parsing algorithm should be as follows: after you have read the outputs of a transaction, if it is a segwit transaction, then for each transaction input read a varint number_of_items_on_stack; for each of those number_of_items_on_stack items, read a varint size_of_witness_data and then read size_of_witness_data bytes to fetch the witness script. A sketch of this is below.
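For illustration only, a minimal Python sketch of that step, assuming a byte stream positioned right after the outputs of a segwit transaction (read_varint and read_witnesses are illustrative names, not the library's actual API):

import struct

def read_varint(stream):
    # Bitcoin varint (CompactSize): 1, 3, 5, or 9 bytes depending on the first byte.
    first = stream.read(1)[0]
    if first < 0xfd:
        return first
    if first == 0xfd:
        return struct.unpack('<H', stream.read(2))[0]
    if first == 0xfe:
        return struct.unpack('<I', stream.read(4))[0]
    return struct.unpack('<Q', stream.read(8))[0]

def read_witnesses(stream, num_inputs):
    # One witness stack per transaction input; each stack carries its own item count.
    witnesses = []
    for _ in range(num_inputs):
        stack = []
        number_of_items_on_stack = read_varint(stream)
        for _ in range(number_of_items_on_stack):
            size_of_witness_data = read_varint(stream)
            stack.append(stream.read(size_of_witness_data))
        witnesses.append(stack)
    return witnesses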
Sorry about this. By chance I had selected a random segwit block that could be parsed fine because each transaction had only one input... I will do some testing with the other block data that you mention and then release another version. Sorry again.
I expect an update with a final fix this week. It will be tested with all the files you mention. If you think even more tests are needed, please let me know.
I tested all example applications with the files blk00969.dat, blk00979.dat, blk00982.dat, and blk00990.dat, and they work with the just released version 1.0.6. Let me know if your issues are also fixed. Thank you a lot for reporting.
I will close it, because local tests show no issues (meanwhile 1.0.7 is out). Feel free to open a new one if the problem still persists.
Thanks a lot!
Hi,
I'm using the com.github.zuinnote:spark-hadoopcryptoledger-ds_2.11:1.0.4 Spark package to read the blockchain files fetched by bitcoind and get the following exception when parsing blk00979.dat: