Thanks for reporting. I will check blk00979.dat again. However, can you please check that you do not have any rev*.dat files in the input folder? These files are not needed for processing and cannot be processed. By default, they sit in the Bitcoin folder together with the blk*.dat files.
I investigated the issue. It breaks because of the new segwit blocks introduced in BIP-144. More info on the new transaction format can be found here: https://github.com/bitcoin/bips/blob/master/bip-0144.mediawiki
BTW, thanks for such a great library!
I see. blk00979.dat is a recent file; maybe it was not yet fully downloaded by Bitcoin Core. Normally, I would not expect it to fail to parse, only to skip the segwit data. For example, if it contains segwit data, the current parser sees 0 transactions and skips the block.
According to the BIP, the format is not backward compatible: old parsers will read the first segwit transaction in a block as a transaction with no inputs and one output. Two new byte fields were introduced before the input counter: the first one is 0x00 (an old parser interprets it as an input counter of 0) and the second one is 0x01. If you see these two bytes in a transaction, you are looking at a segwit transaction that contains a witness script after the outputs.
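For illustration, here is a minimal Python sketch of how a parser could tell the two formats apart based on those two bytes (the 0x00 marker and 0x01 flag come from the BIP; the function name and fixed byte offsets are simplified assumptions, not the library's actual code):

def is_segwit_transaction(tx_bytes):
    # Bytes 0-3 hold the 4-byte version field; the next bytes are either the
    # start of the input counter (legacy format) or the segwit marker and flag.
    marker, flag = tx_bytes[4], tx_bytes[5]
    # A legacy transaction can never have 0 inputs, so 0x00 here can only be
    # the segwit marker; the flag byte is currently required to be 0x01.
    return marker == 0x00 and flag == 0x01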
Yes, this is clear, but the current parser should not have thrown an exception. Anyway, since segwit is live now and this is blocking, I will look into an emergency fix to read both transaction types and introduce proper support (Utils etc.) later.
BTW, there is a better description of the new format: https://github.com/bitcoin/bips/blob/master/bip-0141.mediawiki#Transaction_ID
I cannot yet estimate when this will be available, but it should be rather easy to implement. I originally planned some more weeks for a full release with Ethereum support, but I will delay that and fix this one first.
Thanks a lot!
Uploaded test data for segwit; the issue is verified/detected in unit testing. To do: add basic support for segwit, add further unit tests, test in the virtual cluster with MapReduce, Spark, Flink, Hive.
Most likely only the core library (hadoopcryptoledger) is affected; the MapReduce, Spark, Flink, and Hive modules do not need additional changes and can simply use the new library.
Just an update: it is fixed in the input format. The fix was designed to keep the input format compatible with the old version. I still need to update the Spark data source, implement more unit tests, and run tests of all components (Flink, Spark, Spark2, Hive, ...) on the virtual cluster. It will then be published as version 1.0.5, probably (no guarantee!) in the course of next week. This means you will be able to read blocks containing segwit data, including the segwit data itself. Extended segwit support (as described in https://github.com/ZuInnoTe/hadoopcryptoledger/issues/16) will probably be provided towards the end of September.
We published 1.0.5 to Maven Central. With this version, the issue should no longer appear for Bitcoin blocks containing segwit data. I did several internal tests. Can you please test as well and provide feedback?
I tested it on a recent copy of the blockchain and got the following two exceptions:
(I passed the following setting to spark-submit in order to invoke the new build: --packages com.github.zuinnote:spark-hadoopcryptoledger-ds_2.11:1.0.5)
java.nio.BufferUnderflowException
at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:151)
at org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlockReader.parseTransactions(BitcoinBlockReader.java:208)
at org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlockReader.readBlock(BitcoinBlockReader.java:134)
at org.zuinnote.hadoop.bitcoin.format.mapreduce.BitcoinBlockRecordReader.nextKeyValue(BitcoinBlockRecordReader.java:83)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:120)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:112)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
and:
java.lang.IllegalArgumentException: Illegal Capacity: -2028274993
at java.util.ArrayList.<init>(ArrayList.java:156)
at org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlockReader.parseTransactions(BitcoinBlockReader.java:194)
at org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlockReader.readBlock(BitcoinBlockReader.java:134)
at org.zuinnote.hadoop.bitcoin.format.mapreduce.BitcoinBlockRecordReader.nextKeyValue(BitcoinBlockRecordReader.java:83)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:120)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:112)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Can you please provide your parameters and some source code?
Do you know at which block it happens, i.e. the blk*.dat file?
Do you have rev*.blk files? If so then please delete them.
I meant rev*.dat files
I gave it blk00979.dat and onwards, and it produced the exceptions.
Here's the pyspark script:
from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext()
sqlContext = SQLContext(sc)

df = sqlContext.read \
    .format('org.zuinnote.spark.bitcoin.block') \
    .options(magic='F9BEB4D9') \
    .load('/data/btc/blocks/blk00979.dat')

def emit_big_payouts(block):
    # just do nothing
    try:
        pass
    except:
        yield (fn,)

exc = df.rdd \
    .flatMap(emit_big_payouts) \
    .collect()
Here's the spark-submit command:
SPARK_MAJOR_VERSION=2 \
SPARK_HOME=/usr/hdp/current/spark2-client \
spark-submit \
--packages com.github.zuinnote:spark-hadoopcryptoledger-ds_2.11:1.0.5 \
--master yarn \
--num-executors 1 \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
script-file.py
Just in case, here are the exceptions:
from blk00979.dat:
java.nio.BufferUnderflowException
at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:151)
at org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlockReader.parseTransactions(BitcoinBlockReader.java:208)
at org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlockReader.readBlock(BitcoinBlockReader.java:134)
at org.zuinnote.hadoop.bitcoin.format.mapreduce.BitcoinBlockRecordReader.nextKeyValue(BitcoinBlockRecordReader.java:83)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:120)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:112)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
from blk00981.dat:
java.lang.NegativeArraySizeException
at org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlockReader.parseTransactions(BitcoinBlockReader.java:207)
at org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlockReader.readBlock(BitcoinBlockReader.java:134)
at org.zuinnote.hadoop.bitcoin.format.mapreduce.BitcoinBlockRecordReader.nextKeyValue(BitcoinBlockRecordReader.java:83)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Also, I get out of memory exceptions on blk00982.dat even when I give the executor 2g of memory:
java.lang.OutOfMemoryError: Java heap space
at org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlockReader.parseTransactions(BitcoinBlockReader.java:227)
at org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlockReader.readBlock(BitcoinBlockReader.java:134)
at org.zuinnote.hadoop.bitcoin.format.mapreduce.BitcoinBlockRecordReader.nextKeyValue(BitcoinBlockRecordReader.java:83)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Here's the command:
SPARK_MAJOR_VERSION=2 \
SPARK_HOME=/usr/hdp/current/spark2-client \
spark-submit \
--packages com.github.zuinnote:spark-hadoopcryptoledger-ds_2.11:1.0.5 \
--master yarn \
--num-executors 1 \
--driver-memory 2g \
--executor-memory 2g \
--conf spark.yarn.executor.memoryOverhead=1g \
--executor-cores 1 \
script-file.py
OK, I used a block from this file for tests and it worked. However, I will check the whole file; maybe there is still a strange block inside. I will report back in 1-2 days.
Hmm, thanks, this is also strange, because the library limits the default block size to 2 MB - so an out of memory exception is likely to come from the application.
blk00979.dat is one of the first files with segwit blocks. Only a few of its blocks contain segwit transaction records. Most likely, you have picked a block without the new kind of transactions.
No, this block for sure contained a segwit transaction. I verified that it was actually read, and I manually inspected it in a hex editor.
I just saw that your Python program does a collect, which for a lot of data can indeed run into out of memory on the driver, especially because you have a 128 MB file which is deserialized into something much bigger. Try a count, or a count on exploded transactions (see the example application for the data source in https://github.com/zuinnote/hadoopcryptoledger); a small sketch follows below.
Nevertheless, for the other issues I will check the library internally and run it on several blk*.dat files.
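For reference, a minimal PySpark sketch of the count-based approach (it assumes the data source exposes a "transactions" array column as in the wiki example; adjust the column name if it differs):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("btc-count").getOrCreate()

df = spark.read \
    .format('org.zuinnote.spark.bitcoin.block') \
    .options(magic='F9BEB4D9') \
    .load('/data/btc/blocks/blk00979.dat')

# Count blocks without pulling any deserialized data to the driver.
print("blocks:", df.count())

# Count individual transactions by exploding the (assumed) "transactions"
# array column instead of collecting whole blocks.
tx_count = df.select(explode("transactions").alias("tx")).count()
print("transactions:", tx_count)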
OK, I changed it to use count():
from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext()
sqlContext = SQLContext(sc)

df = sqlContext.read \
    .format('org.zuinnote.spark.bitcoin.block') \
    .options(magic='F9BEB4D9') \
    .load('/data/btc/blocks/blk00969.dat')  # also tried blk00979.dat and blk00982.dat

def emit_big_payouts(block):
    # just do nothing
    try:
        pass
    except:
        yield 1

exc = df.rdd \
    .flatMap(emit_big_payouts) \
    .count()
When given blk00969.dat, it completes without errors.
When given blk00979.dat, it crashes with a BufferUnderflowException.
When given blk00982.dat, it crashes with an OutOfMemoryError.
OK, thank you for the detailed information. I will test it later tonight.
Here are my test results. I use HDP 2.5 with Spark 1.6.2. As the application I use the following Scala example (it sums the outputs of all transactions in the file): https://github.com/ZuInnoTe/hadoopcryptoledger/wiki/Use-HadoopCrytoLedger-library-as-Spark-DataSource
Output for blk00969.dat - [217402878136804] (no errors)
Output for blk00979.dat - NegativeArraySizeException
Output for blk00982.dat - BufferUnderflowException
Output for blk00990.dat - OutOfMemoryError
The same happens even if I avoid the data source API. I confirm that there is still an issue, but it is strange. Sorry about this, but the specification in the BIP is not very detailed (e.g. the witness data structure consists of the number of witness items (as varint), then the witness size (as varint), and then the witness script (as binary)). I will have to dig a little into the Bitcoin Core code to determine what has changed. The unit test data for sure contains a block with witness data.
OK, unit tests now confirm that it already happens with the second block of file blk00979.dat. The first block already contains witness data and was the one used in all unit tests. I missed implementing a test containing several blocks with witness data - I was too fast with publishing. It should not be a big issue to fix, but I want to spend more time on it now to make sure it really works. Thank you again for your detailed testing, it is highly appreciated!
After some exploration I found out that the error in blk00979.dat happens in the second block, in a non-segwit transaction.
this is the block that causes trouble: https://blockchain.info/de/block/000000000000000000a114f77a4e02373cebaaa6aef547625f3706b81ce95964
It is transaction number 2493 (non-segwit, but I do not trust it to be non-segwit) that causes trouble; transaction number 2492 is segwit. I will investigate more later. Maybe 2492 has some witness entries with 0 items that are not correctly interpreted (although the number of witness items is normally given beforehand).
I found the issue. The parsing algorithm should be as follows: after you have read the outputs of a transaction, if it is a segwit transaction, then for each transaction input read a varint number_of_items_on_stack; for each of those number_of_items_on_stack items, read a varint size_of_witness_data and then read size_of_witness_data bytes to fetch the witness script. A sketch of this is below.
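For illustration only, a minimal Python sketch of that step, assuming a byte stream positioned right after the outputs of a segwit transaction (read_varint and read_witnesses are illustrative names, not the library's actual API):

import struct

def read_varint(stream):
    # Bitcoin varint (CompactSize): 1, 3, 5, or 9 bytes depending on the first byte.
    first = stream.read(1)[0]
    if first < 0xfd:
        return first
    if first == 0xfd:
        return struct.unpack('<H', stream.read(2))[0]
    if first == 0xfe:
        return struct.unpack('<I', stream.read(4))[0]
    return struct.unpack('<Q', stream.read(8))[0]

def read_witnesses(stream, num_inputs):
    # One witness stack per transaction input; each stack carries its own item count.
    witnesses = []
    for _ in range(num_inputs):
        stack = []
        number_of_items_on_stack = read_varint(stream)
        for _ in range(number_of_items_on_stack):
            size_of_witness_data = read_varint(stream)
            stack.append(stream.read(size_of_witness_data))
        witnesses.append(stack)
    return witnesses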
Sorry about this. By chance I had selected a random segwit block that could be parsed fine because each transaction had only one input... I will do some testing with the other block data that you mention and then release another version. Sorry again.
I expect an update with a final fix this week. It will be tested with all the files you mention. If you think even more tests are needed, please let me know.
I tested all example applications with the files blk00969.dat, blk00979.dat, blk00982.dat, and blk00990.dat, and they work with the just released version 1.0.6. Let me know if your issues are also fixed. Thank you a lot for reporting.
I will close it, because local tests show no issues (meanwhile 1.0.7 is out). Feel free to open a new one if the problem still persists.
Thanks a lot!
Hi,
I'm using the com.github.zuinnote:spark-hadoopcryptoledger-ds_2.11:1.0.4 Spark package to read the blockchain files fetched by bitcoind and get the following exception when parsing blk00979.dat: