GoogleCloudDataproc / hadoop-connectors

Libraries and tools for interoperability between Hadoop-related open-source software and Google Cloud Platform.
Apache License 2.0

Hadoop Connector consistently fails to read specific shards from BigQuery #33

Closed. jfratzke closed this issue 6 years ago.

jfratzke commented 8 years ago

When extracting data from BigQuery, I see the following exception on specific tables.

java.lang.IndexOutOfBoundsException
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.read(GoogleHadoopFSInputStream.java:138)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:59)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
    at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:91)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:144)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:184)
    at com.google.cloud.hadoop.io.bigquery.GsonRecordReader.nextKeyValue(GsonRecordReader.java:87)
    at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.nextKeyValue(DynamicFileListRecordReader.java:177)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)

This issue always occurs on the exact same shard. If I set ENABLE_SHARDED_EXPORT_KEY to false, the error does not occur (but the export is much slower). With 8 shards, the JSON file created on GCS for that shard is 2.36 GB.
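For reference, here is a minimal sketch of that workaround as it might look in a Spark job using the 0.x connector API. The project, dataset, table, and bucket names are placeholders, not values from this issue.

```scala
import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, GsonBigQueryInputFormat}
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable
import org.apache.spark.{SparkConf, SparkContext}

object BigQueryUnshardedExportExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("bq-unsharded-export"))
    val conf = sc.hadoopConfiguration

    // Point the connector at the source table and a GCS bucket for the exported files.
    // All names below are placeholders.
    conf.set(BigQueryConfiguration.PROJECT_ID_KEY, "my-project")
    conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, "my-temp-bucket")
    BigQueryConfiguration.configureBigQueryInput(conf, "my-project", "my_dataset", "my_table")

    // Workaround described above: disable sharded export so the connector
    // falls back to a single (unsharded) BigQuery export job.
    conf.setBoolean(BigQueryConfiguration.ENABLE_SHARDED_EXPORT_KEY, false)

    // Read the exported JSON records as (offset, JsonObject) pairs.
    val rows = sc.newAPIHadoopRDD(
      conf,
      classOf[GsonBigQueryInputFormat],
      classOf[LongWritable],
      classOf[JsonObject])

    println(s"Read ${rows.count()} rows from BigQuery")
    sc.stop()
  }
}
```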

medb commented 6 years ago

Sharded export is deprecated and disabled by default starting with BigQuery connector 0.13.0, because the current BigQuery export is more efficient than the connector's sharded export.
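In practice that means upgrading the connector dependency to 0.13.0 or later avoids the sharded code path entirely. A sketch for an sbt build, assuming the connector's Maven Central coordinates and a Hadoop 2 build; substitute whatever 0.13.x (or newer) release matches your cluster:

```scala
// build.sbt: pull in a BigQuery connector release where sharded export is off by default.
// The exact version string here is an example; pick a current 0.13.x-hadoop2 release.
libraryDependencies += "com.google.cloud.bigdataoss" % "bigquery-connector" % "0.13.4-hadoop2"
```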

Closing this issue as obsolete.