Closed: jfratzke closed this issue 6 years ago.
Sharded export is deprecated and disabled by default starting with BigQuery connector 0.13.0, because the current BigQuery export path is more efficient than the connector's sharded export.
Closing this issue as obsolete.
When extracting data from BigQuery, I see the following exception on certain tables.
java.lang.IndexOutOfBoundsException
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.read(GoogleHadoopFSInputStream.java:138)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:59)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
    at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:91)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:144)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:184)
    at com.google.cloud.hadoop.io.bigquery.GsonRecordReader.nextKeyValue(GsonRecordReader.java:87)
    at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.nextKeyValue(DynamicFileListRecordReader.java:177)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
This issue always occurs on the exact same shard. If I set ENABLE_SHARDED_EXPORT_KEY to false I don't have any issue, but the job is much slower. With 8 shards, the JSON file created on GCS for the affected shard is 2.36 GB.
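For reference, here is a minimal sketch of the workaround I'm describing, assuming Spark with the Hadoop BigQuery connector (pre-0.13.0, where sharded export still exists). The project, dataset, table, and bucket names are placeholders; only ENABLE_SHARDED_EXPORT_KEY is the setting mentioned above.

```scala
import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, GsonBigQueryInputFormat}
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable
import org.apache.spark.sql.SparkSession

object BigQueryExportExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bq-export-example").getOrCreate()
    val conf = spark.sparkContext.hadoopConfiguration

    // Placeholder project, table, and temporary GCS bucket.
    BigQueryConfiguration.configureBigQueryInput(conf, "my-project:my_dataset.my_table")
    conf.set(BigQueryConfiguration.PROJECT_ID_KEY, "my-project")
    conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, "my-temp-bucket")

    // Workaround: disable sharded export so the connector falls back to the
    // slower, unsharded BigQuery export path that does not hit the exception.
    conf.set(BigQueryConfiguration.ENABLE_SHARDED_EXPORT_KEY, "false")

    // Read the exported table as (offset, JSON record) pairs.
    val tableRdd = spark.sparkContext.newAPIHadoopRDD(
      conf,
      classOf[GsonBigQueryInputFormat],
      classOf[LongWritable],
      classOf[JsonObject])

    println(s"rows read: ${tableRdd.count()}")
  }
}
```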