GoogleCloudDataproc / hadoop-connectors

Libraries and tools for interoperability between Hadoop-related open-source software and Google Cloud Platform.
Apache License 2.0

Got exception: Connection closed prematurely #12

Closed: motymichaely closed this issue 6 years ago

motymichaely commented 9 years ago

Hey there,

We recently started running into errors when reading files from Cloud Storage:

2015-09-30 23:15:17,334 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:u_1661 cause:java.io.IOException: Error reading gs://example-bucket/some-file.gz at position 20971520
java.io.IOException: Error reading gs://example-bucket/some-file.gz at position 20971520
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.openStreamAndSetMetadata(GoogleCloudStorageReadChannel.java:667)
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.performLazySeek(GoogleCloudStorageReadChannel.java:555)
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.read(GoogleCloudStorageReadChannel.java:289)
  at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.read(GoogleHadoopFSInputStream.java:158)
  at java.io.DataInputStream.read(DataInputStream.java:149)
  at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:151)
  at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:135)
  at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:77)
  at java.io.InputStream.read(InputStream.java:101)
  at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205)
  at org.apache.hadoop.util.LineReader.readLine(LineReader.java:169)
  at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:139)
  at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:259)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:204)
  at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:530)
  at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:363)
  at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
  at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.net.SocketTimeoutException: Read timed out
  at java.net.SocketInputStream.socketRead0(Native Method)
  at java.net.SocketInputStream.read(SocketInputStream.java:152)
  at java.net.SocketInputStream.read(SocketInputStream.java:122)
  at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
  at sun.security.ssl.InputRecord.read(InputRecord.java:480)
  at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
  at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
  at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
  at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
  at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
  at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
  at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
  at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
  at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
  at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
  at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:338)
  at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37)
  at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
  at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeMedia(AbstractGoogleClientRequest.java:380)
  at com.google.api.services.storage.Storage$Objects$Get.executeMedia(Storage.java:4680)
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.openStreamAndSetMetadata(GoogleCloudStorageReadChannel.java:651)
  ... 23 more

It also seems like there's a bug in the log output of the current retry.

Any idea why this occurs? It seems to be an intermittent issue, but I want to make sure.

Thanks

medb commented 6 years ago

We recently added new configuration properties that control reads from GCS. Please use them to work around any intermittent issues when reading from GCS.
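
For readers who hit the same "Read timed out" while a stream is being opened at a byte offset, the general workaround pattern is to reopen at the same position and retry with backoff. The following is only a minimal sketch of that pattern in plain Java, not the connector's implementation: `RetryingReader`, `RangeOpener`, `readWithRetry`, and all parameter names are hypothetical and do not correspond to any connector class or configuration property.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InterruptedIOException;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

final class RetryingReader {

    /** Hypothetical abstraction: opens a stream positioned at the given byte offset. */
    public interface RangeOpener {
        InputStream openAt(long position) throws IOException;
    }

    /**
     * Tries to open the source at {@code position} and read into {@code buf}.
     * On a SocketTimeoutException the stream is reopened at the same position,
     * doubling the wait between attempts. Other IOExceptions are not retried.
     * Returns the number of bytes read by the single successful read call.
     */
    public static int readWithRetry(RangeOpener opener, long position, byte[] buf,
                                    int maxAttempts, long initialBackoffMillis)
            throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoff = initialBackoffMillis;
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try (InputStream in = opener.openAt(position)) {
                return in.read(buf, 0, buf.length);
            } catch (SocketTimeoutException e) {
                last = e;
                if (attempt < maxAttempts) {
                    sleep(backoff);
                    backoff *= 2; // exponential backoff between attempts
                }
            }
        }
        throw new IOException("Error reading at position " + position
                + " after " + maxAttempts + " attempts", last);
    }

    private static void sleep(long millis) throws IOException {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new InterruptedIOException("Interrupted while waiting to retry");
        }
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a remote object; the first open times out.
        byte[] data = "hello world".getBytes(StandardCharsets.UTF_8);
        int[] opens = {0};
        RangeOpener flaky = position -> {
            if (opens[0]++ == 0) {
                throw new SocketTimeoutException("Read timed out");
            }
            int offset = (int) position;
            return new ByteArrayInputStream(data, offset, data.length - offset);
        };
        byte[] buf = new byte[5];
        int n = readWithRetry(flaky, 6, buf, 3, 10);
        System.out.println(n + " bytes: " + new String(buf, 0, n, StandardCharsets.UTF_8));
    }
}
```

Only timeouts are retried here because that is the failure shown in the stack trace above; whether other I/O errors are safe to retry depends on the job. When the connector's own configuration properties cover the case, prefer them over wrapping reads by hand.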