broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Spark Local Runner throws with unhelpful error message over gcs input #4369

Open · jamesemery opened 6 years ago

jamesemery commented 6 years ago

I have noticed that when running Spark tools (e.g. CountReadsSpark or MarkDuplicatesSpark) locally with a GCS input, e.g. "CountReadsSpark -I gs://my-bucket-dir/my-file.bam", the tool crashes with the following unhelpful stack traces:

java.io.IOException: Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token
    at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:208)
    at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:70)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1825)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:1012)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:975)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:500)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:469)
    at org.apache.spark.SparkContext$$anonfun$newAPIHadoopFile$2.apply(SparkContext.scala:1084)
    at org.apache.spark.SparkContext$$anonfun$newAPIHadoopFile$2.apply(SparkContext.scala:1072)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:679)
    at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:1072)
    at org.apache.spark.api.java.JavaSparkContext.newAPIHadoopFile(JavaSparkContext.scala:474)
    at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSource.getParallelReads(ReadsSparkSource.java:112)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.getUnfilteredReads(GATKSparkTool.java:254)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.getReads(GATKSparkTool.java:220)
    at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSpark.runTool(MarkDuplicatesSpark.java:72)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:387)
    at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:30)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:136)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:152)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
    at org.broadinstitute.hellbender.Main.main(Main.java:275)
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
    at sun.net.www.http.HttpClient.New(HttpClient.java:339)
    at sun.net.www.http.HttpClient.New(HttpClient.java:357)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1202)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1138)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1032)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:966)
    at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
    at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
    at com.google.cloud.hadoop.util.CredentialFactory$ComputeCredentialWithRetry.executeRefreshToken(CredentialFactory.java:158)
    at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
    at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:206)
    ... 31 more

This is followed by repetitions of the following stack trace:

Feb 07, 2018 12:41:59 PM com.google.api.client.http.HttpRequest execute
WARNING: exception thrown while executing request
java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
    at sun.net.www.http.HttpClient.New(HttpClient.java:339)
    at sun.net.www.http.HttpClient.New(HttpClient.java:357)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1202)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1138)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1032)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:966)
    at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
    at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
    at com.google.cloud.hadoop.util.CredentialFactory$ComputeCredentialWithRetry.executeRefreshToken(CredentialFactory.java:158)
    at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
    at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:206)
    at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:70)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1825)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:1012)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:975)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:500)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:469)
    at org.apache.spark.SparkContext$$anonfun$newAPIHadoopFile$2.apply(SparkContext.scala:1084)
    at org.apache.spark.SparkContext$$anonfun$newAPIHadoopFile$2.apply(SparkContext.scala:1072)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:679)
    at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:1072)
    at org.apache.spark.api.java.JavaSparkContext.newAPIHadoopFile(JavaSparkContext.scala:474)
    at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSource.getParallelReads(ReadsSparkSource.java:112)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.getUnfilteredReads(GATKSparkTool.java:254)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.getReads(GATKSparkTool.java:220)
    at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSpark.runTool(MarkDuplicatesSpark.java:72)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:387)
    at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:30)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:136)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:152)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
    at org.broadinstitute.hellbender.Main.main(Main.java:275)

Notably, I do not have a service-account key set up when running these tests; rather, I am logged in with my Google account.

mwalker174 commented 6 years ago

I just ran into this issue. I get the same error using my account and a service account.

I can work around this by adding the following Java lines in runTool():

    ctx.hadoopConfiguration().set("fs.gs.project.id", "<PROJECT>");
    ctx.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile", "<KEYFILE>");
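
To make the placement concrete, here is a minimal, self-contained sketch of that workaround; the class name and the local Spark context setup are hypothetical, and only the two property keys and the <PROJECT>/<KEYFILE> placeholders come from the lines above:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public final class GcsKeyfileWorkaround {
        public static void main(final String[] args) {
            final JavaSparkContext ctx = new JavaSparkContext(
                    new SparkConf().setAppName("gcs-keyfile-workaround").setMaster("local[*]"));
            // Hand the GCS connector an explicit project id and service-account
            // key file before any gs:// path is resolved, so it never probes the
            // GCE metadata server for credentials.
            ctx.hadoopConfiguration().set("fs.gs.project.id", "<PROJECT>");
            ctx.hadoopConfiguration().set(
                    "google.cloud.auth.service.account.json.keyfile", "<KEYFILE>");
            // ... reads from gs:// inputs would happen here ...
            ctx.stop();
        }
    }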

mwalker174 commented 6 years ago

Also, I do not get the error with non-Spark tools.

droazen commented 6 years ago

@jean-philippe-martin Your thoughts on this one?

jean-philippe-martin commented 6 years ago

Sounds like a bug in the Cloud Storage Connector's handling of default credentials.

lbergelson commented 6 years ago

They don't handle default credentials. They need explicit credentials set via a Spark property, as Mark is doing.
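
As a variation on Mark's workaround, the same two settings can also be supplied as Spark properties before the context is created, using Spark's standard "spark.hadoop." prefix (entries with that prefix are copied into the Hadoop Configuration). A sketch, with placeholder values:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public final class GcsCredentialProperties {
        public static void main(final String[] args) {
            // Spark copies any "spark.hadoop.*" property into the Hadoop
            // Configuration, so the GCS connector sees these when it initializes
            // the gs:// filesystem. <PROJECT> and <KEYFILE> are placeholders.
            final SparkConf conf = new SparkConf()
                    .setAppName("gcs-credential-properties")
                    .setMaster("local[*]")
                    .set("spark.hadoop.fs.gs.project.id", "<PROJECT>")
                    .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
                            "<KEYFILE>");
            final JavaSparkContext ctx = new JavaSparkContext(conf);
            ctx.stop();
        }
    }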

jean-philippe-martin commented 6 years ago

I guess missing a feature is a kind of bug. Is there something we should do on our end?

lbergelson commented 6 years ago

@jean-philippe-martin If the connector could configure itself with default credentials that would be amazing.

jean-philippe-martin commented 6 years ago

Sorry, by "our end" I meant on the GATK side. For the connector we can file an issue.

lbergelson commented 6 years ago

Oh, heh, darn, I was hoping you were volunteering to add default credential support :)

lbergelson commented 6 years ago

We already have an issue for it. https://github.com/GoogleCloudPlatform/bigdata-interop/issues/59

sooheelee commented 6 years ago

We have a researcher reporting that even when running locally, they get messages related to GCS, which they find puzzling:

WARNING: Failed to detect whether we are running on Google Compute Engine.

plus what looks like an error stack trace in the middle of stdout. Is this intentional?

jean-philippe-martin commented 6 years ago

This error message is related to GATK's ability to load files from Google Cloud Storage buckets ("gs://bucket/file.bam"). This ability is enabled even when running locally; that is intentional, because it is useful to run a local GATK instance against remote data without having to fire up a VM.

As the bucket-reading code ("NIO") initializes, it looks for credentials to use. Those can be set via an environment variable or via gcloud auth, as described in GATK's README. If neither of these is set, it checks whether it is currently running in a Google virtual machine (so it can determine who owns that machine and use those credentials). Apparently this code throws an exception when it runs out of ways to find credentials, and our code prints the exception and moves on.
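
For reference, the google-auth-library expresses the same lookup chain. GATK's NIO layer has its own implementation (and the stack traces above come from the GCS connector's separate credential factory), but a minimal sketch of the equivalent default-credential search, with a hypothetical class name, looks like this:

    import java.io.IOException;

    import com.google.auth.oauth2.GoogleCredentials;

    public final class DefaultCredentialLookup {
        public static void main(final String[] args) {
            try {
                // getApplicationDefault() walks the chain described above:
                // 1. the GOOGLE_APPLICATION_CREDENTIALS environment variable,
                // 2. the well-known file written by gcloud auth,
                // 3. the GCE metadata server (only reachable inside a Google VM).
                final GoogleCredentials credentials = GoogleCredentials.getApplicationDefault();
                System.out.println("Found application default credentials: " + credentials);
            } catch (final IOException e) {
                // With nothing configured and no metadata server reachable,
                // the lookup fails with an IOException.
                System.err.println("No default credentials found: " + e.getMessage());
            }
        }
    }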

The message is useful: if we were running in a Google VM and the credential lookup failed, we would certainly want to know. Whether we need the full stack trace is a choice we have to make.

lbergelson commented 6 years ago

It would be good if this didn't print a stack trace, and if the warning were less strident. It's confusing to get a stack trace about something on Google Cloud when you're running something entirely local.
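
A minimal sketch of the kind of change being suggested, with a hypothetical stand-in for the credential probe: catch the failure near its source and emit a single calm log line instead of a stack trace.

    import java.io.IOException;

    import org.apache.logging.log4j.LogManager;
    import org.apache.logging.log4j.Logger;

    public final class QuietCredentialProbe {
        private static final Logger logger = LogManager.getLogger(QuietCredentialProbe.class);

        // Hypothetical stand-in for the NIO credential probe that currently
        // lets its stack trace escape to stdout.
        private static void probeForGoogleCredentials() throws IOException {
            throw new IOException("connect timed out");
        }

        public static void main(final String[] args) {
            try {
                probeForGoogleCredentials();
            } catch (final IOException e) {
                // One line, warning level, no stack trace: harmless for purely
                // local runs, still visible if gs:// inputs were expected to work.
                logger.warn("Could not load Google Cloud credentials ({}); gs:// inputs will be unavailable.",
                        e.getMessage());
            }
        }
    }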

sooheelee commented 6 years ago

Thanks for the explanation @jean-philippe-martin. It would be great if we could make the messaging less alarming. The researcher who reported this is seasoned in bioinformatics.

lbergelson commented 6 years ago

Opening #5220 to deal with the stack trace issue.