jamesemery opened this issue 6 years ago
I just ran into this issue. I get the same error using my account and a service account.
I can work around this by adding the following Java lines in runTool():
ctx.hadoopConfiguration().set("fs.gs.project.id", "<PROJECT>");
ctx.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile", "<KEYFILE>");
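For context, the full shape of that workaround would be roughly the following — a minimal sketch assuming a tool that extends GATK's GATKSparkTool (the class name, project ID, and keyfile path are placeholders):

```java
import org.apache.spark.api.java.JavaSparkContext;
import org.broadinstitute.hellbender.engine.spark.GATKSparkTool;

// Hypothetical tool, shown only to illustrate where the workaround lines go.
public class MyCountReadsSpark extends GATKSparkTool {
    @Override
    protected void runTool(final JavaSparkContext ctx) {
        // The Cloud Storage connector does not pick up default credentials,
        // so hand it an explicit project and service-account key before any
        // gs:// paths are touched.
        ctx.hadoopConfiguration().set("fs.gs.project.id", "<PROJECT>");
        ctx.hadoopConfiguration().set(
                "google.cloud.auth.service.account.json.keyfile", "<KEYFILE>");
        // ... the tool's normal Spark logic follows ...
    }
}
```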
Also, I do not get the error with non-spark tools.
@jean-philippe-martin Your thoughts on this one?
Sounds like a bug in the Cloud Storage Connector's handling of default credentials.
They don't handle default credentials; they need explicit credentials set via a Spark property, like Mark is doing.
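For anyone hitting this in the meantime, the same two settings can be passed as Spark properties instead of being hard-coded in runTool() — Spark copies any property prefixed with spark.hadoop. into the Hadoop configuration that the connector reads. A sketch, with placeholder project and keyfile values:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class GcsCredentialConfExample {
    public static void main(String[] args) {
        // "spark.hadoop.*" properties are forwarded into hadoopConfiguration(),
        // which is where the Cloud Storage connector looks for credentials.
        final SparkConf conf = new SparkConf()
                .setAppName("gcs-credential-example")
                .setMaster("local[*]") // local run, for illustration only
                .set("spark.hadoop.fs.gs.project.id", "<PROJECT>")
                .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "<KEYFILE>");
        final JavaSparkContext ctx = new JavaSparkContext(conf);
        // ... submit work that reads gs:// inputs ...
        ctx.stop();
    }
}
```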
I guess missing a feature is a kind of bug. Is there something we should do on our end?
@jean-philippe-martin If the connector could configure itself with default credentials that would be amazing.
Sorry, by "our end" I meant on the GATK side. For the connector we can file an issue.
Oh, heh, darn, I was hoping you were volunteering to add default credential support :)
We already have an issue for it. https://github.com/GoogleCloudPlatform/bigdata-interop/issues/59
We have a researcher reporting that even when running locally, they get messages related to GCS, which they find puzzling:
WARNING: Failed to detect whether we are running on Google Compute Engine.
plus what looks like an error stack trace in the middle of stdout. Is this intentional?
This error message is related to GATK's ability to load files from Google buckets ("gs://bucket/file.bam"). This ability is enabled even when running locally (that part is intentional, because it's useful to be able to run a local GATK instance to process remote data without having to fire up a VM).
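To make that concrete: with the google-cloud-nio provider on the classpath (which, as I understand it, is how GATK exposes buckets through java.nio), a gs:// URI resolves like any other path. A sketch with a placeholder bucket and file:

```java
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class GcsNioProbe {
    public static void main(String[] args) throws Exception {
        // The "gs" scheme is served by the google-cloud-nio FileSystemProvider,
        // so a local process can stat or stream bucket data without a VM.
        final Path remote = Paths.get(URI.create("gs://my-bucket/my-file.bam"));
        System.out.println("remote size in bytes: " + Files.size(remote));
    }
}
```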
As the bucket-reading code ("NIO") initializes, it looks for credentials to use. Those can be set via an environment variable or via gcloud auth, as described in GATK's README. If neither of those is set, it checks whether it's currently running in a Google virtual machine (so it can figure out who owns that machine and use their credentials). Apparently this code throws an exception if it runs out of ways to find credentials, and our code prints it out and moves on.
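That lookup chain is what the google-auth library calls "application default credentials"; a small standalone probe (a sketch, not GATK's actual initialization code) reproduces the same behavior:

```java
import java.io.IOException;
import com.google.auth.oauth2.GoogleCredentials;

public class DefaultCredentialsProbe {
    public static void main(String[] args) {
        try {
            // Checks, in order: the GOOGLE_APPLICATION_CREDENTIALS environment
            // variable, gcloud's saved application-default credentials, and
            // finally the GCE metadata server (only reachable inside a Google VM).
            final GoogleCredentials creds = GoogleCredentials.getApplicationDefault();
            System.out.println("Found application default credentials: " + creds);
        } catch (IOException e) {
            // When all three fail -- e.g. a purely local run with no gcloud
            // login -- we land here, which is the exception GATK prints in full.
            System.out.println("No default credentials: " + e.getMessage());
        }
    }
}
```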
The message is useful: if we were running in a Google VM and the credential lookup failed, we'd certainly want to know. Whether we need the full stack trace, though, is a choice we have to make.
It would be good if this didn't print a stack trace, and if the warning were less strident. It's confusing to get a stack trace about something on Google when you're running something entirely local.
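One possible shape for that fix — a sketch only, not GATK's actual logging code (the hook name here is invented): keep a calm one-line WARN and demote the stack trace to DEBUG, so it's still recoverable when someone actually needs it.

```java
import java.io.IOException;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import com.google.auth.oauth2.GoogleCredentials;

public class QuietCredentialWarning {
    private static final Logger logger = LogManager.getLogger(QuietCredentialWarning.class);

    // Hypothetical initialization hook, for illustration only.
    public static void initializeGcsSupport() {
        try {
            GoogleCredentials.getApplicationDefault();
        } catch (IOException e) {
            // One calm line for the common, harmless local-run case...
            logger.warn("No Google Cloud credentials found; gs:// inputs will not be readable. "
                    + "This is expected and harmless if all of your inputs are local.");
            // ...with the full stack trace still available at DEBUG verbosity.
            logger.debug("Credential lookup failed", e);
        }
    }
}
```

That way a DEBUG-level run still surfaces the full trace for anyone chasing a real credential problem inside a VM, while the default output stays quiet.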
Thanks for the explanation @jean-philippe-martin. It would be great if we could make the messaging less alarming. The researcher who reported this is seasoned in bioinformatics.
Opening #5220 to deal with the stack trace issue.
I have noticed that when running Spark tools (e.g. CountReadsSpark or MarkDuplicatesSpark) with an input on a Google bucket, as in "CountReadsSpark -I gs://my-bucket-dir/my-file.bam", the tool crashes with the following unhelpful stack traces:
Followed by repetitions of the following stack trace:
Notable is the fact that I do not have a service key set up when executing these tests, but rather have logged in using my Google account.