GoogleCloudPlatform / google-cloud-eclipse

Google Cloud Platform plugin for Eclipse
Apache License 2.0

Dataflow wizard and run configurations should use the existing cloud tools authentication #2002

Closed tgroh closed 7 years ago

tgroh commented 7 years ago

Currently, the Dataflow plugin relies on the Cloud SDK command line tool to be installed and set up. However, the cloud tools plugin already has an authentication feature. Requiring users to authenticate multiple times in multiple locations is painful, and authenticating within the IDE requires fewer touches and provides a smoother experience.

briandealwis commented 7 years ago

I'm just digging into this. Our login supports multiple accounts, and we don't support a default account as it can get confusing. I'm not a data flow expert (yet!) so may I ask: is specifying a staging location something that must be done during project creation (e.g., for debugging a pipeline locally), or can it be deferred to deployment time?

chanseokoh commented 7 years ago

@tgroh could you answer https://github.com/GoogleCloudPlatform/google-cloud-eclipse/issues/2002#issuecomment-310202171? We are wondering if there is a need to associate a GCP project (and auth) before deploying, and if not, it'd make sense to ask for them only at the time of deploying.

tgroh commented 7 years ago

Sorry, I seem to have missed that.

A staging location is not required when running a pipeline locally, only when executing on the Dataflow Service with the DataflowRunner. This can be configured per-execution.

chanseokoh commented 7 years ago

@tgroh We pass a GCP project ID and a bucket through arguments to main(). How can we pass a Credential (OAuth2 access token) then?

 * <p>To run this starter example using managed resource in Google Cloud
 * Platform, you should specify the following command-line options:
 *   --project=<YOUR_PROJECT_ID>
 *   --stagingLocation=<STAGING_LOCATION_IN_CLOUD_STORAGE>
 *   --runner=DataflowRunner
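For context, a minimal sketch of how a generated starter pipeline typically consumes these flags, assuming Dataflow 1.x package names; the class name and wiring here are illustrative, not the actual generated code. Note that none of these built-in options accepts a Credential object, which is the crux of the question:

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class StarterPipeline {
    public static void main(String[] args) {
        // --project, --stagingLocation, and --runner are parsed from the command
        // line; a Credential (OAuth2 access token) cannot be passed this way.
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
        Pipeline p = Pipeline.create(options);
        // ... build the pipeline here ...
        p.run();
    }
}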
tgroh commented 7 years ago

All of the Dataflow SDKs use getCredentialFactoryClass() to create the Credentials used by Dataflow. I don't believe there's a built-in CredentialFactory that takes either the serialized form or that reads from a file outside the default location. Within Dataflow 1.x it may be possible to use getSecretsFile with appropriate contents, but I'm not positive that this is the case. Otherwise, it may be possible to construct a JAR that can be prepended to a project's dependencies and that contains a CredentialFactory class which reads from either a pipeline option or a file (and, I believe, would use GoogleCredentials.fromStream()). That may be appropriate to merge into current Beam head.

chanseokoh commented 7 years ago

I have more questions:

lukecwik commented 7 years ago
chanseokoh commented 7 years ago

Update:

In our scenario, we need to pass a CT4E Credential down to a user's program, so we have to write a temporary JSON file. It turns out it's straightforward to build a Credential from that JSON using GoogleCredential.fromStream() (for Dataflow 1.x) and Credentials using GoogleCredentials.fromStream() (for Dataflow 2.x).

--credentialFactoryClass=com.company.MyCredentialFactory

This works very well.
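For the Dataflow 2.x (Beam-based) variant mentioned above, a minimal sketch might look like the following, assuming the org.apache.beam.sdk.extensions.gcp.auth package layout and a made-up temporary file path; the same --credentialFactoryClass flag applies, only the interface and return type change:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.GeneralSecurityException;

import org.apache.beam.sdk.extensions.gcp.auth.CredentialFactory;
import org.apache.beam.sdk.options.PipelineOptions;

import com.google.auth.Credentials;
import com.google.auth.oauth2.GoogleCredentials;

public class MyCredentialFactory2 implements CredentialFactory {

    // Hypothetical location of the temporary JSON credential written by CT4E.
    private final Path tempCredFileCreatedByCt4e;

    private MyCredentialFactory2(Path tempCredFileCreatedByCt4e) {
        this.tempCredFileCreatedByCt4e = tempCredFileCreatedByCt4e;
    }

    // Dataflow instantiates the factory reflectively through this static method.
    public static MyCredentialFactory2 fromOptions(PipelineOptions options) {
        return new MyCredentialFactory2(Paths.get("/tmp/ct4e-credential.json"));
    }

    @Override
    public Credentials getCredential() throws IOException, GeneralSecurityException {
        try (InputStream in = Files.newInputStream(tempCredFileCreatedByCt4e)) {
            // Builds Credentials from the temporary JSON file.
            return GoogleCredentials.fromStream(in);
        }
    }
}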

chanseokoh commented 7 years ago

Now, here's the part that I am confused about:

Our CT4E Credentials are built with the client ID and client secret of an internal GCP project assigned to our app, CT4E. (So, if CT4E makes an API call, our internal project is billed, I think.)

I think we need to replace the client ID and secret with those of the user's project on which a Dataflow job is to run. I'm not sure where I would get that client ID and secret from. Maybe I misunderstand how a Credential should work in this scenario.

chanseokoh commented 7 years ago

Or, maybe it's correct to use the internal CT4E project's client ID and secret. All I need to do is to turn on the Dataflow API on the internal project.

elharo commented 7 years ago

Hmm, that sounds very off. We shouldn't be billed when the user runs their dataflow app in their GCP project.

briandealwis commented 7 years ago

Shouldn't we be requesting some kind of Dataflow admin scope so that we can submit jobs on their behalf? The internal project is just for tracking on accounts.google.com so the logged-in user can see the permissions requested/granted to CT4E. Billing on the user's dataflow project will go to the project billing account.

chanseokoh commented 7 years ago

Shouldn't we be requesting some kind of Dataflow admin scope so that we can submit jobs on their behalf?

My understanding so far is that the Dataflow plugin in our CT4E does not submit jobs on users' behalf. It's the user's program (a regular Java program with main()) that submits the job; what we actually do is run the user's program.

chanseokoh commented 7 years ago

The internal project is just for tracking on accounts.google.com so the logged-in user can see the permissions requested/granted to CT4E.

Maybe I'm talking about something else, but I believe the internal project is more than just tracking. The project enables the App Engine Admin API, Cloud Resource Manager API, Compute Engine API, etc., and if we disable those APIs, deploying from CT4E won't work, for example.

chanseokoh commented 7 years ago

@tgroh I see an issue with the current workaround of augmenting users' projects with an auxiliary CredentialFactory class in our scenario. The class needs to know the location of a credential JSON file, and the only way it could get that path is through the PipelineOptions passed to fromOptions() below:

// Dataflow 1.x imports (assumed package names).
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.GeneralSecurityException;

import com.google.api.client.auth.oauth2.Credential;
import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.util.CredentialFactory;

public class MyCredentialFactory implements CredentialFactory {

    private final Path tempCredFileCreatedByCt4e;

    private MyCredentialFactory(Path tempCredFileCreatedByCt4e) {
        this.tempCredFileCreatedByCt4e = tempCredFileCreatedByCt4e;
    }

    @Override
    public Credential getCredential() throws IOException, GeneralSecurityException {
        try (InputStream in = Files.newInputStream(tempCredFileCreatedByCt4e)) {
            return GoogleCredential.fromStream(in);
        }
    }

    // Dataflow calls this reflectively; "options" is the only hook we get.
    public static MyCredentialFactory fromOptions(PipelineOptions options) {
        Path tempCredFileCreatedByCt4e = null /* get this from "options" */;
        return new MyCredentialFactory(tempCredFileCreatedByCt4e);
    }
}

The problem is that we can't seem to pass an arbitrary option unless we define a custom PipelineOptions by extending it. Any thoughts?

tgroh commented 7 years ago

If you've augmented their project by adding a class, you should be able to also create a PipelineOptions interface with that option, and a PipelineOptionsRegistrar, which in combination should allow you to specify that path on the command line.
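A minimal sketch of what that combination might look like, assuming Dataflow 1.x package names; the interface, option name, and use of @AutoService (or an equivalent hand-written META-INF/services entry) are illustrative:

import com.google.auto.service.AutoService;
import com.google.cloud.dataflow.sdk.options.Description;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsRegistrar;
import com.google.common.collect.ImmutableList;

public interface Ct4eCredentialOptions extends PipelineOptions {

    @Description("Path to the temporary credential JSON file written by CT4E")
    String getCt4eCredentialFile();
    void setCt4eCredentialFile(String value);

    // Registers the interface so that --ct4eCredentialFile is accepted on the
    // command line without any code in the user's main().
    @AutoService(PipelineOptionsRegistrar.class)
    class Registrar implements PipelineOptionsRegistrar {
        @Override
        public Iterable<Class<? extends PipelineOptions>> getPipelineOptions() {
            return ImmutableList.<Class<? extends PipelineOptions>>of(Ct4eCredentialOptions.class);
        }
    }
}

MyCredentialFactory.fromOptions() could then read the path via options.as(Ct4eCredentialOptions.class).getCt4eCredentialFile().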

chanseokoh commented 7 years ago

Looks like the only way to make remote Dataflow jobs pick up CT4E login credentials instead of the default application credentials is to drop a custom-built JAR library into users' projects. This will involve the following work on our side:

  1. Set up a new repo for the JAR library.
    • This will be a small Maven project.
    • Actually, we need to create two JARs, for Dataflow v1.x and for Dataflow v2.x.
    • CT4E won't make direct use of any classes in these JARs. We just need the JARs themselves, so that we can drop them into users' projects.
    • I don't think it's a good idea to put the JARs on Maven Central. We will just embed the JARs into our CT4E release binaries.
  2. Our Dataflow bundles need to add one of these JARs to a user's Dataflow project when creating it through our Dataflow wizard (a rough sketch follows this comment).
    1. We can put the JAR into a directory <project-root>/ct4e-extra-lib.
    2. The JAR is only needed to enable passing CT4E login credentials down to the user's (Dataflow) program. So we can add a README file to the <project-root>/ct4e-extra-lib directory saying something like
      Required only to enable running Dataflow v1.x jobs on the Google Cloud Platform
      by Cloud Tools for Eclipse (CT4E) inside Eclipse.
      You can safely remove this artifact if you do not run Dataflow jobs using CT4E.
    3. Eventually, there should be a way to easily add these JARs to any existing projects. Even so, users who forget to add these JARs or who are not aware that they need to add these JARs would have trouble figuring out why their CT4E login does not work.

All of these complications, and particularly the downside of 2.iii, would be eliminated if the functionality in these JARs (i.e., giving Dataflow a way to load credentials from JSON) were integrated into the Dataflow SDKs. @tgroh will it be easy to do so in a reasonably short time frame?
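For step 2 above, a rough sketch of adding the bundled JAR to the generated project's classpath through the JDT API; the folder and JAR names are the hypothetical ones from the list, and this is not the actual wizard code:

import org.eclipse.core.resources.IFile;
import org.eclipse.core.resources.IProject;
import org.eclipse.core.runtime.CoreException;
import org.eclipse.core.runtime.NullProgressMonitor;
import org.eclipse.jdt.core.IClasspathEntry;
import org.eclipse.jdt.core.IJavaProject;
import org.eclipse.jdt.core.JavaCore;

public class Ct4eExtraLibInstaller {

    public static void addCredentialFactoryJar(IProject project) throws CoreException {
        // Hypothetical location of the JAR copied into the user's project.
        IFile jar = project.getFile("ct4e-extra-lib/ct4e-dataflow-credential-1.x.jar");
        IJavaProject javaProject = JavaCore.create(project);
        IClasspathEntry[] entries = javaProject.getRawClasspath();
        IClasspathEntry[] newEntries = new IClasspathEntry[entries.length + 1];
        System.arraycopy(entries, 0, newEntries, 0, entries.length);
        // Append the JAR as a library entry on the project's build path.
        newEntries[entries.length] = JavaCore.newLibraryEntry(jar.getFullPath(), null, null);
        javaProject.setRawClasspath(newEntries, new NullProgressMonitor());
    }
}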

lukecwik commented 7 years ago

Note that the default application credentials are configurable via the GOOGLE_APPLICATION_CREDENTIALS environment variable. As long as the user's application runs with that environment variable pointing to a valid JSON credentials file, you won't need to inject any code into the user's application.

The Google OAuth library documentation has a section detailing the order in which credentials are resolved from the application's environment.
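For reference, a minimal sketch of that resolution using google-auth-library (the class name here is made up):

import java.io.IOException;

import com.google.auth.oauth2.GoogleCredentials;

public class AdcProbe {
    public static void main(String[] args) throws IOException {
        // Resolution order: the file named by GOOGLE_APPLICATION_CREDENTIALS,
        // then the gcloud application-default credentials file, then the
        // metadata server when running on GCP.
        GoogleCredentials credentials = GoogleCredentials.getApplicationDefault();
        System.out.println("Resolved credentials: " + credentials.getClass().getSimpleName());
    }
}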

chanseokoh commented 7 years ago

@lukecwik Cool, that could be another workaround. But I am not sure overriding a user's env variable is a good idea. Users might have set it up for their own purposes. Moreover, this may not work when running multiple Dataflow jobs and other user tasks that depend on this env var concurrently.

elharo commented 7 years ago

Yes, we do need to work with the dataflow SDK; but in the meantime GOOGLE_APPLICATION_CREDENTIALS sounds like the simpler and less invasive approach.


chanseokoh commented 7 years ago

I agree. I mistakenly assumed such an env var would affect things outside the user's Dataflow program. This sounds like a good and simple workaround.
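A minimal sketch of that per-launch approach, assuming the Eclipse debug APIs; the class name, method, and path handling are made up. Only the launched pipeline process sees the variable:

import java.util.HashMap;
import java.util.Map;

import org.eclipse.core.runtime.CoreException;
import org.eclipse.debug.core.ILaunchConfigurationWorkingCopy;
import org.eclipse.debug.core.ILaunchManager;

public class DataflowLaunchEnvironment {

    public static void injectCredentialFile(
            ILaunchConfigurationWorkingCopy launchConfig, String credentialJsonPath) throws CoreException {
        Map<String, String> env = new HashMap<>();
        env.put("GOOGLE_APPLICATION_CREDENTIALS", credentialJsonPath);
        // Set only on this launch configuration and appended to the native
        // environment, so the user's own environment is left untouched.
        launchConfig.setAttribute(ILaunchManager.ATTR_ENVIRONMENT_VARIABLES, env);
        launchConfig.setAttribute(ILaunchManager.ATTR_APPEND_ENVIRONMENT_VARIABLES, true);
        launchConfig.doSave();
    }
}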

chanseokoh commented 7 years ago

@lukecwik some dumb questions about the OAuth client ID and the client secret:

The credential we construct in CT4E after a user logs in has the client ID and secret of our internal project. That is, our CT4E application, as a client, gets permission from users to access their accounts. Therefore, when we write such a credential down to JSON, it will contain the internal project's client ID and secret. For Dataflow programs, using this client ID does not make sense. Where should the ID and secret come from, and how can I obtain them? Should the credential be a different "authorization grant type"? Sorry, I'm largely ignorant of this stuff.

chanseokoh commented 7 years ago

@lukecwik Never mind. I talked to the Cloud SDK team, and gcloud also faces this same "weird" situation.

The application default credential created by gcloud auth application-default login is basically a proxy (which provides convenience for local testing, for example), and its client ID and client secret come from a pre-assigned internal project for gcloud. So, AFAIK, if users use this gcloud application default credential when running a Dataflow pipeline program, then on the server side the API calls from the user's program are seen as calls from gcloud. As such, these API calls are limited by the quota limits set on the internal gcloud project. This situation is certainly weird in that it does not reflect reality, but that is how it currently works due to some limitation. As for billing, although API quotas are controlled by the internal project, many APIs are not billed simply for making API calls but rather for actual resource usage. This includes the GCS API, so we think we are okay with Dataflow if we use our CT4E internal project, much like gcloud does. In the current state, using our CT4E project is not much different from gcloud already using its project.

BTW, not every API is billed per actual resource usage. For example, the machine learning API is billed for the API calls themselves, so the gcloud internal project disabled that API. That means the gcloud application default credential won't work with user programs that use such an API.

TL;DR: We will just use our login credential with the CT4E client ID for Dataflow (https://github.com/GoogleCloudPlatform/google-cloud-eclipse/issues/2002#issuecomment-314880233), although this may look a bit weird.

chanseokoh commented 7 years ago

@tgroh one last question. When running locally with Dataflow v1.x (i.e., using DirectPipelineRunner), a user's program doesn't get the staging location parameter but still gets the GCP project ID parameter. Does it make any use of it? Does running locally still require authentication?

tgroh commented 7 years ago

Running locally shouldn't require authentication, though we may still use it for accessing buckets and the like if we connect to an existing service.


chanseokoh commented 7 years ago

@tgroh that's a confusing answer. So, should we set GOOGLE_APPLICATION_CREDENTIALS for DirectPipelineRunner (and DirectRunner, if applicable) or not? Or should we set it only when a user has configured a Google account and that account is logged in to CT4E?

lukecwik commented 7 years ago

Dataflow still uses the project associated with the caller, not the project of the resource being used. There has been an outstanding issue to migrate to the resource project for quite a long time.