GoogleCloudPlatform / DataflowJavaSDK

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
http://cloud.google.com/dataflow

Use emulators in local development #466

Open RXminuS opened 8 years ago

RXminuS commented 8 years ago

Is there any (documented) way that I can use the Pub/Sub, Bigtable, and Datastore emulators during integration testing?

I remember that in Node the client libraries would look for the existence of an environment variable, and even had explicit configuration options, but searching through the DataflowJavaSDK repo I can't find any reference to such a feature.
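For reference, the gcloud emulators export host variables such as `PUBSUB_EMULATOR_HOST` and `DATASTORE_EMULATOR_HOST` (via `gcloud beta emulators ... env-init`). A minimal sketch of the Node-style lookup in Java — the variable names come from the gcloud emulator docs, but the `rootUrl` helper itself is hypothetical, not part of the SDK:

```java
import java.util.Map;

public class EmulatorEnv {
  // Resolve a service root URL, preferring the emulator host variable
  // (e.g. PUBSUB_EMULATOR_HOST) when it is set in the environment.
  static String rootUrl(Map<String, String> env, String var, String defaultUrl) {
    String host = env.get(var);
    // Emulators speak plain HTTP on localhost, so no TLS scheme here.
    return host != null ? "http://" + host : defaultUrl;
  }

  public static void main(String[] args) {
    // With the variable set, the emulator endpoint wins:
    System.out.println(rootUrl(
        Map.of("PUBSUB_EMULATOR_HOST", "localhost:8085"),
        "PUBSUB_EMULATOR_HOST",
        "https://pubsub.googleapis.com"));  // prints http://localhost:8085
  }
}
```

In practice you would pass `System.getenv()` as the map; taking it as a parameter keeps the helper testable.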

RXminuS commented 8 years ago

OK, after some digging I found this in the code, so there is an option for setting the Pub/Sub root URL (which is needed to use the emulator); it just seems to be a documentation issue.

    @Override
    public PubsubClient newClient(
        @Nullable String timestampLabel, @Nullable String idLabel, DataflowPipelineOptions options)
        throws IOException {
      Pubsub pubsub = new Builder(
          Transport.getTransport(),
          Transport.getJsonFactory(),
          chainHttpRequestInitializer(
              options.getGcpCredential(),
              // Do not log 404. It clutters the output and is possibly even required by the caller.
              new RetryHttpRequestInitializer(ImmutableList.of(404))))
          .setRootUrl(options.getPubsubRootUrl())
          .setApplicationName(options.getAppName())
          .setGoogleClientRequestInitializer(options.getGoogleApiTrace())
          .build();
      return new PubsubJsonClient(timestampLabel, idLabel, pubsub);
    }
RXminuS commented 8 years ago

An update: it works for Pub/Sub by setting the --pubsubRootUrl option to whatever you need and having your options interface extend DataflowPipelineDebugOptions, but no such option seems to exist for the other Google services that have emulators, such as Datastore.

This makes it hard to perform integration tests. I'm assuming we can just copy the PubSub approach over to Datastore. Any thoughts or objections? Otherwise I'll make a pull request.
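Mirroring the Pub/Sub option, such a flag could be resolved roughly like this. A sketch only: the `--datastoreRootUrl` flag name and the `datastoreRootUrl` helper are my proposed naming, not an existing SDK API, and in the real SDK this would go through PipelineOptionsFactory rather than manual argument scanning:

```java
public class DatastoreRootUrlFlag {
  // Hypothetical mirror of --pubsubRootUrl for Datastore: scan the pipeline
  // arguments for --datastoreRootUrl=... and fall back to the production URL.
  static String datastoreRootUrl(String[] args) {
    String prefix = "--datastoreRootUrl=";
    for (String arg : args) {
      if (arg.startsWith(prefix)) {
        return arg.substring(prefix.length());
      }
    }
    return "https://datastore.googleapis.com";
  }

  public static void main(String[] args) {
    // Pointing the pipeline at a locally running Datastore emulator:
    System.out.println(datastoreRootUrl(
        new String[] {"--datastoreRootUrl=http://localhost:8081"}));
  }
}
```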

dhalperi commented 8 years ago

I think this should probably be added to the Datastore source/sink builders rather than to a global pipeline option. That would encapsulate the configuration in the right place.

Also, could you please make the change first in Apache Beam (which will be the basis of Dataflow 2.0)? See the Beam contribution guide.
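The builder-based approach suggested above could look roughly like this. The `DatastoreReader` and `withRootUrl` names are illustrative only, not the real API; the point is that the override travels with the specific source instead of living in global pipeline options:

```java
public class DatastoreReader {
  private final String rootUrl;

  private DatastoreReader(String rootUrl) {
    this.rootUrl = rootUrl;
  }

  // Default: talk to the production Datastore endpoint.
  public static DatastoreReader create() {
    return new DatastoreReader("https://datastore.googleapis.com");
  }

  // Builder-style override: return a copy pointed at a different endpoint,
  // e.g. a locally running emulator. Only this source is affected.
  public DatastoreReader withRootUrl(String rootUrl) {
    return new DatastoreReader(rootUrl);
  }

  public String rootUrl() {
    return rootUrl;
  }

  public static void main(String[] args) {
    DatastoreReader reader = DatastoreReader.create()
        .withRootUrl("http://localhost:8081");
    System.out.println(reader.rootUrl());  // prints http://localhost:8081
  }
}
```

Immutable with-style builders are also what Beam's IO transforms converged on, so a test can configure one source against an emulator while the rest of the pipeline is untouched.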