GoogleCloudPlatform / DataflowJavaSDK

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
http://cloud.google.com/dataflow

Files from classpath are not properly resolved when classpath JAR contains META-INF with references to other dependencies. #538

Open metabrain opened 7 years ago

metabrain commented 7 years ago

Hello,

IntelliJ has a feature called dynamic.classpath, which is used when the number of classpath entries required to launch the JVM and program exceeds the maximum command-line length allowed by the OS's terminal. When this happens, IntelliJ prompts the user to activate the feature.

When it is active, a single JAR is used on the classpath. That JAR's META-INF/MANIFEST.MF file contains a reference to all the other required JARs (instead of having them all laid out in 'java -classpath ...').
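As an illustration of the mechanism described above, here is a minimal sketch of reading the Class-Path attribute from a manifest using the standard java.util.jar API. The class name ClassPathManifestSketch, the helper names, and the example paths are hypothetical; this is not the SDK's code, only a sketch of what a tool would have to do to see the JARs hidden behind the single classpath JAR.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

public class ClassPathManifestSketch {

    /** Returns the space-separated entries of the Class-Path attribute, or an empty list if it is absent. */
    static List<String> classPathEntries(Manifest manifest) {
        String value = manifest.getMainAttributes().getValue(Attributes.Name.CLASS_PATH);
        List<String> entries = new ArrayList<>();
        if (value == null) {
            return entries;
        }
        for (String entry : value.trim().split("\\s+")) {
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    /** Convenience overload (hypothetical) that parses manifest text held in memory. */
    static List<String> classPathEntries(String manifestText) {
        try {
            byte[] bytes = manifestText.getBytes(StandardCharsets.UTF_8);
            return classPathEntries(new Manifest(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Example manifest content; the paths are made up for illustration.
        String text = "Manifest-Version: 1.0\r\n"
                + "Class-Path: file:/C:/libs/a.jar file:/C:/libs/b.jar\r\n"
                + "\r\n";
        for (String entry : classPathEntries(text)) {
            System.out.println(entry);
        }
    }
}
```

A tool that only inspects the java.class.path system property sees the one wrapper JAR and never looks at these entries, which matches the "will stage 1 files" behaviour reported in this issue.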

However, when this is used with Dataflow, the runner finds only a single file that needs to be uploaded:

26 Jan 2017 13:39:24,496 [main]: com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner.fromOptions(DataflowPipelineRunner.java:302) INFO {} - PipelineOptions.filesToStage was not specified. Defaulting to files from the classpath: will stage 1 files. Enable logging at DEBUG level to see which files will be staged.

The job shows up correctly in the Dataflow dashboard, with the pipeline and the nodes of the graph all laid out correctly, but nothing flows through the source. The logs say "Workers started correctly"; however, the worker logs accessible through Stackdriver show plenty of errors about starting up the containers, which keep retrying over and over again.

Steps to reproduce:

  1. Generate a new project using the Maven archetype from the Dataflow tutorial.
  2. Import the project into IntelliJ using the pom.xml that was generated.
  3. Edit the .idea/workspace.xml file so that it contains
    <component name="PropertiesComponent">
    ...
    <property name="dynamic.classpath" value="true" />
    ...
    </component>
  4. Run any of the WordCount samples; MinimalWordCount would be the simplest.
  5. The Dataflow job is now submitted (with only one file uploaded to the staging area) and nothing flows from the source.

Only setting it back to 'false' fixes it. My development environment is Windows, which I believe has a low limit on command-line length; that is why I had to enable this feature at some point.

I am not aware of how prevalent this issue might be with more complex JAR dependencies, but I have seen it happen because of the IntelliJ dynamic.classpath feature.

Cheers, Dan

dhalperi commented 7 years ago

Hi Dan,

Thanks for this report. It indeed looks like we did not implement handling for a jar with a Class-Path manifest, which is now used in IntelliJ 15+. We'll need to implement this support; I've filed BEAM-1325: DataflowRunner support for Class-Path jars.

dhalperi commented 7 years ago

For now, you can either turn off dynamic classpath or use the --filesToStage option to manually pass in an explicit list of the jars to stage.
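As a rough sketch of the --filesToStage workaround, the snippet below turns an explicit classpath string into a single argument. It assumes the comma-separated form that list-valued pipeline options accept on the command line; the class name FilesToStageSketch and both helper names are hypothetical. Note that with dynamic classpath enabled, java.class.path contains only the wrapper JAR, so the list has to come from a run with the feature turned off or from some other explicit source.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class FilesToStageSketch {

    /** Splits a classpath string on the given separator, dropping empty entries. */
    static List<String> splitClasspath(String classpath, String separator) {
        List<String> files = new ArrayList<>();
        for (String entry : classpath.split(Pattern.quote(separator))) {
            if (!entry.isEmpty()) {
                files.add(entry);
            }
        }
        return files;
    }

    /** Formats the files as one --filesToStage argument (comma-separated form is an assumption). */
    static String filesToStageArg(List<String> files) {
        return "--filesToStage=" + String.join(",", files);
    }

    public static void main(String[] args) {
        // Uses the current JVM classpath; meaningful only when dynamic classpath is off.
        String classpath = System.getProperty("java.class.path");
        List<String> files = splitClasspath(classpath, File.pathSeparator);
        System.out.println(filesToStageArg(files));
    }
}
```

The printed argument can then be added to the program arguments of the run configuration that launches the pipeline.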

You can also use maven-shade-plugin to build a bundled jar that contains all the code. We do this for our examples: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/pom.xml#L307

metabrain commented 7 years ago

Many thanks Dan, hope my report was useful. For now I will keep working with dynamic.classpath disabled, no issues there.

Cheers, Dan