GoogleCloudPlatform / DataflowJavaSDK

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
http://cloud.google.com/dataflow

Dataflow /var/opt/google/dataflow directory created as /var/opt/google/dataflow/dataflow #636

Open junying1 opened 6 years ago

junying1 commented 6 years ago

Dataflow is not able to find files packaged with my classes. I load them with Class.getResource("/data.json"). The Stackdriver log shows it looking for the file in /var/opt/google/dataflow/some-random-jar-name.jar!/data.json. When I ssh into the worker's VM instance, the file is actually in /var/opt/google/dataflow/dataflow/some-random-jar-name.jar. This was working as of 5/9/18.

I tested with the WordCount example straight from Apache Beam documentation: https://beam.apache.org/get-started/quickstart-java/

Followed all the steps. Then added a "resources/data.json" to "src/main". Added the following lines to WordCount.ExtractWordsFn's processElement method:

try {
  String jsonStr = new Scanner(new File(WordCount.class.getResource("/data.json").getFile())).useDelimiter("\\Z").next();
  System.out.println("====================================================");
  System.out.println(jsonStr);
  System.out.println("====================================================");
} catch (Exception e) {
  e.printStackTrace();
}

Sure enough, it runs fine locally with DirectRunner, but with DataflowRunner I got the same error in Stackdriver:

message: "java.io.FileNotFoundException: file:/var/opt/google/dataflow/classes-yGX0uczTTR8A8LXakSr0JA.jar!/data.json (No such file or directory)"

While the example batch job was still running, I ssh'ed into the worker instance and checked /var/opt/google/dataflow. It contains another "dataflow" directory, and the files are copied there, which confirms the double dataflow directory issue.

junying1 commented 6 years ago

I found a workaround: use Class.getResourceAsStream to get an InputStream. For whatever reason, getResourceAsStream works as expected, while getResource still fails. For all of my purposes, an InputStream works just as well as a URL.
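
A minimal sketch of that workaround, assuming the resource is a UTF-8 text file such as the data.json from the repro above. The class name ResourceReader and its two methods are hypothetical helpers for illustration, not part of the Dataflow or Beam SDK. Only Class.getResourceAsStream and java.util.Scanner are real library calls here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical helper: reads a classpath resource through a stream instead of
// converting the resource URL into a java.io.File.
final class ResourceReader {

    // Loads the resource relative to the given class and returns its full text.
    // Throws IOException if the resource is not on the classpath.
    static String readResource(Class<?> anchor, String path) throws IOException {
        try (InputStream in = anchor.getResourceAsStream(path)) {
            if (in == null) {
                throw new IOException("Resource not found on classpath: " + path);
            }
            return readAll(in);
        }
    }

    // Reads the whole stream as UTF-8 text. The "\\A" delimiter only matches
    // at the start of the input, so the single token is the entire content.
    static String readAll(InputStream in) {
        Scanner scanner = new Scanner(in, StandardCharsets.UTF_8.name()).useDelimiter("\\A");
        return scanner.hasNext() ? scanner.next() : "";
    }
}
```

In the modified WordCount example, the Scanner-over-File line would then become something like String jsonStr = ResourceReader.readResource(WordCount.class, "/data.json");. Reading through a stream does not require the resource to exist as a standalone file on the worker's disk, so it does not depend on where the jar ends up being staged.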