GoogleCloudPlatform / DataflowJavaSDK

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
http://cloud.google.com/dataflow

Decompressing bzip2 files with multiple "streams" only reads the first stream, leading to data loss #596

Open lukecwik opened 7 years ago

lukecwik commented 7 years ago

This issue was found in Apache Beam (https://issues.apache.org/jira/browse/BEAM-2708) and also affects Dataflow SDK for Java versions 1.6.0 through 1.9.0.

The fix has been backported in https://github.com/GoogleCloudPlatform/DataflowJavaSDK/pull/592
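
For readers unfamiliar with the failure mode, here is a minimal sketch of what "only reads the first stream" means. It uses Python's standard `bz2` module as an illustration and is not the SDK's Java code or the actual fix. The sample records and the helper names `read_first_stream_only` and `read_all_streams` are hypothetical.

```python
import bz2

# Two independently compressed bzip2 streams concatenated into one payload,
# which is what `cat a.bz2 b.bz2 > both.bz2` would produce.
first = bz2.compress(b"record-1\n")
second = bz2.compress(b"record-2\n")
multi_stream = first + second


def read_first_stream_only(data: bytes) -> bytes:
    """Decode a single stream and stop, mimicking the data-loss behaviour."""
    decoder = bz2.BZ2Decompressor()
    # The decoder stops at the end of the first stream; the remaining
    # bytes land in decoder.unused_data and are silently ignored here.
    return decoder.decompress(data)


def read_all_streams(data: bytes) -> bytes:
    """Decode every concatenated stream by restarting on leftover bytes."""
    chunks = []
    while data:
        decoder = bz2.BZ2Decompressor()
        chunks.append(decoder.decompress(data))
        if not decoder.eof:
            raise ValueError("truncated bzip2 stream")
        data = decoder.unused_data
    return b"".join(chunks)


if __name__ == "__main__":
    print(read_first_stream_only(multi_stream))
    print(read_all_streams(multi_stream))
```

A reader that behaves like `read_first_stream_only` reports success while dropping every record after the first stream, which is why the symptom is silent data loss and not an error.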