GoogleCloudPlatform / DataflowJavaSDK

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
http://cloud.google.com/dataflow

TextIO - add "allow empty" flag #556

Open rantibi opened 7 years ago

rantibi commented 7 years ago

Since SDK 1.9: "Changed FileBasedSource to throw an exception when reading from a file pattern that has no matches. Pipelines will now fail at runtime rather than silently reading no data in this case."

In some of our pipelines we read from multiple buckets, and some of them can be empty at times (we still process data from the other buckets). Because of this, to upgrade to 1.9 we have to change our code so that it first checks whether there are any files in the bucket and returns an empty PCollection when no files match the pattern.
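
The check-first workaround described above can be illustrated with a minimal, SDK-free sketch. It uses only `java.nio` against a local directory rather than a GCS bucket, and the class and method names (`AllowEmptyRead`, `readAllowingEmpty`) are hypothetical, not part of the Dataflow SDK. In a real pipeline the matching step would go through the SDK's own file matching, and the "no matches" branch would produce an empty PCollection instead of an empty list.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: illustrates "check for matches first, return empty if none".
final class AllowEmptyRead {

    private AllowEmptyRead() {}

    /**
     * Reads every line of every regular file in {@code dir} whose name matches
     * {@code glob}. Returns an empty list, rather than throwing, when the
     * directory does not exist or the pattern matches nothing.
     */
    static List<String> readAllowingEmpty(Path dir, String glob) throws IOException {
        List<String> lines = new ArrayList<>();
        if (!Files.isDirectory(dir)) {
            return lines; // nothing to read: treat as empty input
        }
        try (DirectoryStream<Path> matches = Files.newDirectoryStream(dir, glob)) {
            for (Path file : matches) {
                if (Files.isRegularFile(file)) {
                    lines.addAll(Files.readAllLines(file));
                }
            }
        }
        return lines; // still empty if the pattern had no matches
    }
}
```

Calling `AllowEmptyRead.readAllowingEmpty(dir, "*.txt")` on a directory with no matching files yields an empty list, which is the behavior the requested flag would provide without the extra check in user code.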

It would be very useful if TextIO had an allowEmpty() flag that permits silently reading no data at runtime (as withoutValidation() does at init time).

davorbonaci commented 7 years ago

Indeed -- this would be a nice improvement.

Would you perhaps be interested in contributing that feature to the Apache Beam codebase in https://github.com/apache/beam?