apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Bug]: TextIO.read().withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW) still fails if no file is found #31296

Open arkadioz opened 2 months ago

arkadioz commented 2 months ago

What happened?

According to the documentation (https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/TextIO.html), the read() method can be configured to allow empty matches, so I configured a pipeline step like this:

PCollection<KV<String, Incident>> incidents = pipeline
    .apply("READ CSV", TextIO.read()
        .withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW)
        .from("gs://bucket/incidents.csv"))
    .apply("Convert to bean and output KV Collection", ParDo.of(new IncidentCsvToBeanFunction()));
// IncidentCsvToBeanFunction just maps the CSV content to a Java class.

However, the pipeline still fails to start when incidents.csv is not present in the bucket. When I use a wildcard (*), for example incidents*.csv or *.csv, it works even if incidents.csv does not exist. From the documentation, my understanding is that with .withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW) it should also work with plain incidents.csv and no wildcard, even if the CSV is not present, so I consider this a bug unless I misunderstood. The error I'm getting in the Google Cloud Dataflow logs is:

"[The preflight pipeline validation failed for job 2024-05-14_15_25_49-jobnumbers. To bypass the validation, use the Dataflow service option with the value enable_preflight_validation=false. Learn more at https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#validation] NOT_FOUND: Unable to find object incidents.csv in bucket bucket_name.

What I expect is the same behaviour as with the wildcard: the pipeline continues and returns an empty PCollection, even when the CSV does not exist in the bucket.

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

arkadioz commented 2 months ago

Sorry, I think the problem is the Dataflow preflight validation service. I tried what the link in the error message suggests: https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline?hl=es-419#validation

using: --dataflowServiceOptions=enable_preflight_validation=false

and it allowed the pipeline to start and work as expected. But I believe this is a bad solution: what if I still want preflight validation on? The preflight service should also recognize .withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW) and allow the pipeline to start even if it separately detects that the CSV does not exist in the bucket.
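
For anyone who launches the job from code rather than typing the flag by hand, here is a minimal sketch of appending the same service option to the launcher arguments. The class name LaunchArgs and the method withPreflightDisabled are hypothetical names made up for illustration; only the flag string itself comes from this thread. It is a workaround sketch, not a fix for the underlying behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Beam or Dataflow): adds the preflight
// opt-out flag quoted in this thread to a set of launcher arguments.
final class LaunchArgs {
    // Flag copied verbatim from the comment above.
    static final String PREFLIGHT_OFF =
        "--dataflowServiceOptions=enable_preflight_validation=false";

    private LaunchArgs() {}

    // Returns a copy of args with the opt-out flag appended. If any argument
    // already mentions enable_preflight_validation (with either value), the
    // arguments are returned unchanged so an explicit choice is not overridden.
    static String[] withPreflightDisabled(String[] args) {
        List<String> out = new ArrayList<>(Arrays.asList(args));
        boolean alreadyMentioned =
            out.stream().anyMatch(a -> a.contains("enable_preflight_validation"));
        if (!alreadyMentioned) {
            out.add(PREFLIGHT_OFF);
        }
        return out.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Print the effective argument list for inspection.
        System.out.println(String.join(" ", withPreflightDisabled(args)));
    }
}
```

The resulting array could then be passed to PipelineOptionsFactory.fromArgs(...) in the usual way. Note that this turns off preflight validation for the whole job, which is exactly the trade-off described above.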

liferoad commented 2 months ago

This looks like a bug in the service. Can you open a Google cloud support case?

arkadioz commented 2 months ago

@liferoad Sure, I can do that. Can you please share the link to open one? I have not opened one before, unless you mean the Google Cloud community forum?

liferoad commented 2 months ago

Please check this: https://cloud.google.com/dataflow/docs/support/getting-support#file-bugs-or-feature-requests