apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Bug]: Python sdk harness failed: TypeError: can only concatenate str (not "NoneType") to str #28131

Closed dummy-work-account closed 1 year ago

dummy-work-account commented 1 year ago

What happened?

I'm trying to send custom metrics to Datadog using a DoFn, but the Python SDK harness is failing with an error that I don't know how to interpret:

 Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 181, in main
    sdk_harness.run()
  File "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py", line 256, in run
    getattr(self, SdkHarness.REQUEST_METHOD_PREFIX + request_type)(
TypeError: can only concatenate str (not "NoneType") to str

I'm running the streaming pipeline as a Flex Template on GCP Dataflow, using the Python requests module for the POST calls.
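For context, the DoFn making the POST calls is roughly shaped like the sketch below (simplified: the metric name, endpoint, and API key handling are illustrative placeholders, not the exact code):

import json
import time

import apache_beam as beam
import requests


class SendToDatadog(beam.DoFn):
    """Posts a hardcoded gauge value to the Datadog metrics API (simplified sketch)."""

    def process(self, element):
        payload = {
            "series": [{
                "metric": "custom.pipeline.messages",  # placeholder metric name
                "points": [[int(time.time()), 1]],
                "type": "gauge",
            }]
        }
        # The API key and endpoint below are placeholders; real values come from config.
        resp = requests.post(
            "https://api.datadoghq.com/api/v1/series",
            headers={"DD-API-KEY": "<redacted>", "Content-Type": "application/json"},
            data=json.dumps(payload),
            timeout=10,
        )
        yield resp.status_code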


Issue Priority

Priority: 3 (minor)

Issue Components

dummy-work-account commented 1 year ago

When I comment out the WindowInto line, the error goes away, but the pipeline still doesn't function as expected -- which might be an issue with my custom DoFn.

    # ConfluentAvroReader and SendToDatadog are custom DoFns defined elsewhere;
    # options, subscription_id and schema_registry_conf are defined earlier (not shown).
    with beam.Pipeline(options=options) as pipeline:
        messages = (
            pipeline
            | f"Read from input topic {subscription_id}" >>
            beam.io.ReadFromPubSub(subscription=subscription_id,
                                   with_attributes=False)
            | f"Deserialize Avro {subscription_id}" >> beam.ParDo(
                ConfluentAvroReader(schema_registry_conf)).with_outputs(
                    "record", "error"))

        # Tagged outputs from the deserializer.
        records = messages["record"]
        errors = messages["error"]

        (records
         | 'Aggregate msgs in fixed window' >> beam.WindowInto(beam.window.FixedWindows(15))
         | 'Send hardcoded value to datadog' >> beam.ParDo(SendToDatadog())
         | 'Print results' >> beam.Map(print)
        )
tvalentyn commented 1 year ago

The error happens in the SDK internals and is rather strange. It sounds as though the runner sent a malformed request to the SDK. I would suggest you try again; if the issue still persists, please provide a minimal pipeline that reproduces it so we can try it out, or work with Dataflow customer support.

dummy-work-account commented 1 year ago

Thank you for the feedback! The issue has persisted -- I'm working on a minimal pipeline that can reproduce the error now.
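Roughly, the repro I'm putting together keeps just the Pub/Sub read, the fixed window, and a trivial DoFn -- something along these lines (the subscription path is a placeholder):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class NoOpDoFn(beam.DoFn):
    def process(self, element):
        # Do nothing interesting; just pass a value through.
        yield len(element)


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         | "Read" >> beam.io.ReadFromPubSub(
             subscription="projects/<project>/subscriptions/<sub>")
         | "Window" >> beam.WindowInto(beam.window.FixedWindows(15))
         | "NoOp" >> beam.ParDo(NoOpDoFn())
         | "Print" >> beam.Map(print))


if __name__ == "__main__":
    run()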

tvalentyn commented 1 year ago

Hi, any news about the repro? Thanks!

dummy-work-account commented 1 year ago

Thanks for the follow-up! I ended up going a different route: I removed the Datadog module from the pipeline and set up a separate container to relay messages from Pub/Sub to Datadog. It feels like Beam/Dataflow is structured around data transformation -- trying to coerce it into making external API calls seems like a bad idea in hindsight (especially on an unpeered VPC).
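For anyone landing here later, the relay container is essentially a small Pub/Sub subscriber loop like the sketch below (the subscription path, metric name, and API key are placeholders):

import json
import time

import requests
from google.cloud import pubsub_v1

SUBSCRIPTION = "projects/<project>/subscriptions/<sub>"  # placeholder
DATADOG_URL = "https://api.datadoghq.com/api/v1/series"


def callback(message):
    # Forward one metric point per Pub/Sub message, then ack.
    payload = {
        "series": [{
            "metric": "custom.pipeline.messages",  # placeholder metric name
            "points": [[int(time.time()), 1]],
            "type": "count",
        }]
    }
    requests.post(
        DATADOG_URL,
        headers={"DD-API-KEY": "<redacted>", "Content-Type": "application/json"},
        data=json.dumps(payload),
        timeout=10,
    )
    message.ack()


def main():
    subscriber = pubsub_v1.SubscriberClient()
    future = subscriber.subscribe(SUBSCRIPTION, callback=callback)
    try:
        future.result()  # block until the streaming pull stops
    except KeyboardInterrupt:
        future.cancel()


if __name__ == "__main__":
    main()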