GoogleCloudPlatform / DataflowTemplates

Cloud Dataflow Google-provided templates for solving in-Cloud data tasks
https://cloud.google.com/dataflow/docs/guides/templates/provided-templates
Apache License 2.0

PubSub to Splunk template fails after latest update #295

Closed felix-d closed 1 year ago

felix-d commented 3 years ago

Our Splunk dataflow job started failing after being cloned and restarted today. I noticed the version had been bumped to 2021-09-13-00_rc00.

Forcing a previous version of the template (2021-03-08-01_RC00) fixed the issue. This is likely the one we were using before the job was restarted.

Stack trace:

org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 27; received: 0)
    at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
    at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:198)
    at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:101)
    at org.apache.http.impl.execchain.ResponseEntityProxy.streamClosed(ResponseEntityProxy.java:142)
    at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
    at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:172)
    at com.google.api.client.http.HttpResponse.ignore(HttpResponse.java:427)
    at com.google.api.client.http.HttpResponse.disconnect(HttpResponse.java:441)
    at com.google.cloud.teleport.splunk.SplunkEventWriter.flush(SplunkEventWriter.java:266)
    at com.google.cloud.teleport.splunk.SplunkEventWriter.processElement(SplunkEventWriter.java:184)

Looks like a commit was shipped today to prevent this error from disrupting the job. https://github.com/GoogleCloudPlatform/DataflowTemplates/commit/d7a1d3ce5f3301ef6c79b4d1b40b2c2ff5700cbd

So this might be a non-issue in the next version, but I still wanted to raise the flag.

onetwopunch commented 3 years ago

+1 I'm getting the same message after restarting our Splunk Dataflow job today (just cloned the old one) because I realized that it had a public IP and shouldn't have. Still getting logs but they're coming in at much different intervals than expected which triggered some PagerDuty alerts.

[image: watchdog alert chart]

We have a watchdog alert that checks in once every two minutes from two hosts. As you can see the chart goes crazy right after I deployed the new template. I'm not even sure how to roll back with Dataflow templates since they don't seem to be versioned so if anyone can help out there, I'd appreciate it.

prathapreddy123 commented 3 years ago

All templates (including old ones) are available in the public dataflow-templates GCS bucket. The latest release happened on 09/20/2021. To find all releases in the current year:

$ gsutil ls "gs://dataflow-templates/2021-*" | grep Splunk$

If you want to pick a different version of the template in the UI, choose the custom template option under the Dataflow Template dropdown and provide the appropriate template path, e.g. dataflow-templates/2021-09-13-00_RC00/Cloud_PubSub_to_Splunk for the previous version.
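
For anyone scripting the rollback instead of using the UI, here is a minimal sketch of the same idea. The helper names (template_path, newest_before, print_run_command) and the JOB_NAME, REGION and TEMPLATE_PARAMETERS placeholders are made up for illustration, and the sample listing just reuses release names mentioned in this thread. It only prints a gcloud dataflow jobs run command and does not launch anything, so check the flags and parameters against your own job before running it.

```shell
#!/usr/bin/env bash
# Sketch: pin the Pub/Sub to Splunk job to a dated template release
# instead of "latest". All names below are illustrative assumptions.

TEMPLATE_BUCKET="gs://dataflow-templates"
TEMPLATE_NAME="Cloud_PubSub_to_Splunk"

# Build the full GCS path for a release directory such as 2021-08-30-00_RC00.
template_path() {
  printf '%s/%s/%s\n' "$TEMPLATE_BUCKET" "$1" "$TEMPLATE_NAME"
}

# Read template paths on stdin and print the newest release directory
# that sorts strictly before the given cutoff (plain string comparison,
# which works because release names start with YYYY-MM-DD).
newest_before() {
  LC_ALL=C awk -F/ -v cutoff="$1" '$4 < cutoff { print $4 }' \
    | LC_ALL=C sort | tail -n 1
}

# Print (do not run) a launch command for the pinned release.
# JOB_NAME, REGION and TEMPLATE_PARAMETERS are placeholders to fill in.
print_run_command() {
  printf 'gcloud dataflow jobs run %s --region %s --gcs-location %s --parameters %s\n' \
    "JOB_NAME" "REGION" "$(template_path "$1")" "TEMPLATE_PARAMETERS"
}

# Sample listing; in practice, pipe in the output of:
#   gsutil ls "gs://dataflow-templates/2021-*" | grep Splunk$
listing='gs://dataflow-templates/2021-03-08-01_RC00/Cloud_PubSub_to_Splunk
gs://dataflow-templates/2021-08-30-00_RC00/Cloud_PubSub_to_Splunk
gs://dataflow-templates/2021-09-13-00_RC00/Cloud_PubSub_to_Splunk
gs://dataflow-templates/2021-10-04-00_RC00/Cloud_PubSub_to_Splunk'

# Pick the newest release before the one that first showed the error.
good="$(printf '%s\n' "$listing" | newest_before "2021-09-13")"
print_run_command "$good"
```

Pinning a dated path this way means a restart or clone of the job keeps using the same release rather than silently picking up a newer one.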

onetwopunch commented 3 years ago

Thanks @prathapreddy123 I'll try that.

Update: Ok that worked, but the 09/13 release was still busted with the same error. I ended up going back to the 2021-08-03 release which is what was working before. It's unsettling that this could have been so obviously broken for so long without anyone seeming to notice 😞

danekantner commented 3 years ago

2021-08-30-00_RC00 is the newest template we can get to work, 9-13 breaks it

this also coincides with the underlying SDK changing from 2.29 to 2.32 AFAIK

danekantner commented 3 years ago

With no other apparent changes, the 2021-10-04-00_RC00 template seems to work better, in that it works at all now; but we're still seeing a slightly different variation of the same error being logged:

2021-10-12T17:20:24.269Z Error trying to disconnect from Splunk: Premature end of Content-Length delimited message body (expected: 27; received: 0) Messages should still have either been published or prepared for error handling, but there might be a connection leak. Stack Trace: [org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178), org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:198), org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:101), org.apache.http.impl.execchain.ResponseEntityProxy.streamClosed(ResponseEntityProxy.java:142), org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228), org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:172), com.google.api.client.http.HttpResponse.ignore(HttpResponse.java:427), com.google.api.client.http.HttpResponse.disconnect(HttpResponse.java:441), com.google.cloud.teleport.splunk.SplunkEventWriter.flush(SplunkEventWriter.java:273), com.google.cloud.teleport.splunk.SplunkEventWriter.processElement(SplunkEventWriter.java:184), com.google.cloud.teleport.splunk.AutoValue_SplunkEventWriter$DoFnInvoker.invokeProcessElement(Unknown Source), org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:232), org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:188), org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:339), org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44), org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49), org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:212), org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:163), org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:92), org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1435), org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1100(StreamingDataflowWorker.java:165), org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:1111), java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128), java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628), java.base/java.lang.Thread.run(Thread.java:834)]

It seems to be related to the change in https://github.com/GoogleCloudPlatform/DataflowTemplates/commit/d7a1d3ce5f3301ef6c79b4d1b40b2c2ff5700cbd

rarsan commented 2 years ago

Folks, as noted above, the issue was unfortunately introduced in the 2021-09-13-00_RC00 release when we upgraded the underlying HTTP Java client library. That upgrade introduced a change in how HTTP response disconnect is implemented.

The issue was mitigated (not fixed) as of the 2021-09-27-00_RC00 release by safely catching these errors (and logging warnings instead) to ensure log delivery is uninterrupted. Yes, that means there will be a warning message per batched request (not per log message). That warning can be safely ignored until we get a fix from the dependent library (HTTP client) that we can incorporate in an upcoming release. If you wish to reduce worker log verbosity, see the Dataflow docs to set worker log levels accordingly (includes setting the log level for a specific class).

For production workloads, and especially given current release cadence (almost once per week), it's highly recommended to:

Thanks for reporting this. For potential future issues, consider also filing a Google Cloud support case for a faster path to resolution.

erhanX commented 2 years ago

Thanks for hinting at this problem and workaround. Is there any progress on a permanent fix?

When I try to use the old template, unfortunately I get TLS-related connectivity issues, which show up in the job on the GCP side: sender and receiver do not have shared ciphers. From the GCP side:

Error writing to Splunk: Received fatal alert: handshake_failure

From Splunk side:

WARN HttpListener - Socket error from XXX while idling: error:1408A0C1:SSL routines:ssl3_get_client_hello:no shared cipher

Interestingly, on one of the job shells, there are common ciphers:

$ openssl s_client -connect SPLUNKSERVER -tls1_2

[...]
New, TLSv1.2, Cipher is ECDHE-RSA-AES256-GCM-SHA384
[...]

$ openssl ciphers -s
TLS_AES_256_GCM_SHA384:[...]

rarsan commented 2 years ago

The dependent HTTP Java client library was recently upgraded to 1.40.1 which includes the fix. New template release forthcoming. Will update here.

@erhanX could you file a new issue for the error you're seeing? Please include the template version, (non-sensitive) parameter values, and the specific SSL cipher or cipher suite used for the Splunk server certificate. For a fast response, consider also filing a support ticket from your Cloud Console if you're a GCP customer.

erhanX commented 2 years ago

Thank you very much for your help. I will wait for the update. We got the latest version partially working by setting the maximum nodes for workers to 1. I had no solution for the older version and the error but I think it is obsolete for my case now anyway.

danekantner commented 2 years ago

Is this update included in the latest release 2022-02-07-00_RC01 or still forthcoming?

zhoufek commented 2 years ago

Yes. The release contains fixes currently in the repository for all released templates. I've updated the release notes to mention this.

Since there's a few years' worth of unpublished releases, I'm not sure it's feasible to document every change specifically. Going forward, we should be capturing each change in the relevant release notes.

erhanX commented 2 years ago

Thanks for the good news and your work! Is this already available in gs://dataflow-templates/latest/Cloud_PubSub_to_Splunk ?

zhoufek commented 2 years ago

Yes, it should be.

bvolpato commented 1 year ago

The problem has been resolved.

--

This issue has been stale for some time now. Please reopen it if there is a follow up or any related questions.