dataform-co / dataform-web-tracking

Dataform is a collaborative data modelling platform that enables analysts and engineers to manage complex data models in SQL
https://dataform.co
MIT License

"read ECONNRESET" error during Redshift/Snowflake runs (tracking) #205

Closed lewish closed 4 years ago

lewish commented 5 years ago

We've been investigating this issue most of today and are exploring a number of potential fixes. Note that the issue doesn't happen when running projects via the Dataform CLI, which can be used as a workaround if necessary.

This is affecting users on Redshift, particularly for projects / schedules with a lot of actions and operations.

BenBirt commented 4 years ago

Minor edit as this is also affecting users on Snowflake.

JoonasSalmelaKaleva commented 4 years ago

Any updates on this issue? Is there a workaround when using the web GUI?

dwl285 commented 4 years ago

@JoonasSalmelaKaleva we are still working on this. We rolled out a potential fix to our staging instance today. On Monday we'll check to see if this solved the problem and go from there.

We've been finding this only affects projects intermittently. Do you see it on anything other than schedule runs?

JoonasSalmelaKaleva commented 4 years ago

@dwl285 Thanks for the info! Almost all of our scheduled runs have been failing daily. Schedules sometimes need multiple manual reruns to succeed fully. Let’s hope the fix helps.

lewish commented 4 years ago

Quick update on this:

lewish commented 4 years ago

Another update:

  • We are highly confident that we have now identified the issue: it was in our cluster and related to our NAT configuration.
  • We've implemented a quick fix that has greatly reduced the incident rate.
  • We have a few changes in flight to Dataform open-source that reduce the number of connections we make to hosts, to reduce the risk further.
  • We now have monitoring in place to detect this happening again so we can keep an eye on it.

Please let us know if you see this again and we will investigate!

Downgrading this to P2 now.

maxcountryman commented 4 years ago

We’ve been seeing this as recently as ~30 mins ago.

dwl285 commented 4 years ago

@maxcountryman I believe the errors on your project are something unrelated - I'll follow up with you separately about this.
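
Aside for readers who hit the same intermittent "read ECONNRESET" failures in their own Node.js tooling: the manual reruns described earlier in this thread can be approximated with a small retry wrapper. This is only a rough sketch, not Dataform's actual fix (which, per the thread, concerned NAT configuration and connection counts). The names `isConnectionReset`, `backoffDelayMs`, and `withResetRetry` are hypothetical and do not come from the Dataform codebase. It assumes the failing operation is safe to repeat and that the error carries Node's usual `code` property.

```typescript
// Hypothetical helpers for illustration; not part of Dataform.

// True when the error looks like a connection dropped by the peer
// (Node.js sets `code` to "ECONNRESET" on such socket errors).
function isConnectionReset(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { code?: unknown }).code === "ECONNRESET"
  );
}

// Exponential backoff: baseMs, 2 * baseMs, 4 * baseMs, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 10_000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Runs `action`, retrying only on ECONNRESET, for at most `maxAttempts` tries.
// Any other error, or the final failed attempt, is rethrown to the caller.
async function withResetRetry<T>(
  action: () => Promise<T>,
  maxAttempts = 3,
  baseMs = 500
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await action();
    } catch (err) {
      const isLastAttempt = attempt + 1 >= maxAttempts;
      if (!isConnectionReset(err) || isLastAttempt) {
        throw err;
      }
      // Wait before the next attempt so a struggling gateway can recover.
      await new Promise<void>((resolve) =>
        setTimeout(resolve, backoffDelayMs(attempt, baseMs))
      );
    }
  }
}
```

A caller would wrap a single warehouse query, for example `await withResetRetry(() => runQuery(sql))`, where `runQuery` stands in for whatever client function is in use. Retrying is only appropriate for statements that are safe to run twice, and it masks rather than removes the underlying network problem.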