Closed by lewish 4 years ago
Minor edit as this is also affecting users on Snowflake.
Any updates on this issue? Is there a workaround when using the web GUI?
@JoonasSalmelaKaleva we are still working on this. We rolled out a potential fix to our staging instance today. On Monday we'll check to see if this solved the problem and go from there.
We've been finding that this only affects projects intermittently. Do you see it on anything other than scheduled runs?
@dwl285 Thanks for the info! Almost all of our scheduled runs have been failing daily. Schedules sometimes need multiple manual reruns to succeed fully. Let’s hope the fix helps.
Quick update on this:
Another update:
- We are highly confident that we have now identified the issue: it was with our cluster and related to our NAT configuration.
- We've implemented a quick fix that has greatly reduced the incident rate.
- We have a few changes in flight to Dataform open-source that reduce the number of connections we make to hosts, which should reduce the risk further.
- We now have monitoring in place to detect this happening again so we can keep an eye on it.
Please let us know if you see this again and we will investigate!
Downgrading this to P2 now.
We’ve been seeing this as recently as ~30 minutes ago.
@maxcountryman I believe the errors on your project are something unrelated - I'll follow up with you separately about this.
We've been investigating this issue for most of today and are exploring a number of potential fixes. Note that the issue doesn't happen when running projects via the Dataform CLI, which can be used as a workaround if necessary.
This is affecting users on Redshift, particularly for projects / schedules with a lot of actions and operations.