
Synapse cluster has intermittent .net exception related to DNS #1092

Closed dbeavon closed 1 year ago

dbeavon commented 2 years ago

I realize it is a long shot to post a question here about "managed private endpoints" and DNS name resolution in a Synapse Spark cluster.

This forum has smart people, so I thought I would at least ask.... maybe I'll eventually figure it out and post a response myself.

We use a managed private endpoint to connect to an Azure SQL database from the Spark cluster (which is hosted in the Synapse managed VNet). Everything works well 99.9% of the time, but then I will intermittently get a network error related to something dumb like DNS. I'm not sure why the DNS name wouldn't be cached in RAM and/or work in a more reliable way. Here is the full exception:

[2022-08-23T18:23:09.8792516Z] [vm-e2f44742] [Error] [TaskRunner] [17] ProcessStream() failed with exception: System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 35 - An internal exception was caught)
---> System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (00000001, 11): Resource temporarily unavailable
   at System.Net.Dns.InternalGetHostByName(String hostName)
   at System.Net.Dns.GetHostAddresses(String hostNameOrAddress)
   at System.Data.SqlClient.SNI.SNITCPHandle.Connect(String serverName, Int32 port, TimeSpan timeout)
   at System.Data.SqlClient.SNI.SNITCPHandle..ctor(String serverName, Int32 port, Int64 timerExpire, Object callbackObject, Boolean parallel)
   at System.Data.ProviderBase.DbConnectionPool.CheckPoolBlockingPeriod(Exception e)
   at System.Data.ProviderBase.DbConnectionPool.CreateObject(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
   at System.Data.ProviderBase.DbConnectionPool.UserCreateRequest(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
   at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.Open()
   at UFP.DataRail.Spark.Common.Utils.Database.UfpDataVaultConnectivity.OpenNewConnection()

I suspect I'm probably connecting and disconnecting more than I should be. But I would think that SQL would be the bottleneck rather than DNS. I'm not even sure how DNS works for a "managed private endpoint". Apparently not very well. I suspect there is some throttling enforced on me, and it is totally out of my control. Maybe I just need to wait 100 ms before every round-trip SQL connection.
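(For what it's worth, the "wait before reconnecting" idea amounts to a small retry wrapper around SqlConnection.Open(). The sketch below is hypothetical and not the code from our job; it treats every SqlException as transient, which a real implementation should not.)

```csharp
using System;
using System.Data.SqlClient;
using System.Threading;

// Hypothetical sketch: retry SqlConnection.Open() with a short, growing delay,
// on the theory that a brief pause between round-trip connections might ride
// out an intermittent DNS failure.
static SqlConnection OpenWithRetry(string connectionString, int maxAttempts = 3)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            var connection = new SqlConnection(connectionString);
            connection.Open();
            return connection;
        }
        catch (SqlException) when (attempt < maxAttempts)
        {
            // Back off briefly before retrying; the failures in this issue
            // were transient, so a short pause is often enough.
            Thread.Sleep(100 * attempt);
        }
    }
}
```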

I currently have spark.task.maxFailures set to 1, which is probably exacerbating the impact of an intermittent DNS bug. That is what we used when the workload ran on Databricks, primarily because the failures were normally related to data problems, and retries would be pointless.
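(For context, this is roughly where that setting lives when the session is built with .NET for Apache Spark; in Synapse the value is normally supplied through the pool or job Spark configuration rather than in application code. Names below are placeholders.)

```csharp
using Microsoft.Spark.Sql;

// Sketch only: showing where spark.task.maxFailures would be set when building
// the session. In Synapse this is usually supplied via the Spark pool / job
// configuration instead of application code.
SparkSession spark = SparkSession
    .Builder()
    .AppName("sql-ingestion-job")              // placeholder app name
    .Config("spark.task.maxFailures", "1")     // no task retries (current setting)
    .GetOrCreate();
```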

Any help would be appreciated.

dbeavon commented 2 years ago

Just an update. I'm still having trouble with SQL connectivity from Spark pools in Synapse. The errors happen intermittently while using the .Net SqlClient.

I'm working on a tech support case at the moment, which will hopefully get eventual attention from the PG team.

I strongly suspect that there is a bug in the proprietary network tech that Microsoft uses for their VNets, managed VNets, managed private endpoints, and private links. Unfortunately this type of thing isn't very easily googled. I am already aware of a VNet-related networking bug that is supposed to be fixed in January; I came across it in a totally separate PaaS resource. But I'm not sure if it is the same VNet-related bug that is impacting us here in Synapse Spark.

I'll be playing with workarounds and updating this issue if I learn anything. I'm pretty sure that extending my connection timeouts is a workaround for some of the errors, but not all of them. In any case, the Spark cluster and SQL server are in the same Azure region, and it is frustrating that they can't talk to each other reliably without introducing a 30-second timeout (or whatever). Another workaround that I am particularly reluctant to pursue is to wrap large chunks of my code in arbitrary retry loops (or use Spark features for that purpose). This just adds complexity, and it obscures the underlying problem. If there were a "legitimate" reason for unreliable connectivity, then I'd be OK with a workaround. But I don't want to do it just because Microsoft won't fix a bug in their networking. It's not like I'm transmitting data to the moon and back; all the traffic is in East US, and the components are presumably within a 1 ms distance of each other.
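(For reference, the "extend the connection timeouts" workaround is nothing more than raising Connect Timeout in the connection string; SqlClient's default is 15 seconds. The names below are placeholders, not our actual configuration.)

```csharp
using System.Data.SqlClient;

// Sketch with placeholder server/database names: the "extend the timeout"
// workaround is just a larger Connect Timeout (SqlClient's default is 15 seconds).
var builder = new SqlConnectionStringBuilder
{
    DataSource = "myserver.database.windows.net",   // placeholder
    InitialCatalog = "mydatabase",                  // placeholder
    ConnectTimeout = 30                             // seconds; up from the 15s default
};
string connectionString = builder.ConnectionString;
```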

dbeavon commented 2 years ago

Another update. I changed some of my Spark code to use SQL logins (contained users with UID/PWD), and after the change I was not seeing the same reliability problems with the network. Previously my authentication to Azure SQL relied on AAD access tokens (OAuth2 using a confidential-client service principal).
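(To make the two approaches concrete, the sketch below contrasts them. Server, database, and login names are placeholders, and the token-acquisition step for the AAD path is omitted.)

```csharp
using System.Data.SqlClient;

// Option A: AAD access token (the original approach). The token acquisition
// (confidential-client OAuth2 flow, e.g. via MSAL) is omitted here.
string aadAccessToken = "<token acquired via service principal>";
var aadConnection = new SqlConnection(
    "Data Source=myserver.database.windows.net;Initial Catalog=mydatabase;");
aadConnection.AccessToken = aadAccessToken;
aadConnection.Open();

// Option B: contained SQL login with UID/PWD (the approach that avoided the failures).
var sqlLoginConnection = new SqlConnection(
    "Data Source=myserver.database.windows.net;Initial Catalog=mydatabase;" +
    "User ID=app_user;Password=<placeholder>;");
sqlLoginConnection.Open();
```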

I'm not sure how or why the authentication strategy is impacting my ability to connect to SQL. The exception we are seeing at the client (System.Data.SqlClient) certainly doesn't indicate that the problem is related to authentication.

There may be more than one factor involved. Another factor that may be contributing to the failures is that I'm authenticating to the resource fairly frequently (once per mapped group on several concurrent executors).

I am still working with CSS support at Microsoft to try and find an explanation for the failures. Ideally the .Net exceptions would be self-explanatory, but they are not. Similarly, there should be some place to monitor whatever throttling is going on (whether intentional or not). However, SQL Server seems to have low utilization based on all the common metrics.

dbeavon commented 2 years ago

Another update. I'm still waiting on help from support (CSS).

Based on my own limited investigation, there may be a bug in systemd-resolved, which is used by the Ubuntu distro underneath Synapse Spark. The distribution is a recent version of Ubuntu Bionic. I believe that systemd-resolved is a DNS caching component (a "stub resolver"), and I believe it is either buggy, blocking, crashing, or all of the above.

It isn't hard to google systemd-resolved and find lots of search results indicating there are bugs, although some of that might just be noise. Since I'm not a huge Linux guy, I'm not sure how to filter out the noise, or determine the likelihood that one of these results is relevant to Spark in Synapse.
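(A bare DNS lookup loop, run from a notebook on the cluster, ought to reproduce the same socket error while taking SqlClient out of the picture entirely. This is only a sketch; the hostname is a placeholder.)

```csharp
using System;
using System.Net;
using System.Net.Sockets;

// Hypothetical repro sketch: hammer Dns.GetHostAddresses for the SQL endpoint
// and count failures. If "Resource temporarily unavailable" shows up here,
// the problem is in name resolution, not in SqlClient.
int failures = 0;
for (int i = 0; i < 1000; i++)
{
    try
    {
        IPAddress[] addresses = Dns.GetHostAddresses("myserver.database.windows.net");
    }
    catch (SocketException ex)
    {
        failures++;
        Console.WriteLine($"Lookup {i} failed: {ex.SocketErrorCode} - {ex.Message}");
    }
}
Console.WriteLine($"Failures: {failures} / 1000");
```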

dbeavon commented 2 years ago

Another update. I'm still waiting on help from support (CSS and PG).

I'm still convinced there is a bug in systemd-resolved (at least in Azure). IMHO, this is probably something that Microsoft is well aware of, both on the network team and on whatever team manages the Ubuntu image.

Interestingly, an update in this DNS component caused a large problem in Azure about a week after I first reported my DNS problems above:

https://www.theregister.com/2022/08/30/ubuntu_systemd_dns_update/

Ubuntu Linux 18.04 systemd security patch breaks DNS in Microsoft Azure

Tue 30 Aug 2022

While I'm pretty certain where the problem lies, as a Synapse user it is very difficult for me to investigate or troubleshoot. Synapse doesn't give us the necessary rights in Ubuntu to investigate any logs, run sudo commands, or test things like flushing DNS, adding my hostnames to /etc/hosts, or using another DNS service to resolve host names.

I'm still hoping CSS and PG will eventually prioritize this. If anyone has access to them, I'd appreciate help in trying to motivate them to fix this.

dbeavon commented 1 year ago

Another update. I'm still waiting on help from support (CSS and PG).

The unpredictable problems with DNS may be the fault of an outdated .Net runtime (.Net Core 3.1), an outdated Linux distribution (Ubuntu 18.04), or both. Upgrading one of these in Synapse may be the best fix. There are probably other fixes too, like editing the hosts file to include the relevant counterparties.

I'm not sure why Spark in Synapse is running on such outdated platform components. This may be something that is currently being worked on (see https://github.com/dotnet/spark/issues/1032).

I have no workarounds yet to consistently avoid the DNS problems.

It is unlikely that the Synapse environment will be upgraded to .Net 6 any time in the near future; Synapse customers are likely to remain on .Net Core 3.1 for a while. So I'm still hoping that CSS and PG will share the root cause of the bug and help identify a workaround.

I had a meeting with a PM on the Synapse team, and he seemed somewhat interested in (and concerned by) the fact that .Net Core 3.1 is going end-of-life next week. However, I don't think this will translate into an upgrade for .Net any time in the near future.

dbeavon commented 1 year ago

Still working on this.

The problems seem to stem from the following:

I'm guessing I will be able to recreate the problem in Scala, Java, or Python, and if I do, I should be able to get someone on the Synapse team to care more about this. This is probably a multi-factor problem, and in my opinion it is not really exclusive to .Net (unless the .Net libraries are asking for IPv6 addresses to a greater degree than other languages, since those are the lookups that appear to return negative replies).
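(In .NET for Spark, a cross-executor repro could look something like the sketch below, which pushes the same DNS lookup onto the executors as a UDF. The hostname and row count are placeholders, and an equivalent Scala/Python UDF should behave the same way if the resolver on the nodes is at fault.)

```csharp
using System;
using System.Linq;
using System.Net;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

// Sketch: distribute DNS lookups across executors with a UDF and tally the results.
SparkSession spark = SparkSession.Builder().GetOrCreate();

Func<Column, Column> resolve = Udf<string, string>(host =>
{
    try
    {
        return string.Join(",", Dns.GetHostAddresses(host).Select(a => a.ToString()));
    }
    catch (Exception ex)
    {
        return "ERROR: " + ex.Message;
    }
});

DataFrame df = spark.Range(0, 100000)
    .WithColumn("host", Lit("myserver.database.windows.net"));  // placeholder hostname

df.WithColumn("result", resolve(df["host"]))
  .GroupBy("result")
  .Count()
  .Show(20, 200);
```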

dbeavon commented 1 year ago

Final update: (eight months later)

The problem was specific to the DNS implementation on the nodes in the Synapse cluster (on Ubuntu Bionic Beaver).

To be more specific, the Synapse PG has disabled part of the DNS cache: they are specifying a custom configuration for the "no-negcache" option in dnsmasq, which turns off caching of negative (not-found) DNS responses.

As a result, the SqlClient libraries in .Net are constantly asking for a missing IPv6 address, and those requests are constantly being sent out over the Internet. Without a fully functional cache, a heavy-duty Spark job will swamp a remote DNS service pretty quickly (especially one in Azure that has throttling in effect).

I'm not sure why this issue didn't affect all the rest of the Synapse Spark customers. I get the feeling that adoption of the platform is not happening very rapidly yet. (Given how hard it was for me to migrate from Azure Databricks, I can see why others may be slow to migrate.)

. . .

Here is the final summary of the support case per CSS. I was asked not to provide the ICM number, but the final two digits are "30". If anyone tries to re-open the same topic, you should be able to use those digits to confirm whether you've found the right paper trail.

Issue: Spark application is failing with the below error message:
System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (00000001, 11): Resource temporarily unavailable
   at System.Net.Dns.InternalGetHostByName(String hostName)
   at System.Net.Dns.GetHostAddresses(String hostNameOrAddress)
   at Submission#14.GetIPusingHostname(String hostname)
   at Submission#18.<>d__0.MoveNext()

Cause: The DNS server is getting overloaded with IPv6 traffic when we do a lookup for an address in the Spark VMs; this sometimes causes the IPv4 requests to fail, and the client (.NET application) returns the above error message.

Resolution: The fix for this issue is disabling “no-negcache” at the VM level. This would only send 1 IPv6 request to the DNS.

==========

More comments from me:

This DNS issue with Ubuntu's dnsmasq would have affected any runtime (JVM, Python, or .Net) to some degree or another. The issue was not specific to the .Net runtime. The following were all false leads, and I want to make sure that is perfectly clear:

===========

As of now, the Synapse Spark configuration change for dnsmasq has NOT actually been deployed yet. It is delayed for some reason (a hiccup) that isn't being shared by the PG. However, I have tested a stand-alone VM, both with and without the customization of the default "no-negcache" configuration. That configuration option is certainly capable of causing problems for the .Net SqlClient. I have no doubt that this unusual configuration is responsible for the troubles in our Synapse Spark cluster.

On Synapse there is little we can do to influence the environment, not even so far as entering some custom IP addresses in the "hosts" file. We are really at the mercy of the product team to configure the OS and the languages/runtimes in a way that is appropriate.

I'm not sure exactly what would happen if a configuration that is appropriate for one particular customer is incompatible with the needs of another. I've heard rumors that there are ways for CSS and PG to "pin a VHD" to a specific Spark Pool in Synapse, but I haven't seen that happen myself. I'm not sure if they do that as a matter of course or just for exceptional cases. For the duration of this eight-month case, that procedure was never actually performed on any of our Spark Pools, but the topic came up once or twice.