dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.46k stars 4.76k forks source link

IPGlobalProperties throws generic Exception on 'high load' error code from WinAPI GetTcpTable() #29560

Open dbeinder opened 5 years ago

dbeinder commented 5 years ago

When the Windows TCP connections table changes rapidly, the Windows API call GetTcpTable can return STATUS_UNSUCCESSFUL (0xc0000001). This return value is currently used by .NET to throw a generic NetworkInformationException which, when printed, only displays Unknown error (0xc0000001).

GetTcpTable is used internally by .NET in the System.Net assembly: IPGlobalProperties.GetActiveTcpConnections() IPGlobalProperties.GetActiveTcpListeners()

The WinAPI documentation for GetTcpTable is short on guarantees, but recommends simply repeating the call. Can someone with deeper knowledge of the Windows API comment if it would be safe for .NET to retry until the call succeeds? If that risks major lock-ups, NetworkInformationException should be extended to give users a flag to allow them to implement their own retry logic. For now, this can be done by checking the ErrorCode of the exception manually against against 0xc0000001.

The same issue exists in .NET Core and .NET Framework, since the code is virtually identical.

Curiously, the WinApi documentation does not mention this behaviour for similar calls, like GetExtendedTcpTable, GetUdpTable, GetExtendedUdpTable which are also used by .NET - again, it would be great to know if these functions are safe, or if it is simply an oversight in the documentation.

huangbq4gh commented 5 years ago

I got the same error also in 'high load' situation (10k+ connections and frequently changing).

2019-01-08 08:23:52,658 Unknown error (0xc0000001), Traceback= at System.Net.NetworkInformation.SystemIPGlobalProperties.GetAllTcpConnections() at System.Net.NetworkInformation.SystemIPGlobalProperties.GetActiveTcpConnections()

This error is not consistent, which makes it kinda annoying to troubleshoot.

        public static int GetAvailablePort(int startingPort)
        {
            IPEndPoint[] endPoints;
            List<int> portArray = new List<int>();

            IPGlobalProperties properties = IPGlobalProperties.GetIPGlobalProperties();

            //getting active connections
            TcpConnectionInformation[] connections = properties.GetActiveTcpConnections();
            portArray.AddRange(from n in connections
                               where n.LocalEndPoint.Port >= startingPort
                               select n.LocalEndPoint.Port);
...
scalablecory commented 5 years ago

I'm betting the error you're getting is an optimistic concurrency check. If so, it should be safe to call in a loop without negatively impacting the system. Are you able to try it and see?

Will throw out some feelers to get a more specific answer.

dbeinder commented 5 years ago

I have not been successful in triggering this behavior by creating artificial loads - but I have now started a test program that will keep polling until it gets the exception and will then count the number of retries and the time it takes on a server where I have seen the issue before.

dbeinder commented 5 years ago

It took 3 days of polling GetActiveTcpConnections() for the issue to crop up again. After detecting a 0xc0000001 exception, a repeat following immediately was successful.

So I guess there are two options to resolve this:

  1. Build the repeat-after-failure behavior directly into the .NET implementation
  2. Extend NetworkInformationException to give a more actionable error message and mention in the .NET IPGlobalProperties documentation that this high load exception is possible

I believe the second option would simply lead to everyone copying boilerplate retry code whenever they want to use one of the IPGlobalProperties .GetXyzConnections() functions. I can't think of a scenario when I wouldn't want to retry immediately but rather abandon the attempt to get those connections.

If we can get a slightly stronger guarantee from someone with WinApi knowledge that retrying is not going to end up in a endless loop, I think we should definitely opt for option 1.

scalablecory commented 5 years ago

If we add a loop, we'd have it give up after some tries, so the exception might still happen. Because of this, I think documentation needs to be updated either way.

dbeinder commented 5 years ago

That would depend on how GetTcpTable is implemented behind the scenes, if it is really just multiple cores competing fairly for access to the table, then the worst case chance of 'winning' may be as high as 1/NR_OF_CORES. In that case retrying without limit might be justified if it is all but guaranteed to work out after 4-16 attempts.

If it is more complicated than that, you're right a limit or better yet a timeout would become necessary - and the case for leaving retrying up to the user becomes much stronger.

If we do expand NetworkInformationException to include a more specific flag for a high load condition - so that it isn't just an error string - that would have to be adapted on all the OS specific implementations and would make this a much larger fix.

scalablecory commented 5 years ago

if it is really just multiple cores competing fairly for access to the table, then the worst case chance of 'winning' may be as high as 1/NR_OF_CORES. In that case retrying without limit might be justified if it is all but guaranteed to work out after 4-16 attempts.

This assumes too much about the state of the system. Someone is winning, but without a fair ticketing system it might never be you. Load, priority, etc. might affect this.

CIPop commented 5 years ago

+1 I have captured a full-mem dump.

scalablecory commented 5 years ago

Looking at the code, this makes some more sense:

If you receive this return code then calling the function again is usually enough to clear the issue and get the desired result. This return code can be a consequence of the system being under high load. For example, if the size of the TCP connection table changes by more than 2 additional items 3 consecutive times.

If true, we will need to add 3 new table entries at just the right moment, 3 times in a row to trigger this. However, the code doesn't actually seem to be returning STATUS_UNSUCCESSFUL when this happens so I'm not 100% convinced this documentation is correct. Will work on a repro.

karelz commented 5 years ago

Triage: Looks like something we should document that we can throw out of that API IPGlobalProperties.GetActiveTcpConnections.

jjxtra commented 3 weeks ago

Issues still exists in .NET 8.