Azure / azure-iot-sdk-csharp

A C# SDK for connecting devices to Microsoft Azure IoT services
Other
464 stars 493 forks source link

DeviceClient socket utilization issue #1098

Closed Perun85 closed 4 years ago

Perun85 commented 4 years ago

Description of the issue:

In application that serves as a gateway between PLCs and IoTHub strange socket utilization behavior is noticed after update of DeviceClient to latest version (from 1.20.1 to 1.21.1). PLCs are old and have basic HTTP support. They send messages that gateway application receives, transforms and sends to cloud. Number of messages can be from mandatory 6 up to 70 per minute (depending on conditions). Hub client is created once and stored in cache (AMQP connection pooling is not enabled). Currently there are around 100 devices connected to system.

Inspection of open sockets on server is done via TCPView. It looks like there are a lot of connection to HTTPS port apart from those "normal" ones on port 5671.

When client is reverted to version 1.20.1 socket utilization goes back to expected range.

Code sample exhibiting the issue:

I was not able to reproduce it with single client on development machine that runs Windows 10.

Screenshot

ServerSocketUtilization

timtay-microsoft commented 4 years ago

Can you provide a simplified version of the SDK code that you are using? Is it fundamentally just an application that sends Device to Cloud telemetry over AMQP? Or is it also doing other actions like twin/methods/file upload/etc

timtay-microsoft commented 4 years ago

And how often do you see this explosion of open ports? Does it consistently grow over time?

Perun85 commented 4 years ago

Only two features are used, device to cloud messages and reported properties. Here are the code samples.

Client creation

public DeviceClient CreateDeviceClient(DeviceIdentity deviceIdentity)
{
    if (deviceIdentity == null)
        throw new ArgumentNullException(nameof(deviceIdentity));

    if (deviceIdentity.Certificate == null)
        throw new ArgumentException($"Property {nameof(deviceIdentity.Certificate)} cannot be null.");

    if(string.IsNullOrEmpty(deviceIdentity.DeviceId))
        throw new ArgumentException($"Property {nameof(deviceIdentity.DeviceId)} cannot be null or empty string.");

    var authenticationMethod = new DeviceAuthenticationWithX509Certificate(deviceIdentity.DeviceId, deviceIdentity.Certificate);

    return DeviceClient.Create(configuration.Url, authenticationMethod, TransportType.Amqp);
}

Internal client implementation

public sealed class IoTHubClient : IIoTCloudClient
    {
        private bool disposed;

        private readonly DeviceClient deviceClient;
        private readonly IIoTHubMessageBuilder messageBuilder;
        private readonly ITwinCollectionBuilder twinCollectionBuilder;
        private readonly ISerializer serializer;

        public IoTCloudClientConnectionStatus ConnectionStatus { get; set; } = IoTCloudClientConnectionStatus.Disconnected;

        public IoTHubClient
        (
            IIoTHubDeviceClientFactory deviceClientFactory, 
            FridgePlcIdentity fridgePlcIdentity,
            IIoTHubMessageBuilder messageBuilder,
            ITwinCollectionBuilder twinCollectionBuilder,
            ISerializer serializer
        )
        {
            this.messageBuilder = messageBuilder;
            this.twinCollectionBuilder = twinCollectionBuilder;
            this.serializer = serializer;

            deviceClient = deviceClientFactory.CreateDeviceClient(fridgePlcIdentity);

            deviceClient.SetConnectionStatusChangesHandler((status, reason) => {
                if (status == DeviceClientConnectionStatus.Connected 
                    && reason == ConnectionStatusChangeReason.Connection_Ok)
                    ConnectionStatus = IoTCloudClientConnectionStatus.Connected;
            });
        }

        public Task OpenAsync() => deviceClient.OpenAsync();

        public Task CloseAsync() => deviceClient.CloseAsync();

        public Task InitializeAsync() => Task.CompletedTask;

        public Task SendMessageAsync<T>(T message, string messageName) where T: NotificationBase
        {
            if (message == null)
                throw new ArgumentNullException(nameof(message));

            if (string.IsNullOrEmpty(messageName))
                throw new ArgumentNullException(messageName);

            return SendMessageInternalAsync(message, messageName);
        }

        private async Task SendMessageInternalAsync<T>(T message, string messageName) where T: NotificationBase
        {
            var cloudMessage = messageBuilder.Build(message, messageName);
            await deviceClient.SendEventAsync(cloudMessage);
        }

        public Task UpdateReportedPropertiesAsync<T>(T reportedProperties)
        {
            if (reportedProperties == null)
                throw new ArgumentNullException(nameof(reportedProperties));

            return UpdateReportedPropertiesInternalAsync(reportedProperties);
        }

        private async Task UpdateReportedPropertiesInternalAsync<T>(T reportedProperties)
        {
            var serializedReportedProperties = serializer.Serialize(reportedProperties);
            var twinCollection = twinCollectionBuilder.Build(serializedReportedProperties);

            await deviceClient.UpdateReportedPropertiesAsync(twinCollection);
        }

        public void Dispose()
        {
            Dispose(true);
            GC.SuppressFinalize(this);
        }

        private void Dispose(bool disposing)
        {
            if (disposed)
                return;

            if (disposing)
                deviceClient.Dispose();

            disposed = true;
        }

        ~IoTHubClient() => Dispose(false);
    }

Network utilization diagrams

Yellow - period before server was allowed to allocate more sockets Red - period when application with latest client version was running Green - period when application with DeviceClient 1.20.1 was running

ServerNetworkUtilization

Here is a bit more detailed view of one of those periods when application with latest client version was running on server Details

timtay-microsoft commented 4 years ago

Are you tracking the connection status callback at all during this? If so, are you seeing any disconnected_retrying events?

davilu commented 4 years ago

@Perun85 AMQP port is 5671 not HTTPS(443). If add a filter of remote port 5671, you should able to see exactly 100 TCP connections which used by 100 device client. The question is where does those HTTPS ports came from? Are you using ProvisioningServiceClient? Do you mind to share IIoTHubDeviceClientFactory .CreateDeviceClient?

Perun85 commented 4 years ago

First I will try to describe the life cycle of our client, which is just thin wrapper around DeviceClient instance.

We create it once, upon initial message from PLC. At that point we call OpenAsync, send message, and place it in the cache (without closing it). When subsequent messages from same PLC come, we just take that client from cache, check its internal connection status (If it is disconnected we dispose of it and create new one) and send message.

There is small window of opportunity for two or more clients to be connected at the same time. When we receive two or more messages from one PLC in time span of 20 to 50 ms. That situation is handled by the cache. For each client it has only one slot, and on eviction of one instance (when client created in last message is placed in its slot) previous one gets evicted and its Dispose method is called. This is reason why we chose AMQP, because multiple connections from same device are possible (correct me if I am wrong).

So far with previous versions of DeviceClient this was working fine. When I updated it from 1.20.1 to 1.21.1 I started to notice this strange socket utilization issue.

@timtay-microsoft Today, I added logging to handler for connection status changed event but I haven't noticed disconnected_retrying states on multiple clients only on one retry_expired. That one is not yet registered in our CRM and in IoTHub.

@davilu

AMQP port is 5671 not HTTPS(443). If add a filter of remote port 5671, you should able to see exactly 100 TCP connections which used by 100 device client

NormalUtilization

This is normal socket utilization status when DeviceClient 1.20.1 is used. As expected it has one socket for every connection to remote 5671. (I didn't add filter so total count has included sockets used by other applications on server)

The question is where does those HTTPS ports came from?

That is my question as well :)

Are you using ProvisioningServiceClient?

No

Do you mind to share IIoTHubDeviceClientFactory .CreateDeviceClient?

public DeviceClient CreateDeviceClient(DeviceIdentity deviceIdentity)
{
    if (deviceIdentity == null)
        throw new ArgumentNullException(nameof(deviceIdentity));

    if (deviceIdentity.Certificate == null)
        throw new ArgumentException($"Property {nameof(deviceIdentity.Certificate)} cannot be null.");

    if(string.IsNullOrEmpty(deviceIdentity.DeviceId))
        throw new ArgumentException($"Property {nameof(deviceIdentity.DeviceId)} cannot be null or empty string.");

    var authenticationMethod = new DeviceAuthenticationWithX509Certificate(deviceIdentity.DeviceId, deviceIdentity.Certificate);

    return DeviceClient.Create(configuration.Url, authenticationMethod, TransportType.Amqp);
}
davilu commented 4 years ago

@Perun85 I didn't see any HTTPS port open in my test. Could you please use console command to check TCP connections:

// list all established HTTPS connection to your hub
netstat -na | find "your_hub_ip:443" | find "ESTABLISHED"
// count established HTTPS connection to your hub
netstat -na | find "your_hub_ip:443" | find "ESTABLISHED" /C
// list all established AMQP connection to your hub
netstat -na | find "your_hub_ip:4567" | find "ESTABLISHED"
// count established AMQP connection to your hub
netstat -na | find "your_hub_ip:5671" | find "ESTABLISHED" /C

Meanwhile could you please capture ETL log? Please refer to https://github.com/Azure/azure-iot-sdk-csharp/tree/master/tools/CaptureLogs

Perun85 commented 4 years ago

@davilu

Meanwhile could you please capture ETL log

Here are the socket counts and event statistics.

DeviceClient 1.20.1 SocketCountNormalUtilization EventStatisticsNormalUtilization

DeviceClient 1.21.1 SocketCountDuringIssue EventStatisticsDuringIssue

Socket count for DeviceClient 1.21.1 is taken approximately 90 seconds before process crashed.

Would you like me to upload ETL files? (compressed they are around 250 MB)

davilu commented 4 years ago

@Perun85 do you have event for "Microsoft-Azure-Devices-Device-Client"? We'd like to see what's going on in the SDK not socket event.

This is reason why we chose AMQP, because multiple connections from same device are possible (correct me if I am wrong). What do you mean? There should be only one instance for the same device id for any protocol. AMQP has pooling support which allows multi devices sharing the same AMQP connection.

davilu commented 4 years ago

@Perun85 Please don't share any personal information here including IP address, HUB name or authentication info. To upload log files please create a support request following https://docs.microsoft.com/en-us/azure/azure-supportability/how-to-create-azure-support-request

Perun85 commented 4 years ago

@davilu

Do you have event for "Microsoft-Azure-Devices-Device-Client"? We'd like to see what's going on in the SDK not socket event.

Sorry, I assumed you wanted to see what's going on with sockets. Currently it is night in my timezone, and PLCs are unattended. I will reproduce issue in the morning and get back to you with results.

What do you mean? There should be only one instance for the same device id for any protocol. AMQP has pooling support which allows multi devices sharing the same AMQP connection.

It is possible to have multiple instances of DeviceClient over AMQP connected as same device. This is not possible over MQTT. In this solution it is used to resolve rare race condition cases without loosing messages. Periods while we have two clients active for single device are bellow 300 ms and as I said they are really rare. This does not mean we are using one registered device for every PLC. Each one has its own.

Perun85 commented 4 years ago

Here are entries for InternalClient from Enter section.

Incident

InternalClientDuringIncident

Normal

InternalClientNormal

During incident it is noticeable that a lot of internal clients get Disconnected.Retry_Expired, something that does not occur during normal execution.

Does fact that we are using CA signed certificates for authentication play significant role? Should I check for any specific event from Microsoft-Azure-Devices-Device-Client provider?

davilu commented 4 years ago

@Perun85 The ETL log looks different from the sample code. Do you have any customized retry policy configured? And there were 2 transport settings to device client. Could you please double check what they are? Could you please share error message events as well?

Perun85 commented 4 years ago

@davilu

The ETL log looks different from the sample code.

There is actually one difference, @timtay-microsoft asked me to check if I have clients in disconnected_retrying state. Because of it, I added else clause and call to LogInformation method. Apart from it, everything is the same.

 deviceClient.SetConnectionStatusChangesHandler((status, reason) => {
                if (status == DeviceClientConnectionStatus.Connected  && reason == ConnectionStatusChangeReason.Connection_Ok)
                    ConnectionStatus = IoTCloudClientConnectionStatus.Connected;
                else
                    logger.LogInformation($"Connection status: {status} Reason: {reason} DeviceId: {deviceId}");
});

Do you have any customized retry policy configured?

No, there are no customized retry policies.

And there were 2 transport settings to device client. Could you please double check what they are

I use this overload of DeviceClient.Create method

 DeviceClient Create(string hostname, IAuthenticationMethod authenticationMethod, TransportType transportType);

I don't pass any TransportSettings myself,

Could you please share error message events as well?

Client 1.20.1 All of those Microsoft.Azure.Devices.Client.Exceptions.UnauthorizedException are originating from one PLC that is not registered in our CRM and corresponding test IoTHub.

image

Client 1.21.1 Here there is new exception type Microsoft.Azure.Devices.Client.Exceptions.IotHubCommunicationException with following message condition:amqp:unauthorized-access. That unregistered PLC does not have enough processing power to produce this amount of messages over given sampling period. image

davilu commented 4 years ago

@Perun85 There is 2 bugs in release 1.21.1 might cause the issue. One of them is incorrectly mapped exception to IotHubCommunicationException which will cause device client unexpected to keep reconnecting. The other one is no operation will be accepted during reconnecting. The fix is already checked in to master and we have a release soon(ETA by end of this week). As a work around, you could call device client Dispose function instead of CloseAsync in the cleanup code. It will be well appreciated if you would like to have a try with our master code without work around.

Perun85 commented 4 years ago

@davilu

This morning I checked out the master branch and built the DeviceClient from it. Everything else remained the same. After deploying it, it is noticeable that we don't have HTTPS connections to IoTHub any more, but the count of AMQP connections was increasing as the time was passing by.

After that, small code change was introduced. I went with your suggestion and use Dispose instead of CloseAsync, but this unfortunately didn't help.

On test environment today we had around 90 active PLCs and after three hours of running the application (one with Dispose instead of Close) we ended up with 493 AMQP connections.

image

image

Would you like me to collect performance logs?

davilu commented 4 years ago

@Perun85 Yes please collect ETL log and send me a screen shot of exceptions. Meanwhile could you please also share the code to manage device client instances including creating and evicting device client instance. The last option could try with is disable retry policy by calling deviceClient.SetRetryPolicy(new NoRetry()) to avoid duplicated device client instances playing pingpong, but the side effect is SDK will not retry any operations internally. You might get more failure if no retry policy is applied. By the way, seems root cause of this issue is you have a expired X509 certificate in the system, if it get fixed, the issue shouldn't happen.

Perun85 commented 4 years ago

@davilu

Certificate is not expired, but you are correct unregistered device caused this issue. I was able to reproduce it with simple test app that tries to connect to IoTHub over AMQP with non existing device ID using certificate authentication.

DeviceClient 1.21.1 When this version was used I was able to detect sockets with remote endpoint connected to 5671 and 443.

DeviceClient (latest from master) When this version was used I didn't get 443 connections but even after Dispose was called on deviceClient instance, socket remained opened and connected to remote hosts 5671 port.

Here is the code I used

static async Task Main(string[] args)
        {
            foreach(var i in Enumerable.Range(1, 20))
            {
                Task.Run(async () => {

                    var cert = new X509Certificate2("YourCert.pfx", "YourCertPass");

                    var authentication = new DeviceAuthenticationWithX509Certificate("WrongDevice", cert);
                    DeviceClient deviceClient = DeviceClient.Create("YourHubUrl", authentication, TransportType.Amqp);
                    try
                    {
                        await deviceClient.OpenAsync();
                        await deviceClient.CloseAsync();
                        deviceClient.Dispose();

                        Console.WriteLine($"No exception detected.");
                    }
                    catch (Exception)
                    {
                        deviceClient.Dispose();
                        Console.WriteLine("Exception detected.");
                    }
                });

                await Task.Delay(1000);
            }

            Console.ReadKey();            
        } 
davilu commented 4 years ago

@Perun85 Thanks for sharing the sample. I'll re-pro the issue and give you an update shortly.

davilu commented 4 years ago

@Perun85 Thanks for the help to address this issue. I've already found the root cause and have a fix ready on my fork. I've done a smoke test and it fixed TCP leak. Please have a try. We've already started release process yesterday, so we can't promise the fix will be squashed into this release which is device client 1.21.2. I'll let you know how it goes by end of the day. The code with fix is here: https://github.com/davilu/azure-iot-sdk-csharp/tree/fix-AMQP-connection-leak

Perun85 commented 4 years ago

@davilu Thank you for not giving up.

I built DeviceClient from this fork https://github.com/davilu/azure-iot-sdk-csharp/tree/fix-AMQP-connection-leak and the issue manifested itself again . It starts again to open https connections to IoTHub, which is not the case when I built it from current master branch.

For me it is not that important that we have fix in next release. We can always work with previous version of library.

davilu commented 4 years ago

@Perun85 My test shows no 443 or 5671 port leaks. Could you please double check what branch were you using from my fork? Please use command git branch -vv to check the branch is 'fix-AMQP-connection-leak', not master. We'll keep you updated. Thanks for the support.

davilu commented 4 years ago

@Perun85 The fix has been merged into master and will be released as 1.21.2.

Perun85 commented 4 years ago

The issue with my test was related to cached version of NuGet package.

This morning, I used latest version of DeviceClient on our test environment and system was behaving as expected. Socket utilization returned to its normal range.

If you are OK with it, I would like to close this ticket.

Thanks a lot for the help.

davilu commented 4 years ago

@Perun85 Thanks for your support. I'm glad the issue get addressed. The device client release 1.21.2 is on its way.

az-iot-builder-01 commented 4 years ago

@Perun85, @davilu, thank you for your contribution to our open-sourced project! Please help us improve by filling out this 2-minute customer satisfaction survey