Closed afigelj closed 2 years ago
@dotnet/http Could this be a bug in Sockets?
Exception comes from here: https://github.com/dotnet/runtime/blob/57bfe474518ab5b7cfe6bf7424a79ce3af9d6657/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs#L116 or the 5.0 version: https://github.com/dotnet/runtime/blob/release/5.0/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs#L115
// Using public SafeSocketHandle(IntPtr) a user can add the same handle
// from a different Socket instance.
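To make that comment concrete, here is a hedged, illustrative sketch (not code from the runtime) of how the public `SafeSocketHandle(IntPtr, bool)` constructor lets two `Socket` instances wrap the same OS handle. On Unix, when both start async I/O, both try to register the same fd with the epoll-based `SocketAsyncEngine`, which is where the linked `InvalidOperationException` comes from.

```csharp
using System.Net.Sockets;

// Illustrative sketch only: two Socket instances wrapping the same OS handle.
var original = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);

// The public SafeSocketHandle(IntPtr, bool ownsHandle) constructor lets a
// caller wrap an existing handle, so nothing prevents creating a second
// Socket over the same fd:
var duplicate = new Socket(new SafeSocketHandle(original.Handle, ownsHandle: false));

// Both sockets now refer to the same fd. If each one starts an async
// operation, each tries to register the fd with the SocketAsyncEngine, and
// the second registration fails ("Handle is already used by another
// Socket"). If both wrappers claim ownership, disposing both is a
// double-free of the fd.
```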
Is it something ASP.NET might use? @antonfirsov @geoffkizer any further thoughts?
As an aside, I don't understand the explanation here: https://github.com/dotnet/aspnetcore/blob/18a926850f7374248e687ee64390e7f10514403f/src/Servers/Kestrel/Transport.Sockets/src/SocketTransportOptions.cs#L100-L108 How is this not setting up a potential double-free handle/fd problem?
Is it something ASP.NET might use?
We use it only for the listen socket to support systemd socket activation. @afigelj Are you calling .UseSystemd() or .ListenHandle() anywhere? I'm guessing not. In any case, that should only affect the listen socket, right? This is an accepted socket.
How is this not setting up a potential double-free handle/fd problem?
This does seem like a problem. Even for the systemd socket activation, the manpage specifically says
It should not invoke shutdown(2) on sockets it got with Accept=no,
https://www.freedesktop.org/software/systemd/man/systemd.socket.html#Accept=
Accept=no is the only scenario we support 😢. In context, that might only apply to AF_UNIX sockets (Unix domain sockets), which aren't super common even in systemd socket activation scenarios. So maybe that's why we haven't had problems reported related to this. There's also the possibility of people calling KestrelServerOptions.ListenHandle() directly who don't expect us to close the listen handle on shutdown.
I didn't like this the last time it was touched either, but we'd need to rework our accept loop logic or transports themselves to use CancellationTokens to stop the loop rather than relying on closing the listen socket to do so. This is probably something we should do anyway, but I'm not aware of anyone filing issues related to this, so for now it's mostly a code correctness thing.
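The rework described above can be sketched as follows, under the assumption that the transport receives a CancellationToken for shutdown. `RunAcceptLoopAsync` and `HandleConnectionAsync` are illustrative names, not Kestrel internals.

```csharp
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

// Sketch: stop the accept loop via a CancellationToken instead of closing
// the listen socket out from under it.
static async Task RunAcceptLoopAsync(Socket listenSocket, CancellationToken token)
{
    while (!token.IsCancellationRequested)
    {
        try
        {
            // Socket.AcceptAsync(CancellationToken) (available since .NET 5)
            // lets cancellation unblock the pending accept, so shutdown no
            // longer needs to close a listen socket it may not own.
            Socket connection = await listenSocket.AcceptAsync(token);
            _ = HandleConnectionAsync(connection, token);
        }
        catch (OperationCanceledException)
        {
            break; // Graceful stop; the listen handle is left untouched.
        }
    }
}

// Hypothetical per-connection handler standing in for the real pipeline.
static Task HandleConnectionAsync(Socket connection, CancellationToken token)
{
    connection.Dispose();
    return Task.CompletedTask;
}
```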
We use it only for the listen socket to support systemd socket activation. @afigelj Are you calling .UseSystemd() or .ListenHandle() anywhere? I'm guessing not. In any case, that should only affect the listen socket, right? This is an accepted socket.

No, we're not using any of these methods. Simply service/API/repository calls from the web page.
I'm not super-familiar with memory models, but shouldn't the read of _asyncEngine be volatile here to ensure it always happens after the lock is taken?
https://github.com/dotnet/runtime/blob/a036594cb3766f70db9f5c6e7681454e6894da50/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs#L1235-L1239
shouldn't the read of _asyncEngine be volatile here to ensure it always happens after the lock is taken?
Entering the lock has read-acquire semantics. No reads from inside the lock can move to be before the lock.
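The acquire-semantics point can be illustrated with a simplified stand-in for the pattern in SocketAsyncContext (the class and member names below are illustrative, not the real code):

```csharp
// A read performed inside a lock cannot be reordered to before
// Monitor.Enter, so a plain (non-volatile) read of the field under the
// lock is safe.
class EngineHolder
{
    private readonly object _gate = new object();
    private object? _asyncEngine;

    public object GetOrCreateEngine()
    {
        lock (_gate)
        {
            // This read happens-after the lock acquisition: any assignment
            // made by another thread under the same lock is visible here,
            // without needing Volatile.Read.
            return _asyncEngine ??= new object();
        }
    }
}
```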
Other (weird...) cases where people seem to hit this:
https://stackoverflow.com/questions/67912596/warn-microsoft-aspnetcore-server-kestrel0-connection-processing-ended-abnorma https://zblesk.net/blog/aspnetcore-identity-litedb-breaks-on-ubuntu/
@antonfirsov @stephentoub Are you both okay with transferring this to the runtime repo since there doesn't appear to be any usage of SafeSocketHandle(IntPtr) in the app?
Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.
Author: | afigelj |
---|---|
Assignees: | halter73 |
Labels: | `area-System.Net`, `untriaged` |
Milestone: | - |
Triage: since this has been happening since 2.2, it's not a regression. We should investigate for 7.0 whether it's our issue or not.
Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.
Author: | afigelj |
---|---|
Assignees: | halter73 |
Labels: | `area-System.Net.Sockets` |
Milestone: | 7.0.0 |
This issue is happening here, fresh logs and further information at the end.
This caught me by surprise, especially seeing that a fix is not planned for years (??).
Is there a workaround? Perhaps a patch that we could apply and keep until a proper fix is mainlined? I wouldn't have a problem doing that, we have a build system in place for other things.
Or should we drop using linux servers for .NET web applications altogether? I'm really at a loss here. Help!
BTW, found another user, @zblesk, having the same issue on Ubuntu, but he didn't follow up: https://github.com/dotnet/aspnetcore/issues/30787
I'm wondering if this issue is not far more common than it appears to be.
For example, our customer's initial error report pointed at Nginx, as the 502 error was shown there. Users might be looking in the wrong direction or they might simply give up.
Anyway, kestrel error log attached, additional information:
[user@dotnet ~]# dotnet --info
.NET SDK (reflecting any global.json):
Version: 5.0.401
Commit: 4bef5f3dbf
Runtime Environment:
OS Name: centos
OS Version: 7
OS Platform: Linux
RID: centos.7-x64
Base Path: /usr/share/dotnet/sdk/5.0.401/
Host (useful for support):
Version: 5.0.10
Commit: e1825b4928
@server-si it is not clear if it is a .NET problem or something else. We do not understand the root cause. I believe we will need a repro, or someone willing to investigate with a repro at hand. Are you in a position to reproduce the problem in your environment in some regular manner?
@server-si eventually I found out it was caused by an external library. These people were investigating something similar and also mentioned this issue because they found the related blog post, linked above. I don't know why the library was breaking the pipeline, and only on Linux; I won't be investigating it, either. I just stopped using it.
Thanks @zblesk for the update -- just FYI, the Twitter thread you linked is @antonfirsov above from our (Networking) team. Also, summarizing your story here: you removed the library AspNetCore.Identity.LiteDB and it hasn't happened since then -- you describe it in your blog post. Note: that doesn't necessarily mean the library is to blame; it may have stumbled on a bug in .NET -- something we would still need to chase down and root-cause.
To reason about the root cause, we need to see the call stack and the program's state around the first handle registration that doesn't throw.
Ideally, we should be able to get this info by asking users to enable tracing, but unfortunately we are not logging the handle value anywhere. We need to provide private builds with extended logging, or exchange debug info with customers in other ways. @server-si if you can reproduce this consistently, any chance you are willing to work with us privately?
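For reference, "enabling tracing" can be done in-process with an EventListener subscribed to the "System.Net.Sockets" EventSource that ships with .NET 5+. As noted above, the emitted events do not include the handle value, which is why private builds with extra logging would be needed; this sketch just shows the mechanism.

```csharp
using System;
using System.Diagnostics.Tracing;

// Sketch: forward System.Net.Sockets telemetry events to the console.
sealed class SocketsEventListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "System.Net.Sockets")
            EnableEvents(source, EventLevel.LogAlways);
    }

    protected override void OnEventWritten(EventWrittenEventArgs e)
    {
        // Dump event name and payload; real diagnostics would forward these
        // to a log sink instead of the console.
        var payload = e.Payload is null ? "" : string.Join(", ", e.Payload);
        Console.WriteLine($"{e.EventName}: {payload}");
    }
}
```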
Thanks for responding, everybody.
I'm afraid we still can't consistently reproduce the issue. We've been stressing the test server, mainly using Selenium IDE, but have so far been unable to trigger the error. The errors on the production server are seemingly random.
Since last week, the production server is down to a single error occurrence daily, if at all. The frequency of errors was significantly higher not so long ago. That's a welcome improvement, but we don't fully understand what caused it, nor can we be sure it is permanent.
Besides the usual moving parts such as the user interaction and the OS and the .NET updates, the application code also got a potentially significant change. I'm mainly dealing with infrastructure so I'll invite the application author to provide further details about that.
I would gladly run custom .NET builds and privately provide any other debugging information. The problem I see right now is that we can't willingly trigger the error, despite actively trying.
I just want to inform everyone that we've been extensively testing the application on the test server for the past few days. We found some clues, but they are not conclusive enough for now. I'll get back when we have some more info.
You removed library AspNetCore.Identity.LiteDB and it didn't happen since then
Based on what we found in https://github.com/dotnet/runtime/issues/64305, https://github.com/fabian-blum/AspNetCore.Identity.LiteDB/pull/14 will fix the issue in AspNetCore.Identity.LiteDB.
It seems like this was an external issue. We can re-open if there is evidence otherwise.
We encountered random 502 errors with a .NET Core 2.2 web application running on CentOS behind Nginx. Thinking the issue was related to version 2.2, we upgraded the application to .NET 5.0. The errors still happen randomly, just at more irregular intervals.
Solutions suggested on the web for 502 errors indicated the problem could be related to longer loading times of data from the database. We optimized all queries in the API using execution plan analysis, but the errors kept appearing. While examining the access and Kestrel logs, we found one irregularity. Below are the Kestrel logs of two identical web page requests less than 2 minutes apart. The first request returned status 200, the second 502. The difference is after the line xxx.DfaMatcher[1005]: the first log shows the line "Endpoint … is valid for the request path", but the second log is missing this entry.
As the second log shows, there were 14 seconds during which Kestrel didn't process anything (even though the user reloaded the web page 4 times). The first entry after that gap is "Starting HttpMessageHandler cleanup cycle". After that, the 4 web page requests produced 502, which Kestrel shows as: Connection id "xxx" communication error.
We were also surprised to see the API log, where the 4 subsequent requests from the web page (producing 502) were processed by the API with status 200. As a side note: we never encountered any similar issue on the development computer (Win 10 LTSB, Visual Studio 2019 Pro, using IIS) for the past two years.
Access.log:

Response 200
213.157.249.198 - - [16/Jun/2021:08:17:18 +0200] "GET /Admin/Measurement/Detail/5aa3ce5c-612b-4202-896a-f03d20af7d62 HTTP/1.1" 200 8025 "https://my-domain.com/Admin/Measurement" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36"

Response 502
213.157.249.198 - - [16/Jun/2021:08:18:55 +0200] "GET /Admin/Measurement/Detail/5aa3ce5c-612b-4202-896a-f03d20af7d62 HTTP/1.1" 502 552 "https://my-domain.com/Admin/Measurement" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36"
Kestrel log:

Response 200
Jun 16 08:17:18 .NET Core (dotnet)[551]: dbug: Microsoft.AspNetCore.Server.Kestrel[39]
Jun 16 08:17:18 .NET Core (dotnet)[551]: Connection id "0HM9GAMVBTK15" accepted.
Jun 16 08:17:18 .NET Core (dotnet)[551]: dbug: Microsoft.AspNetCore.Server.Kestrel[1]
Jun 16 08:17:18 .NET Core (dotnet)[551]: Connection id "0HM9GAMVBTK15" started.
Jun 16 08:17:18 .NET Core (dotnet)[551]: dbug: Microsoft.AspNetCore.Server.Kestrel.Https.Internal.HttpsConnectionMiddleware[3]
Jun 16 08:17:18 .NET Core (dotnet)[551]: Connection "0HM9GAMVBTK15" established using the following protocol: Tls12
Jun 16 08:17:18 .NET Core (dotnet)[551]: info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
Jun 16 08:17:18 .NET Core (dotnet)[551]: Request starting HTTP/1.1 GET https://my-domain.com/Admin/Measurement/Detail/5aa3ce5c-612b-4202-896a-f03d20af7d62 - -
Jun 16 08:17:18 .NET Core (dotnet)[551]: dbug: Microsoft.AspNetCore.StaticFiles.StaticFileMiddleware[4]
Jun 16 08:17:18 .NET Core (dotnet)[551]: The request path /Admin/Measurement/Detail/5aa3ce5c-612b-4202-896a-f03d20af7d62 does not match a supported file type
Jun 16 08:17:18 .NET Core (dotnet)[551]: dbug: Microsoft.AspNetCore.Routing.Matching.DfaMatcher[1001]
Jun 16 08:17:18 .NET Core (dotnet)[551]: 1 candidate(s) found for the request path '/Admin/Measurement/Detail/5aa3ce5c-612b-4202-896a-f03d20af7d62'
Jun 16 08:17:18 .NET Core (dotnet)[551]: dbug: Microsoft.AspNetCore.Routing.Matching.DfaMatcher[1005]
Jun 16 08:17:18 .NET Core (dotnet)[551]: Endpoint 'Page: /Measurement/Detail' with route pattern 'Admin/Measurement/Detail/{id:Guid?}/{handler?}' is valid for the request path '/Admin/Measurement/Detail/5aa3ce5c-612b-4202-896a-f03d20af7d62'
Response 502
- Missing line: Endpoint 'xxx' is valid for the request path
Nginx error log:

2021/06/16 08:18:55 [error] 1883#0: *3243 upstream prematurely closed connection while reading response header from upstream, client: 213.157.249.198, server: my-domain.com, request: "GET /Admin/Measurement/Detail/5aa3ce5c-612b-4202-896a-f03d20af7d62 HTTP/1.1", upstream: "https://[::1]:6000/Admin/Measurement/Detail/5aa3ce5c-612b-4202-896a-f03d20af7d62", host: "my-domain.com", referrer: "https://my-domain.com/Admin/Measurement/Email/5aa3ce5c-612b-4202-896a-f03d20af7d62"
2021/06/16 08:18:55 [error] 1883#0: *3243 SSL_do_handshake() failed (SSL: error:140940F5:SSL routines:ssl3_read_bytes:unexpected record) while SSL handshaking to upstream, client: 213.157.249.198, server: my-domain.com, request: "GET /Admin/Measurement/Detail/5aa3ce5c-612b-4202-896a-f03d20af7d62 HTTP/1.1", upstream: "https://127.0.0.1:6000/Admin/Measurement/Detail/5aa3ce5c-612b-4202-896a-f03d20af7d62", host: "my-domain.com", referrer: "https://my-domain.com/Admin/Measurement/Email/5aa3ce5c-612b-4202-896a-f03d20af7d62"
Any ideas where to look for the cause of the issue would be appreciated.