This is possibly related to/the same issue as https://github.com/DuendeSoftware/Support/issues/1361.
Can you share more details about your architecture? What database are you using, and how are you implementing the identityserver stores (are you using our EF package?)
In the other timeouts we are investigating, this seems to be related to the system time - occurring right at 12:00 UTC. Is that also your experience?
We are using SQL Server for the database and Azure Blob Storage for data protection, and we are using your EntityFramework package (Duende.IdentityServer.EntityFramework v6.1.7). This was happening approximately once per day in our dev/test environments for several weeks, and we never could pin down any pattern in the timing of the outages. We got desperate and purged our Keys table from SQL Server as well as our data protection blob, and so far we've gone about a week and a half with no issues.
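For reference, here is a minimal sketch of how that combination is typically wired up in Program.cs, assuming the Duende EF stores and the Azure.Extensions.AspNetCore.DataProtection.Blobs provider; the connection string name, the blob SAS URI setting, and the omitted client/resource registrations are placeholders rather than our actual configuration:

```csharp
// Minimal sketch (not our exact code): Duende EF stores on SQL Server plus
// data protection keys persisted to an Azure Storage blob.
using Microsoft.AspNetCore.DataProtection;
using Microsoft.EntityFrameworkCore;

var builder = WebApplication.CreateBuilder(args);

var connectionString = builder.Configuration.GetConnectionString("IdentityServer"); // placeholder name

builder.Services.AddIdentityServer()
    .AddConfigurationStore(options =>
        options.ConfigureDbContext = b => b.UseSqlServer(connectionString))
    .AddOperationalStore(options =>
        options.ConfigureDbContext = b => b.UseSqlServer(connectionString));
// ...client, resource, and ASP.NET Identity registrations omitted from this sketch.

// Keep the data protection key ring in a shared blob so all instances can read it.
builder.Services.AddDataProtection()
    .PersistKeysToAzureBlobStorage(new Uri(builder.Configuration["DataProtection:BlobSasUri"]!)); // placeholder setting

var app = builder.Build();
app.UseIdentityServer();
app.Run();
```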
Thanks for these details - they're very interesting. Do you have a record of the purged data, or maybe you remember a bit about how much data was deleted, or logs from back when you were experiencing the problem?
Also, I notice that the version number you mention for our EF store implementation doesn't match your IdentityServer version. Generally we don't recommend mixing versions. In this case it may have worked largely by luck, in that there didn't happen to be any database migrations between those versions, but all of our engineering and testing effort goes into combinations of packages with identical version numbers.
To be clear, I think this is a separate point, because deleting your signing keys and data protection keys resolved the problem for you, and because other users who probably aren't mixing versions are seeing it as well.
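For anyone reading along, keeping the packages aligned just means both references carry the same version number in the project file; a fragment for illustration (the version shown is only an example):

```xml
<!-- Illustrative only: both Duende packages pinned to the same version. -->
<ItemGroup>
  <PackageReference Include="Duende.IdentityServer" Version="6.2.3" />
  <PackageReference Include="Duende.IdentityServer.EntityFramework" Version="6.2.3" />
</ItemGroup>
```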
> Thanks for these details - they're very interesting. Do you have a record of the purged data, or maybe you remember a bit about how much data was deleted, or logs from back when you were experiencing the problem?
This is actually the most important thing. Your symptoms sound like they could be related to our key management logic - the way that it unprotects data-protected data, generates new keys when none can be read, and delays the usage of the very first signing key to try to avoid inconsistencies when running in a scaled-out environment.
So, if you have any record of how many keys existed when you purged, or if data protection/cryptographic exceptions were being thrown around the time of the timeouts (in your logs), that would be very interesting to me.
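For reference, the rotation/propagation behavior described here is controlled by the key management options on IdentityServerOptions; a sketch of where those knobs live, with the values shown being illustrative rather than recommendations:

```csharp
// Inside Program.cs. Illustrative only: the options that drive automatic
// signing key management. Values are examples, not recommendations.
builder.Services.AddIdentityServer(options =>
{
    options.KeyManagement.Enabled = true;                             // automatic key management (on by default)
    options.KeyManagement.RotationInterval = TimeSpan.FromDays(90);   // how often a new signing key is created
    options.KeyManagement.PropagationTime = TimeSpan.FromDays(14);    // announce a new key before using it, for scaled-out deployments
    options.KeyManagement.RetentionDuration = TimeSpan.FromDays(14);  // keep retired keys so outstanding tokens still validate
    options.KeyManagement.DeleteRetiredKeys = true;                   // remove keys once the retention window has passed
});
```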
We'll get our versions lined up. I did notice that our EF package wasn't the same version, but it wasn't something that we had changed recently so I didn't feel it had a strong case to be the culprit.
In our dev environment, we purged 14 keys from the Keys table in the SQL Server database and restarted the app. It immediately created a new record and now after a week there are two records in the Keys table.
I can take a look in the logs tomorrow to see what I can find, but I don't recall any exceptions. We would have noted those while we were troubleshooting the issue.
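If it recurs, something like the following hypothetical diagnostic (class and method names invented; it assumes the Duende EF operational store's PersistedGrantDbContext) could record how many key records exist and how old they are before anything gets purged:

```csharp
// Hypothetical diagnostic (names invented): logs how many signing key records
// the operational store holds and when each was created.
using System.Linq;
using System.Threading.Tasks;
using Duende.IdentityServer.EntityFramework.DbContexts;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Logging;

public static class KeyTableDiagnostics
{
    public static async Task DumpKeysAsync(PersistedGrantDbContext db, ILogger logger)
    {
        var keys = await db.Keys
            .OrderBy(k => k.Created)
            .ToListAsync();

        logger.LogInformation("Signing key records: {Count}", keys.Count);

        foreach (var key in keys)
        {
            logger.LogInformation(
                "Key {Id}: created {Created:u}, use {Use}, algorithm {Algorithm}, data-protected {DataProtected}",
                key.Id, key.Created, key.Use, key.Algorithm, key.DataProtected);
        }
    }
}
```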
@mcolebiltd Have you been able to solve the issue? If so I would like to close it. If not: feel free to post your findings.
@RolandGuijt When we had issues, we purged our Keys table in SQL Server and our data protection blob in Azure Storage. This did fix the issue. However, we don't really know what caused this, which makes us nervous.
@mcolebiltd The problems could have been caused by the version mismatch. Is everything still running as desired at the moment?
@RolandGuijt everything is currently running as desired. I will monitor the other thread for a resolution to the issue as a whole. I don't believe the version mismatch caused any issues because it was mismatched for a year and a half before we ran into this.
Which version of Duende IdentityServer are you using? 6.2.3
Which version of .NET are you using? .NET 6
Describe the bug
When opening /.well-known/openid-configuration in the browser, it returns 504 Gateway Timeout. This does not happen all the time, though. Approximately once per day something happens where all requests lock up and we are unable to access certain endpoints. Our app is deployed as an Azure App Service, and once we restart it, everything seems to work for another several hours until it happens again. The Log Stream on the server doesn't show anything useful.

To Reproduce
Access the /.well-known/openid-configuration endpoint via a web browser.

Expected behavior
To see the JSON rendered.
Log output/exception with stacktrace
We don't see any relevant error messages.
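A simple way to watch for the hang is to probe the discovery endpoint with a short client-side timeout, so a stuck request fails fast instead of waiting for the gateway's 504; a sketch, with the host name as a placeholder:

```csharp
// Simple probe (host name is a placeholder): a short client-side timeout turns a
// hung discovery endpoint into an explicit failure instead of a 504 from the gateway.
using System;
using System.Net.Http;
using System.Threading.Tasks;

using var client = new HttpClient { Timeout = TimeSpan.FromSeconds(10) };

try
{
    var response = await client.GetAsync(
        "https://your-app.azurewebsites.net/.well-known/openid-configuration");
    Console.WriteLine($"{(int)response.StatusCode} {response.ReasonPhrase}");
    Console.WriteLine(await response.Content.ReadAsStringAsync());
}
catch (TaskCanceledException)
{
    Console.WriteLine("Discovery request timed out after 10 seconds.");
}
```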