Dyalog / Jarvis

APL-based web service framework supporting JSON or REST
https://dyalog.github.io/Jarvis/
MIT License
34 stars 8 forks source link

Seemingly random crashes related to "OSError" "Undefined name: CON000000XX" #60

Closed masternerdguy closed 10 months ago

masternerdguy commented 11 months ago

I am very new to APL, but I have been working on a proof-of-concept web application to demonstrate the potential use of Dyalog / Jarvis to others. However, I keep getting these seemingly random crashes - sometimes it only takes a minute, sometimes it takes half an hour, but Jarvis always seems to exit with a cryptic "OS / Undefined Name" error. I could be wrong, but I don't think this is related to the code I am running. I've tried hitting my endpoints with good data and error cases, but I can't find a pattern.

I've attached the error I've been receiving along with the code I am running under Jarvis. I am using the Jarvis docker container, and have included the command invoking it. Is this an issue with Jarvis itself, or just a problem with the code I'm running?

apl-birdie-app.zip output.txt Screenshot_2023-12-21_19-03-44

For what its worth, I've tried this both under WSL and Xubuntu and it seems to behave the same in both cases.

Let me know if you have any questions :)

bpbecker commented 11 months ago

Thanks for reporting this... can you give me more information about the client that's accessing Jarvis? For instance, in your example it's failing on CON00000026, which means that there were 26 previous connections (CON0000000-CON00000025). I'm trying to find a consistent reproduction of the error - so any additional information you can provide might help! Thanks again!

/Brian

masternerdguy commented 11 months ago

Brian

Here is a screenshot of the setup:

Screenshot_2023-12-23_00-05-07

As you can see, I have Firefox browsing to Jarvis on port 7777 with the "ListBirdies" endpoint selected. This endpoint receives an empty JSON payload, and I pressed the "Send" button about a dozen times (a few seconds apart each time) and then waited a couple of minutes. After waiting a couple of minutes, the error occurred.

I did notice that error contained an additional line about a socket closed while receiving data, so I restarted Jarvis and repeated the same procedure, which gave the original error:

Screenshot_2023-12-23_00-11-13

Something I have observed is that CON00000000 seems to be the most frequent one mentioned, regardless of the number of requests.

To summarize, it can be reproduced by pressing "Send" about a dozen times, waiting a few seconds between requests, with an empty JSON document on the "ListBirdies" endpoint. After doing so, and waiting, it eventually crashes.

Let me know if you have any more questions!

bpbecker commented 11 months ago

Thanks for the feedback! Curiously, this only seems to happen when Jarvis is running in the Docker container. I've been able to replicate the symptoms running the container under Docker for Windows. I can run Jarvis directly under Ubuntu, Windows, even the Raspberry Pi without this issue showing up. The investigation continues...

bpbecker commented 10 months ago

Fixed in 9cbec10 (released in v1.16.3)

This was an interesting problem - after pouring over the Jarvis code, I couldn't find anything that might cause the conditions that result in this error. I was unable to recreate it on my laptop. However, when I ran it in a container under Docker for Windows, I was able to recreate it. It turned out that it has to do with when a connection times out. The default is to time out and close a connection after 30 seconds of inactivity. What happens is:

In the container, I hacked in some additional debugging/diagnostics and was able to find that there's a delay between when we close the connection (Conga marks it for closure at that point) and when Conga has actually cleaned up its object tree to remove the connection. I believe what was happening was that we drop back into the listening loop before Conga removed the connection and when the connection is actually removed, it's signalling a "Closed" event on a connection that no longer exists. When we get a "Close" event, we remove the connection from the tracking mechanism. But since the connection was previously removed, we were getting the value error.

I was able to witness the problem occurring on another person's system that wasn't running Docker. The reason I wasn't able to reproduce the problem on my laptop (without Docker) is that my laptop is fast enough for Conga to clean things up in its own thread before we drop back into the listening loop. So, the problem was only related to Docker in that Docker container created an environment where the timing would cause the circumstances to more consistently occur.

So, there's a two-pronged approach to addressing this:

The Dyalog/Jarvis container is updated whenever we make a commit to the Jarvis master branch and as such now contains this fix.

Thanks for your help and patience! /Brian

mkromberg commented 10 months ago

I just wanted to add that, although I agree that Conga should not signal a closed event, there is a small timing gap which would allow this problem to occur very infrequently without this behaviour. If the client closed the connection in the split-second before the server decides to close the connection due to inactivity. In this case the Close event might legitimately be in the Wait queue AFTER the server has closed the connection.

masternerdguy commented 10 months ago

Just to close the loop, I can confirm the issue is no longer occurring:

Screenshot_2024-01-14_15-29-34

Thank you so much for fixing it and for the excellent explanation of the root cause!