vmcore tasks permanently stuck in 'available' status on manager page for unknown reason

DaveWysochanskiRH commented 2 years ago

I've noticed we occasionally get vmcore tasks stuck indefinitely in 'Available'. I've not been able root cause, but suspected this may be due to improper submission of NULL vmcore paths that would generate tasks that would immediately fail. I submitted PR https://github.com/abrt/retrace-server/pull/438 to reject these tasks and fail via a HTTP error code, which fixed the tasks that fail in that manner.

Unfortunately that PR didn't fully resolve the problem of "Available" tasks and we are still getting some of these tasks. One other "early failure" mode of tasks I've seen is tasks failing due to read path permissions issues on the vmcore - retrace-server's httpd or worker tasks do not have access so the task immediately fails with 'no such file or directory'.

I wonder if we could at least fixup the path problem by checking the path inside the httpd process and failing early like that PR 438. However, I think that might be more risky and end up failing tasks unnecessarily, so needs a bit more thought.

A first step in root causing why the tasks get stuck in 'available' might be some additional debugging as we don't have much info on why these get stuck there.

DaveWysochanskiRH commented 2 years ago

This https://github.com/abrt/retrace-server/pull/448 may help since it implements the early permissions check. In our retrace-server environment, I counted about 25% of failed tasks was due to improper permissions.

mgrabovsky commented 2 years ago

@DaveWysochanskiRH Did #448 solve this problem? Can we close this issue now?

DaveWysochanskiRH commented 2 years ago

@mgrabovsky I'm not 100% sure it's solved because it may take some time to show up again and it's been 3 weeks. I'll close for now and re-open if needed.

DaveWysochanskiRH commented 2 years ago

Unfortunately it looks like this is still a problem as I've just got another task stuck in 'available'. I'll see if I can take a look at this in the next several weeks and come up with a strategy at least on debug, if not a possible patch based on code inspection. Task only has the following files: crash_cmd managed password remote results type url Note that 'results' is a directory and it is empty.

DaveWysochanskiRH commented 2 years ago

Ok, I think I see what is happening at least partially. We had a couple of these 'stuck' tasks, so I went and looked at one of them. It looks like the tasks get down to the following line(s) where it returns the 'start' URL to the request:

486     elif filename and filename == "__custom__":
487         POST = request.POST
...
546         task.add_remote(POST["custom_url"])
547         task.set_managed(True)
548         task.set_url("%s/%d" % (match.group(1), task.get_taskid()))
549 
550         starturl = "%s/%d/start" % (match.group(1), task.get_taskid())
551         if qs_base:
552             starturl = "%s?%s" % (starturl, "&".join(qs_base))
553 
554         return response(start_response, "302 Found", "", [("Location", starturl)])
555

At that point, retrace-server / httpd expects that either the tool or the browser that initiated the original __custom__ URL will and then send the 'start' URL for the task, and so that portion of the manager will get executed. But for some reason, 'start' never gets issued. If you do this manually such as navigating to the 'start' URL (https://retrace-server.example.com/manager/123456789/start) then the task will progress as it should have.

So next question(s) are:

Is this someone's browser, or is it a commandline tool (such as retrace-server-task create)?
What goes wrong that causes the 'start' URL not to be sent to retrace-server?

DaveWysochanskiRH commented 2 years ago

Hmmm, if I'm reading https://en.wikipedia.org/wiki/HTTP_302 and https://en.wikipedia.org/wiki/HTTP_303 correctly, I think we may need to change the return code on line 554 above from 302 to 303. I wonder if there's a browser that is not handling 302 correctly but trying a POST operation with the 'start' URL rather than a GET? It does not look like there is any error in the retrace-server-task create tool, at least not one I can spot.

230     elif match.group(6) and match.group(6) == "start":
231         # start
232         GET = request.GET

DaveWysochanskiRH commented 2 years ago

Looking at the httpd logs for the last two tasks stuck in available, a couple findings:

Both tasks have a remote IP address, which indicates that the 'manager' page was probably used, not a commandline tool.
One task there was a 'start' URL with GET request type, but the second one there was not. For the task with the start URL, it is not clear what happened and why the task did not start but it must have failed somewhere inside the 'start' portion of the 'manager.wsgi' code. Unfortunately it's not clear why it must have failed in actually starting the task. For the task that did not have the 'start' URL, it is not clear what happened.

DaveWysochanskiRH commented 2 years ago

As next steps in debugging this, I've added a new LogFormat line into apache's configuration to hopefully we can see the "Location" URL being sent back which should contain the 'start' URL that is missing. This is the line I added: LogFormat "%h %l %u %t \"%r\" %>s %b %T %{UNIQUE_ID}e %{Location}o" custom1

Reference https://httpd.apache.org/docs/current/mod/mod_log_config.html

DaveWysochanskiRH commented 2 years ago

I'm pretty sure I know root cause of this now - it's a double button click on the "Create vmcore task" of the manager page. This is based on adding the previous log format and having a couple instances of the bug happen. In the access_log I see entries like this where 999999999 is a task that gets into this state. Note the two "POST" commands that are nearly at the same second? Only the second task 291000798 gets started properly, the first one 999999999 is just created and stuck in "available" state.

# grep -B 2 -A 2 999999999 ssl_access_log*
ssl_access_log-10.1.1.1 - - [01/Apr/2022:23:01:38 -0400] "POST /manager/__custom__ HTTP/1.1" 401 381 0 Yke8kkxzd8i@oXL1fxUjuwAAANE -
ssl_access_log-10.1.1.1 - - [01/Apr/2022:23:01:39 -0400] "POST /manager/__custom__ HTTP/1.1" 401 381 0 Yke8kz0HY1tf0ulzgrDx0wAAAFA -
ssl_access_log:10.1.1.1 - user@EXAMPLE.COM [01/Apr/2022:23:01:38 -0400] "POST /manager/__custom__ HTTP/1.1" 302 - 0 Yke8kkxzd8i@oXL1fxUjvAAAAME https://retrace.example.com/manager/999999999/start?md5sum=md5sum&debug=debug
ssl_access_log-10.1.1.1 - user@EXAMPLE.COM [01/Apr/2022:23:01:39 -0400] "POST /manager/__custom__ HTTP/1.1" 302 - 0 Yke8kz0HY1tf0ulzgrDx1AAAAFE https://retrace.example.com/manager/291000798/start?md5sum=md5sum&debug=debug
ssl_access_log-10.1.1.1 - - [01/Apr/2022:23:01:40 -0400] "GET /manager/291000798/start?md5sum=md5sum&debug=debug HTTP/1.1" 401 381 0 Yke8lD0HY1tf0ulzgrDx1QAAAFI -

Fix I think is something like this: https://www.the-art-of-web.com/javascript/doublesubmit/

abrt / retrace-server

vmcore tasks permanently stuck in 'available' status on manager page for unknown reason #446