Closed DaveWysochanskiRH closed 2 years ago
This PR, https://github.com/abrt/retrace-server/pull/448, may help, since it implements the early permissions check. In our retrace-server environment, about 25% of failed tasks were due to improper permissions.
@DaveWysochanskiRH Did #448 solve this problem? Can we close this issue now?
@mgrabovsky I'm not 100% sure it's solved, because it may take some time to show up again and it's only been 3 weeks. I'll close for now and re-open if needed.
Unfortunately, it looks like this is still a problem, as I've just gotten another task stuck in 'available'. I'll see if I can take a look at this in the next several weeks and come up with at least a debugging strategy, if not a possible patch based on code inspection. The task only has the following files:
crash_cmd managed password remote results type url
Note that 'results' is a directory and it is empty.
Ok, I think I see what is happening, at least partially. We had a couple of these 'stuck' tasks, so I went and looked at one of them. It looks like the tasks get down to the following lines, where the 'start' URL is returned to the request:
486 elif filename and filename == "__custom__":
487     POST = request.POST
...
546     task.add_remote(POST["custom_url"])
547     task.set_managed(True)
548     task.set_url("%s/%d" % (match.group(1), task.get_taskid()))
549
550     starturl = "%s/%d/start" % (match.group(1), task.get_taskid())
551     if qs_base:
552         starturl = "%s?%s" % (starturl, "&".join(qs_base))
553
554     return response(start_response, "302 Found", "", [("Location", starturl)])
555
At that point, retrace-server / httpd expects that either the tool or the browser that initiated the original __custom__ request will then send the 'start' URL for the task, so that that portion of the manager gets executed. But for some reason, 'start' never gets issued. If you do this manually, for example by navigating to the 'start' URL (https://retrace-server.example.com/manager/123456789/start), then the task progresses as it should have.
So the next question is: which client initiated the request (a browser, or the retrace-server-task create tool)? Hmmm, if I'm reading https://en.wikipedia.org/wiki/HTTP_302 and https://en.wikipedia.org/wiki/HTTP_303 correctly, I think we may need to change the return code on line 554 above from 302 to 303. I wonder if there's a browser that is not handling 302 correctly and issuing a POST to the 'start' URL rather than a GET? It does not look like there is any error in the retrace-server-task create tool, at least not one I can spot.
230 elif match.group(6) and match.group(6) == "start":
231     # start
232     GET = request.GET
Looking at the httpd logs for the last two tasks stuck in 'available', a couple of findings:
As a next step in debugging this, I've added a new LogFormat line to Apache's configuration, so that hopefully we can see the "Location" URL being sent back, which should contain the 'start' URL that is missing. This is the line I added:
LogFormat "%h %l %u %t \"%r\" %>s %b %T %{UNIQUE_ID}e %{Location}o" custom1
Reference https://httpd.apache.org/docs/current/mod/mod_log_config.html
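To make these custom log lines easier to search, here is a small, hypothetical parser for the format above (the regex assumes the field order `%>s %b %T %{UNIQUE_ID}e %{Location}o` after the quoted request, matching the LogFormat line; it is a convenience sketch, not part of retrace-server):

```python
import re

# Matches the tail of the custom1 LogFormat above:
#   "%r" %>s %b %T %{UNIQUE_ID}e %{Location}o
LOG_RE = re.compile(
    r'"(?P<request>[^"]*)"\s+(?P<status>\d{3})\s+(?P<bytes>\S+)\s+'
    r'(?P<secs>\d+)\s+(?P<unique_id>\S+)\s+(?P<location>\S+)\s*$'
)

def location_of(line):
    """Return the Location header logged for this request, or None
    if the field is absent ('-') or the line does not match."""
    m = LOG_RE.search(line)
    if m is None or m.group("location") == "-":
        return None
    return m.group("location")
```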
I'm pretty sure I know the root cause of this now: it's a double click on the "Create vmcore task" button of the manager page. This is based on adding the previous log format and having a couple of instances of the bug happen. In the access_log I see entries like the ones below, where 999999999 is a task that gets into this state. Note the two "POST" requests issued within nearly the same second? Only the second task, 291000798, gets started properly; the first one, 999999999, is just created and left stuck in the "available" state.
# grep -B 2 -A 2 999999999 ssl_access_log*
ssl_access_log-10.1.1.1 - - [01/Apr/2022:23:01:38 -0400] "POST /manager/__custom__ HTTP/1.1" 401 381 0 Yke8kkxzd8i@oXL1fxUjuwAAANE -
ssl_access_log-10.1.1.1 - - [01/Apr/2022:23:01:39 -0400] "POST /manager/__custom__ HTTP/1.1" 401 381 0 Yke8kz0HY1tf0ulzgrDx0wAAAFA -
ssl_access_log:10.1.1.1 - user@EXAMPLE.COM [01/Apr/2022:23:01:38 -0400] "POST /manager/__custom__ HTTP/1.1" 302 - 0 Yke8kkxzd8i@oXL1fxUjvAAAAME https://retrace.example.com/manager/999999999/start?md5sum=md5sum&debug=debug
ssl_access_log-10.1.1.1 - user@EXAMPLE.COM [01/Apr/2022:23:01:39 -0400] "POST /manager/__custom__ HTTP/1.1" 302 - 0 Yke8kz0HY1tf0ulzgrDx1AAAAFE https://retrace.example.com/manager/291000798/start?md5sum=md5sum&debug=debug
ssl_access_log-10.1.1.1 - - [01/Apr/2022:23:01:40 -0400] "GET /manager/291000798/start?md5sum=md5sum&debug=debug HTTP/1.1" 401 381 0 Yke8lD0HY1tf0ulzgrDx1QAAAFI -
The fix, I think, is something like this: https://www.the-art-of-web.com/javascript/doublesubmit/
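The linked article describes a client-side fix (disabling the submit button after the first click). As a complementary illustration only, a server-side dedup guard could also reject a second identical __custom__ POST from the same user within a short window. This is a self-contained sketch with invented names, not retrace-server code:

```python
import threading
import time

class DoubleSubmitGuard:
    """Hypothetical server-side guard: treat a repeated POST with the
    same (user, payload hash) inside a short window as a duplicate
    button click and reject it."""

    def __init__(self, window=5.0):
        self.window = window           # seconds within which a repeat is a dup
        self._seen = {}                # (user, payload_hash) -> last timestamp
        self._lock = threading.Lock()

    def allow(self, user, payload_hash, now=None):
        """Return True if this submission should be processed."""
        now = time.monotonic() if now is None else now
        key = (user, payload_hash)
        with self._lock:
            last = self._seen.get(key)
            self._seen[key] = now
            return last is None or now - last > self.window
```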
I've noticed we occasionally get vmcore tasks stuck indefinitely in 'Available'. I've not been able to root cause this, but suspected it may be due to improper submission of NULL vmcore paths, which would generate tasks that immediately fail. I submitted PR https://github.com/abrt/retrace-server/pull/438 to reject these tasks and fail with an HTTP error code, which fixed the tasks that fail in that manner.
Unfortunately, that PR didn't fully resolve the problem of "Available" tasks, and we are still getting some of them. One other "early failure" mode I've seen is tasks failing due to read-permission problems on the vmcore path: retrace-server's httpd or worker processes do not have access, so the task immediately fails with 'no such file or directory'.
I wonder if we could at least fix the path problem by checking the path inside the httpd process and failing early, like PR 438 does. However, I think that might be riskier and end up failing tasks unnecessarily, so it needs a bit more thought.
A first step in root causing why tasks get stuck in 'available' might be some additional debugging, as we don't have much information on why they end up stuck there.