Closed: kmccurley closed this issue 1 year ago
I am unable to recreate this on my desktop - it only seems to happen in the wsgi environment of publish.iacr.org. I suspected it had something to do with how the ThreadPoolExecutor is managed within the wsgi processes. I switched the line to

```
WSGIDaemonProcess iacrpub processes=1 threads=20
```

so that there is only one process with 20 threads, but it still occurred. We may want to switch from ThreadPoolExecutor to Flask-Executor, so that we can inherit the flask context and write to the log inside the thread that is running the task.
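If we did switch, I imagine it would look roughly like this (an untested sketch; `run_latex_task` is our task, and `EXECUTOR_MAX_WORKERS = 1` would preserve the one-compilation-at-a-time behavior):

```python
from flask import Flask
from flask_executor import Executor

app = Flask(__name__)
app.config['EXECUTOR_MAX_WORKERS'] = 1   # still compile one paper at a time
executor = Executor(app)

def run_latex_task(paperid):
    # Flask-Executor copies the application context into the worker
    # thread, so the app logger is usable inside the task.
    app.logger.info('compiling %s', paperid)
    ...

# in the submit handler:
# executor.submit(run_latex_task, paperid)
```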
What I have noticed is that the task_queue doesn't have an entry for the paperid, which makes the server think the compilation should have finished. The error message "status is unknown" is produced in the live code (not checked in).
This bug is driving me nuts. I've just submitted 25 times in a row on my desktop, using different cls files and different zip files, some of which compile and some of which do not. I now think it only occurs on publish.iacr.org because of something in mod_wsgi/apache or docker.
I still haven't found a solution to this. Today I thought maybe it was because when Python writes to a file, the OS may still buffer the output. I changed the writing of the compilation.json file so that it calls os.fsync() on the file descriptor of the output file before popping the paperid from the task_queue. My thinking was that this should guarantee the file is on disk before the web server refers to compilation.json for definitive information about what happened with the compilation. Unfortunately I still saw the same error - when the web server responds to /tasks, it reads an old version of the file.
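For reference, the fsync change amounts to something like this (a minimal sketch; the function name is illustrative, not what's in the repo):

```python
import json
import os

def write_compilation_json(path, data):
    with open(path, 'w') as f:
        json.dump(data, f)
        f.flush()              # drain Python's userspace buffer
        os.fsync(f.fileno())   # ask the OS to push its cache to disk
    # only after this returns do we task_queue.pop(paperid, None)
```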
It's hard for me to see how it could happen, but maybe it involves the scheduling of IO operations in flask/mod_wsgi/python. According to this document, when Python performs an IO operation it releases the Global Interpreter Lock (GIL), which allows other threads to run. That still doesn't explain how the paperid could be removed from the task_queue before the write operation completes, because the line `task_queue.pop(paperid, None)` appears at the end of the task performed for the compilation.
The design should perhaps be reviewed by an independent party (Joppe?). The general idea is that task_queue is a global that stores a map from paperid to the Future that is executing the compilation. We need to keep this around to make sure that you can't issue two compilations for the same paper, and to keep a queue of jobs that have been queued for compilation. The ThreadPoolExecutor has only a single thread, so it can only compile one paper at a time (this constrains the load on the server and queues the jobs up). When the executor runs a job, it executes tasks.py::run_latex_task(), and the last thing that does is pop the paperid from the task_queue. There are theoretically other places where the paperid is popped from the task_queue, but only if the thread dies for some reason (e.g., if an unexpected exception occurred or the task was cancelled). I have never seen these other cases occur - they are only there for extra safety.
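In code, the design is roughly this (my reconstruction for the reviewer's benefit; names are illustrative and simplified):

```python
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)  # one compilation at a time
task_queue = {}                               # global: paperid -> Future

def submit_compilation(paperid):
    if paperid in task_queue:
        return None                # refuse a second compilation of this paper
    future = executor.submit(run_latex_task, paperid)
    task_queue[paperid] = future
    return future

def run_latex_task(paperid):
    ...                            # docker compile, copy staging -> output,
                                   # write compilation.json
    task_queue.pop(paperid, None)  # the very last thing the task does
```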
Unfortunately when the browser polls for the paperid, it doesn't show up in the task_queue any more because it was popped. I had assumed that would mean the write() in run_latex_task() had completed and the output had been flushed through the OS. There is evidently some other optimization going on that prevents the wsgi server from seeing the latest version of the file.
One thing we could try is to remove the task_queue and use some other file-locking mechanism to protect against double compilation. That seems more dangerous to me.
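If we went that way, I imagine something like an flock-based guard (a purely hypothetical sketch; none of these names exist in the repo):

```python
import fcntl
import os

def try_lock(paperid, lock_dir='/var/lock/latex-submit'):
    """Best-effort per-paper lock; returns an fd to hold, or None."""
    os.makedirs(lock_dir, exist_ok=True)
    fd = os.open(os.path.join(lock_dir, paperid + '.lock'),
                 os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd         # caller holds the lock; close fd to release
    except BlockingIOError:
        os.close(fd)
        return None       # another compilation is already running
```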
I've now observed this in the wild on publish.iacr.org. I was watching with ls -ltr data/ in a loop.
I will try to review this as well. I just tried to submit and got the error on the first try: "Paper was not compiled: no directory /var/www/wsgi/latex-submit/webapp/data/i6zp9d6/candidate/output. This is a known bug and you should try reloading this page." My gut feeling is also scheduling of IO operations (either by Docker or by something else). Let me see if I can find anything...
I think the problem occurs after docker runs: shutil.copytree copies staging to output, and the compilation.json file is written after docker finishes. I had a longer explanation of my suspicions in email.
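In other words, the tail end of the task looks roughly like this (simplified paths, illustrative function name):

```python
import json
import shutil
from pathlib import Path

def finish_compilation(paper_dir: Path, result: dict):
    # after docker exits:
    shutil.copytree(paper_dir / 'staging', paper_dir / 'output')
    (paper_dir / 'compilation.json').write_text(json.dumps(result))
    # only after this does run_latex_task pop the paperid, so /view
    # should find the output directory - yet sometimes it doesn't.
```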
I think this was fixed by adding a database to keep track of the compilations.
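For anyone landing here later, the idea behind the fix is roughly this (a hypothetical sqlite3 sketch, not the actual schema or code used):

```python
import sqlite3

con = sqlite3.connect('compilations.db')
con.execute("""CREATE TABLE IF NOT EXISTS compilations
               (paperid TEXT PRIMARY KEY, status TEXT)""")

def start_compilation(paperid):
    try:
        with con:  # the atomic INSERT doubles as the duplicate check
            con.execute("INSERT INTO compilations VALUES (?, 'running')",
                        (paperid,))
        return True
    except sqlite3.IntegrityError:
        return False  # a compilation for this paper is already recorded
```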
There appears to be a race condition in the submission flow. While the user is waiting for their paper to compile, the browser is polling /tasks/<paperid>. This polls the task_queue to see if a task is still running. The task is running tasks.py::run_latex_task, and when it finishes it should have created the output directory under data/<paperid>/output. When the browser detects that the task is finished, it redirects to /view/<paperid>, which looks for the output directory but doesn't find it. If you refresh a few seconds later, it finds it.
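The polling side, simplified (a sketch of the flow described above, not the actual handlers):

```python
from flask import Flask, jsonify

app = Flask(__name__)
task_queue = {}  # paperid -> Future, as described above

@app.route('/tasks/<paperid>')
def task_status(paperid):
    if paperid in task_queue:
        return jsonify(status='running')
    # Gone from the queue: the browser redirects to /view/<paperid>,
    # which then expects data/<paperid>/output to exist already.
    return jsonify(status='finished')
```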
The path to create the output directory is complicated: docker runs the compilation in a container, shutil.copytree copies staging to output, and compilation.json is written after docker finishes.
Unfortunately I sometimes notice that /view/<paperid> does not find the output_dir. We should probably do better logging to see what is going on, but accessing the flask app logger from inside a thread of the executor seems difficult.
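One workaround I'm aware of is to capture the real app object before submitting and push an app context inside the task (a sketch, untested; the wrapper name is made up):

```python
from flask import current_app

def submit_with_logging(executor, paperid):
    app = current_app._get_current_object()  # the app itself, not the proxy

    def task():
        with app.app_context():
            app.logger.info('run_latex_task starting for %s', paperid)
            run_latex_task(paperid)          # assumed importable here

    return executor.submit(task)
```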