TrilbyWhite / Slider

PDF presentation tool
GNU General Public License v3.0
54 stars 12 forks source link

Race between cleanup() in main thread and render_threaded() #7

Closed luksow closed 11 years ago

luksow commented 11 years ago

Hi,

I believe that there is a minor race condition between cleanup() and render_threaded(). For quite large PDF (ex. 80 pages), if you open it and then close it immediately there is a chance that cleanup() will be invoked during render_threaded() which leads to various kinds of SEGV.

It happens only when quitting, so it is not urgent as it just prints garbage and do not do anything harmful I suppose. However, I noticed that you pay attention to code quality, so I thought you should know about it :).

Fixing it is somehow difficult. Joining main thread to render_threaded() in cleanup() is rather bad idea as it will unnecessarily slow down quitting (with large PDFs rendering), making users angry :). I would terminate rendering loop gracefully (with simple flag) and then join threads.

Anyway it is a minor problem, so I won't be unhappy if you close this ticket without fixing ;)

BTW. Slider is very nice piece of software! It is a pity that I have not found it before! BTW2. I found this bug while trying my MSc thesis tool for testing concurrent programs, so you may fill a bit like a guinea pig ;).

TrilbyWhite commented 11 years ago

Thanks!

I have been recent revising the cleanup code as it was causing many such problems - I'm not surprised that I missed some.

I think I will implement something with the flag suggestion. I need to more cleanly separate what the rendering thread and the main thread should each be responsible for freeing. It seems simple enough, but there are several cases that require specific checks.

TrilbyWhite commented 11 years ago

And I'm also happy to be a guinea pig. Please do let me know if you find other ways to make slider more robust.

It is in active development, and so likely full of bugs - but I aim to crush them all.

TrilbyWhite commented 11 years ago

I had pushed some revisions to the cleanup function shortly before you posted this bug. Before thoe revisions I was seeing the errors you describe, but with the revisions I am unable to replicate the error.

Can you test again with the most recent update?

luksow commented 11 years ago

First of all, thanks for being interested!

I am pretty sure that I tested most recent version from master branch (and I have just checked it) - that is how I found your project ;). This bug is hard to observe because time span between cleanup() and return 0 in main thread is extremely short. After main thread quits, render_threaded() is automatically killed so most of times it will not have enough time to protest. Bug manifestation is basically dependant on scheduler and number of CPUs... So it is hard to reproduce & debug.

Good approximiation of this bug you will get if you:

  1. Add sleep() just after cleanup() call - it will simulate (really long) thread switch which is not impossible.
  2. Compile.
  3. Open quite big PDF ~ 2 megs should be enough but it is also machine dependant.
  4. CTRL+Q immediately after Slider starts.
  5. Voila!

With steps above I am able to reproduce it everytime. Still, remember that its occurrence is rather rare so this fix is mostly about being pedantic.

TrilbyWhite commented 11 years ago

Thanks, now I can replicate it - a fix should be on the way soon.

luksow commented 11 years ago

Great, I am glad I could help :).

TrilbyWhite commented 11 years ago

I found the source of the problem, and I believe I fixed it (can no longer replicate here). Please test.

TrilbyWhite commented 11 years ago

Eh, nevermind ... I hope no one actually saw the code that I thought was a solution. That has to be one of the dumbest things I've done.

It did avoid the problem under discussion, but it created a much worse problem.

Perhaps there's some redemption in getting it right the second time. I've added pthread_cancel + pthread_join calls which should do the trick.

luksow commented 11 years ago

I saw it, but did not have time to test it :). Anyway, what was the problem with it?

I find pthread_cancel to be rather tricky to use. So at least be cautious.

TrilbyWhite commented 11 years ago

The foolish 'solution' was to have a variable I called render_cancel (I think that's what I had called it). It would start at zero, then the cleanup function in render.c would set it to 1 then wait for it to be set to two. The rendering thread would check if it was set to 1, and if so it would set it to 2 and quit.

This worked perfectly as a solution to the problem. But then it occured to me: what would happen if one tried the much more common case of closing the program after rendering was done? If there is no rendering thread still running, then the variable would never be set to 2 and the while loop waiting for it would go on indefinitely.

luksow commented 11 years ago

Ah sure! I did not notice it at first ;).

I saw that you used pthread_cancel + pthread_join and I think it should work. I will do some heavy testing next week and let you know.

luksow commented 11 years ago

Hi again,

I believe that problem in question is now fixed (still, I will do some more tests on this matter).

However, I am a bit concerned about pthread_cancel you used. First of all, I am not sure where are cancellation points in render_threaded() but I suppose they may be hidden somewhere in library calls (cairo etc.) - if there is a cancellation after line 74: (https://github.com/TrilbyWhite/Slider/blob/de06e7c6047c84b28239afbe749cae7009b279de/render.c#L74) but before line 82 (https://github.com/TrilbyWhite/Slider/blob/de06e7c6047c84b28239afbe749cae7009b279de/render.c#L82) you end up with memory leak because some freeing functions will not be called (like cairo_surface_destroy etc.). Similar problem occurs in "sorter part". I run slider under valgrind and with "early exit" it reports some memory leaks which size is random (from few hundred bytes to megabytes even). Second thing is worse. After many tries I discovered that early exiting may cause slider to deadlock. It is rare (1/10 of early exits or something) but it occured to me a few times, especially when I nervously tapped CTRL+Q :). I took gdb backtrace from such situation (sorry for quality, but photo was easiest and fastest way with no X hung): 2013-05-27 04 26 41 It looks like there is locked mutex in XFreeGC(). I do not know why, but I suppose that it may be connected with pthread_cancel. If cancellation occurs somewhere in XLib with mutex locked it may remain that way. I am not sure about it, because I am not familiar with pthread_cancel, but check it out.

What I suggest to do is to drop pthread_cancel and use flag instead but with pthread_join (to avoid deadlocking :)). I think it will solve both problems.

If there is anything I could do, do not hesitate to ask :).

TrilbyWhite commented 11 years ago

I just got around to revising this - there actually wasn't much too it. I've used a flag and pthread_join.

Feel free to test and let me know if this works better.

luksow commented 11 years ago

Hey, you broke build because STOP_RENDER is undefined. However, if I define it, it works correctly and I think that all problems seem to go away. I am closing this bug.

Feel free to contact me if you have any more questions.

TrilbyWhite commented 11 years ago

Oh (&*#)$, yes, I called it END_RENDER in slider.h

Fixed now.