nyanpasu64 commented 5 months ago

Describe the bug

If I duplicate a module, then undo and redo certain operations, Darktable displays corrupted image contents in the navigation module and/or the main image view.

Steps to reproduce

Concurrency race condition

1. Duplicate an effect module instance (for example "color calibration" or "diffuse or sharpen").
2. Perform any action which leaves the duplicate copy enabled (eg. disable the original, disable an exposure module before the original and duplicate, etc.)
3. Undo and redo the action (Ctrl+{Z, Y}) in quick succession.
- I have not been able to replicate the bug by undoing and redoing the "duplicate instance" action (in 10-ish tests).
- The bug does not appear (in 20+ tests) if the final undo/redo action leaves the duplicate copy inactive (even if the original is active). It also does not appear (in 20+ tests) if you wait for the first undo to finish before initiating a redo (or vice versa).

Around 20% of the time, the duplicate second (top) module will become "corrupted" in such a way that whenever it's enabled, the resulting image appears garbled and filled with random colors (similar to the ECB penguin). The corruption will generally appear in either the navigation module or the main image view (randomly-ish).

Disabling the original copy of the module does not fix the problem; disabling the copy will hide the problem until it's reenabled (tested in 10+ occurrences of the bug). Altering module parameters does not fix the corruption. Undoing or redoing ends the corruption permanently (until you perform the steps again).

Enabling a uniform blend mask for the glitched module (or any previous module) changes the output of the corruption (due to floating point rounding errors, or memory buffers?), and dragging the slider changes the appearance as well.

Logic error

A more serious variant of this bug is both reproducible deterministically (rather than being a timing bug with a random success rate), and persists across program restarts. It leaves the duplicate instance permanently active and corrupted, but hidden from the UI.

1. Duplicate an instance (eg. "color calibration" or "exposure").
2. Delete the original.
3. Undo and redo deleting the original.
4. Undo deleting the original again. (BUG: it does not appear in the active module list, and does not take effect!)
5. Redo and undo deleting the original.

At this point, both the image and navigation preview appear corrupted. You cannot search for the duplicated module. Undoing or redoing hides the corruption, but if you close darktable while the corruption is active, the effects become "permanent".

If you change module parameters, the problem does not go away.
The duplicated module (eg. "color calibration") does not appear in the "show only active modules" tab. After restarting the program, searching for the module works for "color calibration" but not "exposure".
If you can find the original module (without the 1 in the duplicate's name), it may already be switched off.
Deleting it does not fix the corruption (as the corruption is caused by the hidden duplicate module). But if you restart after deleting the original module, the duplicate (with 1 after the name) will appear and function like normal.

Expected behavior

darktable should not corrupt images when performing undo/redo operations with duplicate modules.

Logfile | Screenshot | Screencast

darktable-log-race.txt: Triggering the race bug twice in the navigation module.

darktable-log-logic.txt: Triggering the logic error on two program runs (the first time it didn't produce visible corruption somehow), then restarting darktable and tweaking parameters.

Commit

No response

Where did you obtain darktable from?

downloaded from www.darktable.org

darktable version

darktable 4.6.1

What OS are you using?

Windows

What is the version of your OS?

Windows 11 22H2 (a bit outdated?)

Describe your system?

No response

Are you using OpenCL GPU in darktable?

Yes

If yes, what is the GPU card and driver?

RX 570 4GB, Adrenalin 23.11.1

Please provide additional context if applicable. You can attach files too, but might need to rename to .txt or .zip

Did not try other versions.
Only tested on RAW.
Reproducible on fresh edit.
Logic error sidecar: IMG_20240321_003935.dng.xmp.txt (remove .txt)
Did not try temporary config dir.
No Lua scripts used.

jenshannoschwalm commented 5 months ago

@nyanpasu64 from my understanding this might be due to wrong pixel pipe cache data in the special pipe used for duplicate and others. I believe this is already fixed in master. Would you be able to test on master? A log with '-d pipe' would be sufficient.

Maybe I didn't test correctly but was not yet able to reproduce here.

nyanpasu64 commented 5 months ago

Still happens on the latest Windows nightly 20240326.

If I corrupt a "diffuse or sharpen" effect, parts of the image turn black.

Screenshot: darktable_JC5o6IeLVj

With blend mask enabled, black areas have stripes with a period of 4 pixels (actually 5 pixels in my screenshot because I'm running at 125% DPI scaling? Does darktable not display image data using physical display pixels?), and resizing the window changes the offset of each row of pixels relative to the previous row. Non-black pixel data seems to be passed through a pseudo-random function (in 4-pixel blocks?).

darktable-log.txt

EDIT: If I instead pull off this bug on a duplicated Exposure effect (at 100% DPI scaling), three out of every 4 pixels are replaced with a solid color (which varies randomly as I move the exposure slider of the bugged module), while the remaining pixels show the underlying image more or less (whose brightness does not change with the exposure slider): darktable_Y6ojAGNRQK

With a small enough window or high enough zoom level (so few enough pixels to calculate), I get corrupted colors (with some limited periodicity of 4 "fat" pixels) rather than a solid color: darktable_Tt24rf8Rjr

EDIT2: Bug still occurs with OpenCL off.

nyanpasu64 commented 5 months ago

In my RelWithDebInfo Windows build of darktable, interestingly I'm able to reproduce this bug with near 100% success rate (by toggling lens correction then undo/redoing, but it's a no-op because my self-built Darktable can't find the lensfun databases).

Logs at https://gist.github.com/nyanpasu64/69691066a1072168884c13259c6070b8 (see below for modified source code).

When the graphical corruption appears on screen, it seems the pixel pipe worker (dt_dev_process_image_job -> ... _dev_pixelpipe_process_rec) is erroneously called with exposure.1 as the final list element in pipe->iop (the final module passed to the outermost call to _dev_pixelpipe_process_rec()). Not sure why it happens?

Oddly enough, when I undo and redo a single operation, darktable spawns three [full] pipes, but the first two exit early because dt_iop_breakpoint(dev, pipe) is set. The second and third pipes erroneously have exposure.1 as the final operation (rather than at its correct spot in operation history).
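The early exit described above can be pictured as a cooperative cancellation check at each stage boundary. The following is only a minimal sketch under stated assumptions: `toy_pipe_t`, `toy_pipe_request_stop()` and `toy_pipe_run()` are hypothetical names for illustration and are not darktable's structures or the real `dt_iop_breakpoint()` logic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a pixelpipe; not darktable's real struct. */
typedef struct {
  atomic_bool shutdown;  /* set by the GUI thread to abandon this run */
  size_t stages_done;    /* how many stages actually ran */
} toy_pipe_t;

static void toy_pipe_init(toy_pipe_t *pipe)
{
  assert(pipe != NULL);
  atomic_init(&pipe->shutdown, false);
  pipe->stages_done = 0;
}

/* GUI side: ask the worker to stop at the next stage boundary. */
static void toy_pipe_request_stop(toy_pipe_t *pipe)
{
  atomic_store(&pipe->shutdown, true);
}

/* Worker side: run up to n_stages, checking the flag before each one.
   Returns true if the run completed, false if it exited early. */
static bool toy_pipe_run(toy_pipe_t *pipe, size_t n_stages)
{
  pipe->stages_done = 0;
  for(size_t i = 0; i < n_stages; i++)
  {
    if(atomic_load(&pipe->shutdown)) return false; /* early exit */
    pipe->stages_done++; /* a real stage would process pixels here */
  }
  return true;
}
```

A run started after `toy_pipe_request_stop()` returns `false` without executing any stage, which matches the observation that superseded pipes exit early rather than finishing their work.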

[x] I have not yet looked into what code builds pipe->iop with the wrong operation order (exposure.1 on top). The list is read in dt_dev_pixelpipe_process() via GList *modules = g_list_last(pipe->iop).

Panning around the image spawns new jobs sent to worker threads (resulting in a new call to _dev_pixelpipe_process_rec_and_backcopy() and a new gRender_pass value), but they share the same incorrect order of pipe->iop.

Fork with added logs at https://github.com/darktable-org/darktable/compare/master...nyanpasu64:darktable:debug-undo-pipe.

nyanpasu64 commented 5 months ago

Sometimes [worker thread] dt_dev_pixelpipe_create_nodes() (on the full or preview pipe) will clone dev->iop while [main thread] _pop_undo() is mutating it.

The (worker) clone occurs after (main) _pop_undo() assigns dev->iop = iop_temp; (according to C11+ UB rules this probably needs to be a relaxed atomic store?)...
but while (main) _pop_undo() -> dt_dev_reload_history_items() is mutating dev->iop.
- [ ] I can't find what code in dt_dev_reload_history_items() is performing this mutation (I'd probably insert more prints to narrow down what code is clashing with the worker thread). Unfortunately I may not be able to debug further tomorrow, due to life obligations.
- I do know (worker) pipe->iop = g_list_copy(dev->iop) returns an incorrect order if it executes while the main thread is running dt_dev_reload_history_items().
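For the pointer assignment itself, a release/acquire pair from C11 `<stdatomic.h>` is one way to publish a fully built list to another thread without a data race. This is a sketch only: `toy_head`, `toy_publish()` and `toy_last_order()` are hypothetical names, and the technique covers just the pointer swap. It does not help if the list is mutated in place after being published, which is what the later observations point to.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct toy_node_t {
  int order;
  struct toy_node_t *next;
} toy_node_t;

/* Shared list head (hypothetical analogue of dev->iop).
   Static storage, so it starts out as a null pointer. */
static _Atomic(toy_node_t *) toy_head;

/* Writer: the list must be fully built before it is published. */
static void toy_publish(toy_node_t *fully_built_list)
{
  assert(fully_built_list != NULL);
  atomic_store_explicit(&toy_head, fully_built_list, memory_order_release);
}

/* Reader: the acquire load pairs with the release store, so the node
   contents written before publishing are visible here. */
static int toy_last_order(void)
{
  const toy_node_t *n = atomic_load_explicit(&toy_head, memory_order_acquire);
  if(n == NULL) return -1;
  while(n->next != NULL) n = n->next;
  return n->order;
}
```

The reader either sees no list at all or a complete one, as long as nobody edits the nodes after `toy_publish()` has been called.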

If the (worker) clone occurs near the end of the (main) call to dt_dev_reload_history_items(), it will erroneously see most effects as disabled (including cloned ones), but will see the correct order of list nodes (with gamma at the end, rather than all cloned nodes). This does not produce visible on-screen corruption.

Maybe you need to lock the pipe during _pop_undo() and dt_dev_pixelpipe_create_nodes() (perhaps other functions too) so workers won't see an incorrect iop (or any other parameters too) while it's being initialized or later mutated?
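A minimal sketch of that locking idea, assuming a single mutex guards the module order: the main thread rewrites the order under the lock, and a worker takes a private snapshot under the same lock. `toy_dev_t` and its functions are hypothetical and use a plain array instead of darktable's GList.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define TOY_MAX_MODULES 8

/* Hypothetical shared state: module order guarded by one mutex. */
typedef struct {
  pthread_mutex_t lock;
  int order[TOY_MAX_MODULES];
  int count;
} toy_dev_t;

static void toy_dev_init(toy_dev_t *dev)
{
  pthread_mutex_init(&dev->lock, NULL);
  dev->count = 0;
}

/* Main thread: replace the whole order while holding the lock, so no
   reader can observe a half-updated list. */
static void toy_dev_set_order(toy_dev_t *dev, const int *order, int count)
{
  assert(count >= 0 && count <= TOY_MAX_MODULES);
  pthread_mutex_lock(&dev->lock);
  memcpy(dev->order, order, (size_t)count * sizeof(int));
  dev->count = count;
  pthread_mutex_unlock(&dev->lock);
}

/* Worker thread: copy the order into a caller-owned buffer of at least
   TOY_MAX_MODULES entries, under the same lock. Returns the count. */
static int toy_dev_snapshot(toy_dev_t *dev, int *out)
{
  pthread_mutex_lock(&dev->lock);
  const int count = dev->count;
  memcpy(out, dev->order, (size_t)count * sizeof(int));
  pthread_mutex_unlock(&dev->lock);
  return count;
}
```

The snapshot is consistent only if every writer goes through the same lock; any code path that edits the order without taking it reintroduces the race.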

jenshannoschwalm commented 5 months ago

Thanks a lot for this analysis. Will have a look into all this...

nyanpasu64 commented 5 months ago

If I understand correctly, the order of dev->iop is incorrect (ends with duplicate effects rather than actual processing order):

starting from dt_dev_reload_history_items() → dt_dev_pop_history_items(dev, 0) → dt_dev_pop_history_items_ext → dt_ioppr_check_duplicate_iop_order,
and is fixed by dt_dev_reload_history_items → dt_dev_read_history[_ext] → dt_ioppr_resync_modules_order().
I did not check when the enabled states are incorrect or fixed (I'm way in over my head, and already don't know what the code is doing to the order).

I'm not sure what all these functions are doing, but I'm pretty sure the worker threads shouldn't be reading dev while dt_dev_reload_history_items() is mucking with it.

[ ] I'm not sure if all the worker code needs to be protected from edits to dev...
- Worker code includes: dt_dev_pixelpipe_process() reads modules=pipe->iop,
- and _dev_pixelpipe_process_rec reads modules->data->... which is shared with dev->iop,
or just dt_dev_pixelpipe_create_nodes() (which cares about the list's order) needs to be protected (and you hope the data is initialized by the time the pipe actually gets processed).

How would you fix it? Keep in mind I very much do not understand the current code and locking well enough to state correct fixes to this problem.

[ ] Could you eg. wrap dev in a RWlock, and the workers read-lock it when reading module order/properties, and the main thread write-locks it upon undo/redo (and GUI operations?)
- If the GUI has write-locked and is editing dev, and any worker can't read-lock (since you want full and preview to render concurrently), it early-exits (similarly to if dt_iop_breakpoint(dev, pipe) is true)?
- And if the worker has read-locked dev, and dt_dev_reload_history_items() (or some other code) wants to write to it (but fails to try-write-lock), it sets a flag for all workers to exit before blocking on a write-lock?
[ ] Can you instead allocate and initialize a new dev before sending it to workers (and never mutating this thread-shared data again), instead of mutating the one being read?
- This would prevent the UI thread from blocking on an undo/redo operation while waiting for workers to exit (which may or may not be a concern in practice, and undo/redo operations already block the UI for half a second).
- Probably a bad idea for reasons I don't understand the code well enough to know?
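The first proposal (read-write lock with early exit) could look roughly like the sketch below, which uses POSIX `pthread_rwlock_tryrdlock()` so a worker gives up instead of blocking while an edit is in progress. All `toy_*` names are hypothetical, and the sketch leaves out the "ask running workers to exit" flag mentioned in the second bullet.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical shared state guarded by a read-write lock. */
typedef struct {
  pthread_rwlock_t lock;
  int module_count;
} toy_shared_t;

static void toy_shared_init(toy_shared_t *s)
{
  pthread_rwlock_init(&s->lock, NULL);
  s->module_count = 0;
}

/* GUI thread: take exclusive access for the duration of an edit. */
static void toy_begin_edit(toy_shared_t *s)
{
  const int rc = pthread_rwlock_wrlock(&s->lock);
  assert(rc == 0);
  (void)rc;
}

static void toy_end_edit(toy_shared_t *s)
{
  pthread_rwlock_unlock(&s->lock);
}

/* Worker: try to read; return false instead of blocking when the
   read lock cannot be taken (for example during an edit). */
static bool toy_try_read_count(toy_shared_t *s, int *out)
{
  if(pthread_rwlock_tryrdlock(&s->lock) != 0) return false;
  *out = s->module_count;
  pthread_rwlock_unlock(&s->lock);
  return true;
}
```

Several workers can hold the read lock at once, so full and preview pipes would still render concurrently; only an edit excludes them.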

github-actions[bot] commented 3 months ago

This issue has been marked as stale due to inactivity for the last 60 days. It will be automatically closed in 300 days if no update occurs. Please check if the master branch has fixed it and report again or close the issue.

nyanpasu64 commented 3 months ago

The bug still occurs on the latest nightly 4.7.0+1309~g40d9f8ec7b. Is there any progress towards it being fixed?

One new symptom I've observed is that sometimes the image will not appear visually fully corrupted, but merely have the wrong color cast. Repeating the undo/redo will fix the problem as before.

jenshannoschwalm commented 3 months ago

@TurboGit i finally found time to get more into this. Some things spotted in current code i'd like to discuss - maybe bells ringing on your side. BTW i also had issues like reported here

1. In the logs there was this "cannot get iop-order for diffuse instance 1" resulting in

153,9791 process                   CPU [full]           gamma                  (  48/   0) 1303x 956 scale=0,1713 --> (  48/   0) 1303x 956 scale=0,1713 IOP_CS_RGB
153,9832 process                   CL0 [full]           diffuse.1              (  48/   0) 1303x 956 scale=0,1713 --> (  48/   0) 1303x 956 scale=0,1713 IOP_CS_RGB

The bad boy seems to be in dt_ioppr_get_iop_order() - i had first instance deleted and here we have an error resulting in an order after gamma - INT_MAX

2. Could you comment on this in undo.c? Your commit on this says 'avoid race condition'
```
#define LOCK \
  dt_pthread_mutex_lock(&self->mutex); self->locked = TRUE

#define UNLOCK \
  self->locked = FALSE; dt_pthread_mutex_unlock(&self->mutex)
```

I think the `self->locked = FALSE` must be after the mutex unlocking because otherwise the variable would be FALSE before being unlocked ??? Couldn't we use a recursive mutex here?

3. I was surprised to see the non-mutex-protected variant `dt_undo_iterate_internal()` being used in history. Could you explain?
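To illustrate why an order of INT_MAX (point 1) would place a module after gamma, here is a small sketch in which a failed lookup returns INT_MAX and the modules are then sorted by order. The table, the names and `toy_get_order()` are made up for illustration and do not reproduce how `dt_ioppr_get_iop_order()` is implemented.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
  const char *name;
  int order;
} toy_module_t;

/* Hypothetical order table; "diffuse.1" is deliberately missing. */
static const toy_module_t toy_order_table[] = {
  { "exposure", 10 }, { "diffuse", 20 }, { "gamma", 30 },
};

/* Returns the table order, or INT_MAX when the lookup fails. */
static int toy_get_order(const char *name)
{
  const size_t n = sizeof(toy_order_table) / sizeof(toy_order_table[0]);
  for(size_t i = 0; i < n; i++)
    if(strcmp(toy_order_table[i].name, name) == 0)
      return toy_order_table[i].order;
  return INT_MAX;
}

/* qsort comparator: ascending by order, without integer overflow. */
static int toy_cmp(const void *a, const void *b)
{
  const int oa = ((const toy_module_t *)a)->order;
  const int ob = ((const toy_module_t *)b)->order;
  return (oa > ob) - (oa < ob);
}

/* Fill in each module's order by lookup, then sort ascending. */
static void toy_sort_modules(toy_module_t *mods, size_t n)
{
  assert(mods != NULL);
  for(size_t i = 0; i < n; i++) mods[i].order = toy_get_order(mods[i].name);
  qsort(mods, n, sizeof(toy_module_t), toy_cmp);
}
```

Sorting "gamma", "diffuse.1" and "exposure" with this sketch leaves "diffuse.1" last, after "gamma", which is the same shape as the log excerpt where diffuse.1 is processed after gamma.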

TurboGit commented 3 months ago

About 2.

I'm not sure it would make a difference about the self->locked place. The only code in undo.c using self->locked is:

    if(!self->locked)
    {
      LOCK;

So if self->locked is FALSE the thread will try to LOCK, it will be blocked until the mutex is actually unlocked.
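For comparison, the recursive mutex suggested earlier would let the owning thread re-lock without a separate `locked` flag. The sketch below uses the POSIX `PTHREAD_MUTEX_RECURSIVE` attribute directly. It does not use darktable's `dt_pthread_mutex_*` wrappers, and `toy_undo_t`, `toy_iterate()` and `toy_undo()` are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

typedef struct {
  pthread_mutex_t mutex;
  int depth; /* only touched while the mutex is held */
} toy_undo_t;

/* Initialise the mutex as recursive; returns 0 on success. */
static int toy_undo_init(toy_undo_t *u)
{
  assert(u != NULL);
  pthread_mutexattr_t attr;
  int rc = pthread_mutexattr_init(&attr);
  if(rc != 0) return rc;
  rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
  if(rc == 0) rc = pthread_mutex_init(&u->mutex, &attr);
  pthread_mutexattr_destroy(&attr);
  u->depth = 0;
  return rc;
}

/* Inner routine that takes the lock on its own. */
static int toy_iterate(toy_undo_t *u)
{
  pthread_mutex_lock(&u->mutex);
  u->depth++;
  const int seen = u->depth; /* nesting depth observed here */
  u->depth--;
  pthread_mutex_unlock(&u->mutex);
  return seen;
}

/* Outer routine that already holds the lock and calls the inner one. */
static int toy_undo(toy_undo_t *u)
{
  pthread_mutex_lock(&u->mutex);
  u->depth++;
  const int seen = toy_iterate(u); /* re-locks on the same thread */
  u->depth--;
  pthread_mutex_unlock(&u->mutex);
  return seen;
}
```

With a recursive mutex the same routine can be called both with and without the lock already held by the current thread, which is the situation the `if(!self->locked)` check is handling. Whether that is a good trade-off for undo.c is a separate question.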

TurboGit commented 3 months ago

About 3.

Because the only place where dt_undo_iterate_internal() is called is inside dt_undo_iterate() and there we LOCK & UNLOCK around the call.

TurboGit commented 3 months ago

Ok, about 3 I was wrong it is also used in history.c. This is a refactoring done long time ago... Maybe we can use the locking version of this routine now?

jenshannoschwalm commented 3 months ago

> Ok, about 3 I was wrong it is also used in history.c.

Sorry i wasn't more clear about my question - this is exactly what i was referring to.

nyanpasu64 commented 2 months ago

Image corruption seems fixed. I'm noticing UI lag when switching between lighttable/darkroom, or undoing/redoing. This may or may not be worse after merging the PR (I've gotten mixed results from testing before and after merging).

Switching to darkroom view hangs the UI for a full second on Windows (but not Linux). I suspect this is because GTK is slower on Windows (due to the Windows memory allocator, the UCRT64 compiler, or GTK's drawing or CSS code being slower on Windows).
- If I breakpoint darktable during this hang on Windows, it usually points to a GTK stack trace, but occasionally waiting on history_mutex.
- As a complicating factor, I found that (on Windows but not Linux) using Qt Creator to debug darktable using gdb (or lldb) slows the program down significantly, especially with breakpoints enabled.
On Windows, both before and after merging the PR, I get hangs varying from 0.2 to 0.5 seconds when switching images in darkroom view, and 0.5 to 1 second when undoing or redoing operations. These appear to be a combination of blocking on history_mutex and waiting on GTK operations.
- Breaking in Qt Creator reveals that the UI is blocked on dt_dev_distort_transform_plus trying to lock history_mutex (IDK if this PR holds it for longer at a time now).
- One time on Windows, testing the "before" revealed a deadlock where the UI is blocked forever on dt_dev_pop_history_items -> dt_pthread_mutex_lock(&dev->history_mutex) while the workers never finish.
- On Linux, I've gotten mixed results in the past. In my latest testing, I'm hitting a (driver? dying GPU?) bug where OpenCL operations hang forever and hang the GUI forever during normal operation or shutdown. I'd have to reboot to test further, and I've spent enough time gathering data that I don't want to delay this comment any longer.

(Besides the driver hang...) my idea is to have the UI thread cache the transformation used by dt_dev_distort_transform_plus locally, so it does not need to interact with multithreaded state to perform painting calculations. Really the ideal is to avoid sharing mutable state at all (worker threads perform image computations on state not accessed by the UI thread, and edits are sent via messages from the UI to a worker or by destroying/recreating a worker with the latest state).
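A rough sketch of the caching idea, assuming the UI keeps its own copy of the transform and refreshes it only when the shared lock is free. It uses `pthread_mutex_trylock()` so painting never blocks; `toy_pipe_state_t`, `toy_ui_cache_t` and the functions are hypothetical and far simpler than what `dt_dev_distort_transform_plus` computes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

typedef struct { float scale, dx, dy; } toy_transform_t;

/* Shared side: the mutex is a hypothetical analogue of history_mutex. */
typedef struct {
  pthread_mutex_t mutex;
  toy_transform_t transform; /* written only while the mutex is held */
} toy_pipe_state_t;

/* UI side: a private copy that is never touched by workers. */
typedef struct {
  toy_transform_t cached;
  bool valid;
} toy_ui_cache_t;

static void toy_pipe_state_init(toy_pipe_state_t *st, toy_transform_t t)
{
  pthread_mutex_init(&st->mutex, NULL);
  st->transform = t;
}

/* UI thread: refresh the cache only if the lock is free; never block.
   Returns true if the cache was refreshed. */
static bool toy_ui_refresh(toy_ui_cache_t *ui, toy_pipe_state_t *st)
{
  if(pthread_mutex_trylock(&st->mutex) != 0) return false;
  ui->cached = st->transform;
  ui->valid = true;
  pthread_mutex_unlock(&st->mutex);
  return true;
}

/* UI thread: map an x coordinate using only thread-local data. */
static float toy_ui_map_x(const toy_ui_cache_t *ui, float x)
{
  assert(ui->valid);
  return x * ui->cached.scale + ui->cached.dx;
}
```

When the refresh fails the UI keeps drawing with the previous transform, so overlays may be briefly out of date but the paint path never waits on a worker.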

Alternatively you could have the UI reject redraw operations while a render thread is holding a mutex (and reconfiguring the pipeline or whatnot).

jenshannoschwalm commented 2 months ago

Closing this as being fixed in master/4.8.1

Feel free to a) open another issue if still having problems or possibly b) open a PR to discuss or implement improvements on undo/history handling :-)

darktable-org / darktable

Image corruption when undoing/redoing with duplicated modules #16498
