darktable-org / darktable

darktable is an open source photography workflow application and raw developer
https://www.darktable.org
GNU General Public License v3.0
9.52k stars 1.13k forks source link

Image corruption when undoing/redoing with duplicated modules #16498

Closed nyanpasu64 closed 2 months ago

nyanpasu64 commented 5 months ago

Describe the bug

If I duplicate a module, then undo and redo certain operations, Darktable displays corrupted image contents in the navigation module and/or the main image view.

Steps to reproduce

Concurrency race condition

  1. Duplicate an effect module instance (for example "color calibration" or "diffuse or sharpen").
  2. Perform any action which leaves the duplicate copy enabled (eg. disable the original, disable an exposure module before the original and duplicate, etc.)
  3. Undo and redo the action (Ctrl+{Z, Y}) in quick succession.
    • I have not been able to replicate the bug by undoing and redoing the "duplicate instance" action (in 10-ish tests).
    • The bug does not appear (in 20+ tests) if the final undo/redo action leaves the duplicate copy inactive (even if the original is active). It also does not appear (in 20+ tests) if you wait for the first undo to finish before initiating a redo (or vice versa).

Around 20% of the time, the duplicate second (top) module will become "corrupted" in such a way that whenever it's enabled, the resulting image appears garbled and filled with random colors (similar to the ECB penguin). The corruption will generally appear in either the navigation module or the main image view (randomly-ish).

Disabling the original copy of the module does not fix the problem; disabling the copy will hide the problem until it's reenabled (tested in 10+ occurrences of the bug). Altering module parameters does not fix the corruption. Undoing or redoing ends the corruption permanently (until you perform the steps again).

Enabling a uniform blend mask for the glitched module (or any previous module) changes the output of the corruption (due to floating point rounding errors, or memory buffers?), and dragging the slider changes the appearance as well.

Logic error

A more serious variant of this bug is both reproducible deterministically (rather than being a timing bug with a random success rate), and persists across program restarts. It leaves the duplicate instance permanently active and corrupted, but hidden from the UI.

  1. Duplicate an instance (eg. "color calibration" or "exposure").
  2. Delete the original.
  3. Undo and redo deleting the original.
  4. Undo deleting the original again. (BUG: it does not appear in the active module list, and does not take effect!)
  5. Redo and undo deleting the original.

At this point, both the image and navigation preview appear corrupted. You cannot search for the duplicated module. Undoing or redoing hides the corruption, but if you close darktable while the corruption is active, the effects become "permanent".

Expected behavior

darktable should not corrupt images when performing undo/redo operations with duplicate modules.

Logfile | Screenshot | Screencast

darktable-log-race.txt: Triggering the race bug twice in the navigation module.

darktable-log-logic.txt: Triggering the logic error on two program runs (the first time it didn't produce visible corruption somehow), then restarting darktable and tweaking parameters.

Commit

No response

Where did you obtain darktable from?

downloaded from www.darktable.org

darktable version

darktable 4.6.1

What OS are you using?

Windows

What is the version of your OS?

Windows 11 22H2 (a bit outdated?)

Describe your system?

No response

Are you using OpenCL GPU in darktable?

Yes

If yes, what is the GPU card and driver?

RX 570 4GB, Adrenalin 23.11.1

Please provide additional context if applicable. You can attach files too, but might need to rename to .txt or .zip

jenshannoschwalm commented 5 months ago

@nyanpasu64 from my understanding this might be due to wrong pixel pipe cache data in the special pipe used for duplicate and others. I believe this is already fixed in master. Would you be able to test on master? A log with '-d pipe' would be sufficient.

Maybe I didn't test correctly but was not yet able to reproduce here.

nyanpasu64 commented 5 months ago

Still happens on the latest Windows nightly 20240326.

If I corrupt a "diffuse or sharpen" effect, parts of the image turn black.

darktable_JC5o6IeLVj With blend mask enabled, black areas have stripes with a period of 4 pixels (actually 5 pixels in my screenshot because I'm running at 125% DPI scaling? Does darktable not display image data using physical display pixels?), and resizing the window changes the offset of each row of pixels relative to the previous row. Non-black pixel data seems to be passed through a pseudo-random function (in 4-pixel blocks?).

darktable-log.txt

EDIT: If I instead pull off this bug on a duplicated Exposure effect (at 100% DPI scaling), three out of every 4 pixels are replaced with a solid color (which varies randomly as I move the exposure slider of the bugged module), while the remaining pixels show the underlying image more or less (whose brightness does not change with the exposure slider): darktable_Y6ojAGNRQK

With a small enough window or high enough zoom level (so few enough pixels to calculate), I get corrupted colors (with some limited periodicity of 4 "fat" pixels) rather than a solid color: darktable_Tt24rf8Rjr

EDIT2: Bug still occurs with OpenCL off.

nyanpasu64 commented 5 months ago

In my RelWithDebInfo Windows build of darktable, interestingly I'm able to reproduce this bug with near 100% success rate (by toggling lens correction then undo/redoing, but it's a no-op because my self-built Darktable can't find the lensfun databases).

When the graphical corruption appears on screen, it seems the pixel pipe worker (dt_dev_process_image_job -> ... _dev_pixelpipe_process_rec) is erroneously called with exposure.1 as the final list element in pipe->iop (the final module passed to the outermost call to _dev_pixelpipe_process_rec()). Not sure why it happens?

Oddly enough, when I undo and redo a single operation, darktable spawns three [full] pipes, but the first two exit early because dt_iop_breakpoint(dev, pipe) is set. The second and third pipes erroneously have exposure.1 as the final operation (rather than at its correct spot in operation history).

Panning around the image spawns new jobs sent to worker threads (resulting in a new call to _dev_pixelpipe_process_rec_and_backcopy() and a new gRender_pass value), but share the same incorrect order of pipe->iop.

Fork with added logs at https://github.com/darktable-org/darktable/compare/master...nyanpasu64:darktable:debug-undo-pipe.

nyanpasu64 commented 5 months ago

Sometimes [worker thread] dt_dev_pixelpipe_create_nodes() (on the full or preview pipe) will clone dev->iop while [main thread] _pop_undo() is mutating it.

If the (worker) clone occurs near the end of the (main) call to dt_dev_reload_history_items(), it will erroneously see most effects as disabled (including cloned ones), but will see the correct order of list nodes (with gamma at the end, rather than all cloned nodes). This does not produce visible on-screen corruption.

Maybe you need to lock the pipe during _pop_undo() and dt_dev_pixelpipe_create_nodes() (perhaps other functions too) so workers won't see an incorrect iop (or any other parameters too) while it's being initialized or later mutated?

jenshannoschwalm commented 5 months ago

Thanks a lot for this analysis. Will have a look into all this...

nyanpasu64 commented 5 months ago

If I understand correctly, the order of dev->iop is incorrect (ends with duplicate effects rather than actual processing order):

I'm not sure what all these functions are doing, but I'm pretty sure the worker threads shouldn't be reading dev while dt_dev_reload_history_items() is mucking with it.

How would you fix it? Keep in mind I very much do not understand the current code and locking well enough to state correct fixes to this problem.

github-actions[bot] commented 3 months ago

This issue has been marked as stale due to inactivity for the last 60 days. It will be automatically closed in 300 days if no update occurs. Please check if the master branch has fixed it and report again or close the issue.

nyanpasu64 commented 3 months ago

The bug still occurs on the latest nightly 4.7.0+1309~g40d9f8ec7b. Is there any progress towards it being fixed?

One new symptom I've observed is that sometimes the image will not appear visually fully corrupted, but merely have the wrong color cast. Repeating the undo/redo will fix the problem as before.

jenshannoschwalm commented 3 months ago

@TurboGit i finally found time to get more into this. Some things spotted in current code i'd like to discuss - maybe bells ringing on your side. BTW i also had issues like reported here

  1. In the logs there was this cannot get iop-order for diffuse instance 1 resulting in
    153,9791 process                   CPU [full]           gamma                  (  48/   0) 1303x 956 scale=0,1713 --> (  48/   0) 1303x 956 scale=0,1713 IOP_CS_RGB
    153,9832 process                   CL0 [full]           diffuse.1              (  48/   0) 1303x 956 scale=0,1713 --> (  48/   0) 1303x 956 scale=0,1713 IOP_CS_RGB

The bad boy seems to be in dt_ioppr_get_iop_order() - i had first instance deleted and here we have an error resulting in an order after gamma - INT_MAX

  1. Could you comment on this in undo.c ? Your commit on this says 'avoid race condition'
    
    #define LOCK \
    dt_pthread_mutex_lock(&self->mutex); self->locked = TRUE

define UNLOCK \

self->locked = FALSE; dt_pthread_mutex_unlock(&self->mutex)


I think the `self->locked = FALSE` must be after the mutex unlocking because otherwise the variable would be FALSE befor being unlocked ??? Couldn't we use a recursive mutex here ?

3. I was surprised about using the non-mutex-protected variant ´dt_undo_iterate_internal()` being used in history? Could you explain? 
TurboGit commented 3 months ago

About 2.

I'm not sure it would make a difference about the self->locked place. The only code in undo.c using self->locked is:

    if(!self->locked)
    {
      LOCK;

So if self->locked is FALSE the thread will try to LOCK, it will be blocked until the mutex is actually unlocked.

TurboGit commented 3 months ago

About 3.

Because the only place where dt_undo_iterate_internal() is called in inside dt_undo_iterate() and there we LOCK & UNLOCK around the call.

TurboGit commented 3 months ago

Ok, about 3 I was wrong it is also used in history.c. This is a refactoring done long time ago... Maybe we can use the locking version of this routine now?

jenshannoschwalm commented 3 months ago

Ok, about 3 I was wrong it is also used in history.c.

Sorry i wasn't more clear about my question - this is exactly what i was referring to.

nyanpasu64 commented 2 months ago

Image corruption seems fixed. I'm noticing UI lag when switching between lighttable/darkroom, or undoing/redoing. This may or may not be worse after merging the PR (I've gotten mixed results from testing before and after merging).

(Besides the driver hang...) my idea is to have the UI thread cache the transformation used by dt_dev_distort_transform_plus locally, so it does not need to interact with multithreaded state to perform painting calculations. Really the ideal is to avoid sharing mutable state at all (worker threads perform image computations on state not accessed by the UI thread, and edits are sent via messages from the UI to a worker or by destroying/recreating a worker with the latest state).

Alternatively you could have the UI reject redraw operations while a render thread is holding a mutex (and reconfiguring the pipeline or whatnot).

jenshannoschwalm commented 2 months ago

Closing this as being fixed in master/4.8.1

Feel free to a) open another issue if still having problems or possibly b) a PR to discuss or implement improvements on undo/history handling :-)