Closed nyanpasu64 closed 2 months ago
@nyanpasu64 from my understanding this might be due to wrong pixel pipe cache data in the special pipe used for duplicate and others. I believe this is already fixed in master. Would you be able to test on master? A log with '-d pipe' would be sufficient.
Maybe I didn't test correctly but was not yet able to reproduce here.
Still happens on the latest Windows nightly 20240326.
If I corrupt a "diffuse or sharpen" effect, parts of the image turn black.
With blend mask enabled, black areas have stripes with a period of 4 pixels (actually 5 pixels in my screenshot because I'm running at 125% DPI scaling? Does darktable not display image data using physical display pixels?), and resizing the window changes the offset of each row of pixels relative to the previous row. Non-black pixel data seems to be passed through a pseudo-random function (in 4-pixel blocks?).
EDIT: If I instead pull off this bug on a duplicated Exposure effect (at 100% DPI scaling), three out of every 4 pixels are replaced with a solid color (which varies randomly as I move the exposure slider of the bugged module), while the remaining pixels show the underlying image more or less (whose brightness does not change with the exposure slider):
With a small enough window or high enough zoom level (so few enough pixels to calculate), I get corrupted colors (with some limited periodicity of 4 "fat" pixels) rather than a solid color:
EDIT2: Bug still occurs with OpenCL off.
In my RelWithDebInfo Windows build of darktable, interestingly, I'm able to reproduce this bug with a near-100% success rate (by toggling lens correction then undoing/redoing; the toggle is a no-op because my self-built darktable can't find the lensfun databases).
When the graphical corruption appears on screen, it seems the pixel pipe worker (`dt_dev_process_image_job -> ... _dev_pixelpipe_process_rec`) is erroneously called with `exposure.1` as the final list element in `pipe->iop` (the final module passed to the outermost call to `_dev_pixelpipe_process_rec()`). Not sure why it happens?

Oddly enough, when I undo and redo a single operation, darktable spawns three `[full]` pipes, but the first two exit early because `dt_iop_breakpoint(dev, pipe)` is set. The second and third pipes erroneously have `exposure.1` as the final operation (rather than at its correct spot in the operation history). `dt_dev_pixelpipe_process()` runs `GList *modules = g_list_last(pipe->iop)` with the wrong operation order (`exposure.1` on top). Panning around the image spawns new jobs sent to worker threads (resulting in a new call to `_dev_pixelpipe_process_rec_and_backcopy()` and a new `gRender_pass` value), but they share the same incorrect order of `pipe->iop`.
Fork with added logs at https://github.com/darktable-org/darktable/compare/master...nyanpasu64:darktable:debug-undo-pipe.
Sometimes [worker thread] `dt_dev_pixelpipe_create_nodes()` (on the full or preview pipe) will clone `dev->iop` while [main thread] `_pop_undo()` is mutating it. `_pop_undo()` assigns `dev->iop = iop_temp;` (according to C11+ UB rules this probably needs to be a relaxed atomic store?)... `_pop_undo() -> dt_dev_reload_history_items()` is mutating `dev->iop`, and `dt_dev_reload_history_items()` is performing this mutation (I'd probably insert more prints to narrow down what code is clashing with the worker thread). Unfortunately I may not be able to debug further tomorrow, due to life obligations.

`pipe->iop = g_list_copy(dev->iop)` returns an incorrect order if it executes while the main thread is running `dt_dev_reload_history_items()`. If the (worker) clone occurs near the end of the (main) call to `dt_dev_reload_history_items()`, it will erroneously see most effects as disabled (including cloned ones), but will see the correct order of list nodes (with gamma at the end, rather than all cloned nodes). This does not produce visible on-screen corruption.

Maybe you need to lock the pipe during `_pop_undo()` and `dt_dev_pixelpipe_create_nodes()` (perhaps other functions too), so workers won't see an incorrect iop (or any other parameters) while it's being initialized or later mutated?
Thanks a lot for this analysis. Will have a look into all this...
If I understand correctly, the order of `dev->iop` is incorrect (ends with duplicate effects rather than actual processing order) after:

- `dt_dev_reload_history_items() → dt_dev_pop_history_items(dev, 0) → dt_dev_pop_history_items_ext → dt_ioppr_check_duplicate_iop_order`,
- `dt_dev_reload_history_items → dt_dev_read_history[_ext] → dt_ioppr_resync_modules_order()`.

I'm not sure what all these functions are doing, but I'm pretty sure the worker threads shouldn't be reading `dev` while `dt_dev_reload_history_items()` is mucking with it.
Among the readers of `dev`:

- `dt_dev_pixelpipe_process()` reads `modules = pipe->iop`,
- `_dev_pixelpipe_process_rec` reads `modules->data->...`, which is shared with `dev->iop`,
- `dt_dev_pixelpipe_create_nodes()` (which cares about the list's order) needs to be protected (and you hope the data is initialized by the time the pipe actually gets processed).

How would you fix it? Keep in mind I very much do not understand the current code and locking well enough to state correct fixes to this problem.
Some options:

- Wrap `dev` in a RW lock; the workers read-lock it when reading module order/properties, and the main thread write-locks it upon undo/redo (and GUI operations?).
- If the main thread write-locks `dev` and any worker can't read-lock (since you want full and preview to render concurrently), the worker early-exits (similarly to when `dt_iop_breakpoint(dev, pipe)` is true)?
- If a worker read-locks `dev`, and `dt_dev_reload_history_items()` (or some other code) wants to write to it (but fails to try-write-lock), it sets a flag for all workers to exit before blocking on a write-lock?
- Copy `dev` before sending it to workers (and never mutate this thread-shared data again), instead of mutating the one being read?
This issue has been marked as stale due to inactivity for the last 60 days. It will be automatically closed in 300 days if no update occurs. Please check if the master branch has fixed it and report again or close the issue.
The bug still occurs on the latest nightly 4.7.0+1309~g40d9f8ec7b. Is there any progress towards it being fixed?
One new symptom I've observed is that sometimes the image will not appear visually fully corrupted, but merely have the wrong color cast. Repeating the undo/redo will fix the problem as before.
@TurboGit I finally found time to get more into this. Some things spotted in the current code I'd like to discuss - maybe bells are ringing on your side. BTW I also had issues like reported here:

```
cannot get iop-order for diffuse instance 1
```

resulting in

```
153,9791 process CPU [full] gamma ( 48/ 0) 1303x 956 scale=0,1713 --> ( 48/ 0) 1303x 956 scale=0,1713 IOP_CS_RGB
153,9832 process CL0 [full] diffuse.1 ( 48/ 0) 1303x 956 scale=0,1713 --> ( 48/ 0) 1303x 956 scale=0,1713 IOP_CS_RGB
```

1. The bad boy seems to be in `dt_ioppr_get_iop_order()`: I had the first instance deleted, and here we get an error resulting in an order after gamma (`INT_MAX`).
2. The locking macros in `undo.c`:

```c
#define LOCK \
  dt_pthread_mutex_lock(&self->mutex); self->locked = TRUE

#define UNLOCK \
  self->locked = FALSE; dt_pthread_mutex_unlock(&self->mutex)
```

I think the `self->locked = FALSE` must be after the mutex unlocking, because otherwise the variable would be FALSE before being unlocked??? Couldn't we use a recursive mutex here?

3. I was surprised that the non-mutex-protected variant `dt_undo_iterate_internal()` is used in history - could you explain?
About 2.

I'm not sure the placement of `self->locked` would make a difference. The only code in `undo.c` using `self->locked` is:

```c
if(!self->locked)
{
  LOCK;
```

So if `self->locked` is FALSE, the thread will try to LOCK and will be blocked until the mutex is actually unlocked.
About 3.

Because the only place where `dt_undo_iterate_internal()` is called is inside `dt_undo_iterate()`, and there we LOCK & UNLOCK around the call.

Ok, about 3 I was wrong: it is also used in `history.c`. This is a refactoring done a long time ago... Maybe we can use the locking version of this routine now?
> Ok, about 3 I was wrong: it is also used in `history.c`.

Sorry I wasn't more clear about my question - this is exactly what I was referring to.
Image corruption seems fixed. I'm now noticing UI lag when switching between lighttable/darkroom, or when undoing/redoing. This may or may not be worse after merging the PR (I've gotten mixed results from testing before and after merging).

The lag appears to revolve around `history_mutex`:

- threads holding `history_mutex` and waiting on GTK operations,
- `dt_dev_distort_transform_plus` trying to lock `history_mutex` (IDK if this PR holds it for longer at a time now),
- `dt_dev_pop_history_items -> dt_pthread_mutex_lock(&dev->history_mutex)` blocking while the workers never finish.

(Besides the driver hang...) my idea is to have the UI thread cache the transformation used by `dt_dev_distort_transform_plus` locally, so it does not need to interact with multithreaded state to perform painting calculations. Really the ideal is to avoid sharing mutable state at all: worker threads perform image computations on state not accessed by the UI thread, and edits are sent via messages from the UI to a worker, or by destroying/recreating a worker with the latest state.

Alternatively, you could have the UI reject redraw operations while a render thread is holding a mutex (and reconfiguring the pipeline or whatnot).
Closing this as fixed in master/4.8.1.

Feel free to a) open another issue if you're still having problems, or possibly b) open a PR to discuss or implement improvements on undo/history handling :-)
Describe the bug
If I duplicate a module, then undo and redo certain operations, Darktable displays corrupted image contents in the navigation module and/or the main image view.
Steps to reproduce
Concurrency race condition
Around 20% of the time, the duplicate second (top) module will become "corrupted" in such a way that whenever it's enabled, the resulting image appears garbled and filled with random colors (similar to the ECB penguin). The corruption will generally appear in either the navigation module or the main image view (randomly-ish).
Disabling the original copy of the module does not fix the problem; disabling the copy will hide the problem until it's reenabled (tested in 10+ occurrences of the bug). Altering module parameters does not fix the corruption. Undoing or redoing ends the corruption permanently (until you perform the steps again).
Enabling a uniform blend mask for the glitched module (or any previous module) changes the output of the corruption (due to floating point rounding errors, or memory buffers?), and dragging the slider changes the appearance as well.
Logic error
A more serious variant of this bug is both reproducible deterministically (rather than being a timing bug with a random success rate), and persists across program restarts. It leaves the duplicate instance permanently active and corrupted, but hidden from the UI.
At this point, both the image and navigation preview appear corrupted. You cannot search for the duplicated module. Undoing or redoing hides the corruption, but if you close darktable while the corruption is active, the effects become "permanent".
…(with `1` in the duplicate's name), it may already be switched off. …(with `1` after the name) will appear and function like normal.

Expected behavior
darktable should not corrupt images when performing undo/redo operations with duplicate modules.
Logfile | Screenshot | Screencast
darktable-log-race.txt: Triggering the race bug twice in the navigation module.
darktable-log-logic.txt: Triggering the logic error on two program runs (the first time it didn't produce visible corruption somehow), then restarting darktable and tweaking parameters.
Commit
No response
Where did you obtain darktable from?
downloaded from www.darktable.org
darktable version
darktable 4.6.1
What OS are you using?
Windows
What is the version of your OS?
Windows 11 22H2 (a bit outdated?)
Describe your system?
No response
Are you using OpenCL GPU in darktable?
Yes
If yes, what is the GPU card and driver?
RX 570 4GB, Adrenalin 23.11.1
Please provide additional context if applicable. You can attach files too, but might need to rename to .txt or .zip