Closed dtorop closed 1 year ago
Question: the crash seems to be triggered by
We know from the backtrace
#4 _iop_color_picker_pickerdata_ready_callback (instance=0x5605a948d7a0, module=0x5605ad70f1e0, piece=0x7ff42c00a910, user_data=0x0) at /home/dtorop/src/darktable/src/gui/color_picker_proxy.c:285
That piece
is non-null. But do we know that piece->pipe
is non-null?
Also, looking through backtrace, thread 10 appears to be busy in _hist_worker()
of _collect_histogram_on_CPU()
called from _pixelpipe_process_on_CPU()
. It is working with piece=0x7ff42c02de10
while the colorpicker has received piece=0x7ff42c00a910
. Has the latter somehow been already deallocated?
And... The main GUI thread is segfaulting due to a signal presumably produced by the prior run of the preview pipe. Is this strange?
I'm trying to wrap my head around why _iop_color_picker_pickerdata_ready_callback()
is messing with the pixelpipe state, but it's not clear to me, despite the comment.
@jenshannoschwalm: if you have a moment, I'd be so curious if this is something you have insight.
I am likely of no big help on this but
That
piece
is non-null. But do we know thatpiece->pipe
is non-null?
Yes. 1) from what i know we construct piece data only and always with a pipe defined pipe 2) I am pretty sure i would have run into problems if that would not hold :-)
thread 10 appears to be busy in
_hist_worker()
of_collect_histogram_on_CPU()
called from_pixelpipe_process_on_CPU()
. It is working withpiece=0x7ff42c02de10
while the colorpicker has receivedpiece=0x7ff42c00a910
. Has the latter somehow been already deallocated?
I am also thinking about the PIPE_REMOVE
here. Don't know either so far. The cache invalidation is clear to me, this just ensures a reprocessing of the pipe so we have valid data for the internal histogram. I am not sure here if the piece might be from full pipe and not from preview???
2 comments:
_hist_worker
we dt_alloc_align()
a new hist buffer but don't set stats->bufsize
to zero?Fix cacheline re-use for 4.4
PR, are you sure you have this in your codebase? Otherwise there could be very strange issues
as the output buffer could be the same as input! VERY unlikely on GPU but could well be on CPU after invalidating a cacheline. Would also affect colorspace conversion of the moduleI am not sure here if the piece might be from full pipe and not from preview???
I'm pretty sure both are preview pipe:
Histogram is only requested by colorzones, rgbcurve, or tonecurve, all via code which sets piece->request_histogram |= DT_REQUEST_ON;
when in preview pipe.
Picker is only processed when _request_color_pick()
is true, which depends on pipe == dev->preview_pipe
So then why in thread 10 do we see:
#6 _collect_histogram_on_CPU (pipe=0x5605ab91f400, input=0x7ff30e71f040, roi_in=<optimized out>, module=0x5605ad70f1e0, piece=0x7ff42c02de10, pixelpipe_flow=0x7ff445fe2e30, dev=<optimized out>) at /home/dtorop/src/darktable/src/develop/pixelpipe_hb.c:1120
But in thread 1 we see:
#4 _iop_color_picker_pickerdata_ready_callback (instance=0x5605a948d7a0, module=0x5605ad70f1e0, piece=0x7ff42c00a910, user_data=0x0) at /home/dtorop/src/darktable/src/gui/color_picker_proxy.c:285
picker = 0x5605ae0f7ba0
These have the same module=0x5605ad70f1e0
but a different piece address. The sequence seems to be:
_pixelpipe_process_on_CPU()
runs for preview pipe_pixelpipe_process_on_CPU()
does color picking via _pixelpipe_picker()
_pixelpipe_process_on_CPU()
raises DT_SIGNAL_CONTROL_PICKERDATA_READY
with piece=...0a910
_pixelpipe_process_on_CPU()
continues on its way, returns, pixelpipe completes_pixelpipe_process_on_CPU()
which calls _collect_histogram_on_CPU()
with piece=...2de10
DT_SIGNAL_CONTROL_PICKERDATA_READY
signal and calls _iop_color_picker_pickerdata_ready_callback()
with module=piece=...0a910
So does something happen in step 5 which could potentially deallocate/reallocate the piece
for the tonecurve iop, so that piece
which _iop_color_picker_pickerdata_ready_callback()
receives is no longer a valid pointer, and hence piece->pipe->changed
segfaults?
- Don't know how relevant this is but in
_hist_worker
wedt_alloc_align()
a new hist buffer but don't setstats->bufsize
to zero?
I think this is OK. In _hist_worker
we're potentially growing a long-running buffer, but never clearing it. The buf_size
should start at zero when it is calloc'd as part of dt_dev_pixelpipe_iop_t
in dt_dev_pixelpipe_create_nodes()
. I will grant you that this isn't so clear, and I could be wrong. Certainly other parts of the newly allocated piece
are explicitly set to 0.
- Lately in 4.4 we had the
Fix cacheline re-use for 4.4
PR, are you sure you have this in your codebase?
It is, 5e15836e7a0a9f3ec008bf0b90bea6c4923c23de is included in my branch.
I am also thinking about the
PIPE_REMOVE
here.
When dt_dev_pixelpipe_change()
is called with DT_DEV_PIPE_REMOVE
it will cleanup and recreate all the nodes in the pipe.
Is it possible that if there is a backlog of DT_SIGNAL_CONTROL_PICKERDATA_READY
signals, that the first call to _iop_color_picker_pickerdata_ready_callback()
causes the all the nodes to be recreated, but that then a second call queued up call to _iop_color_picker_pickerdata_ready_callback()
is passed the now unallocated piece, hence segfault? I could imagine this happening if the GUI was lagging a bit... Now that I think about it, I was playing audio files while I was working, and darktable + music often causes things to stutter on this computer.
If so, I'm not totally sure if there is a way for the callback to guarantee that piece
points to currently allocated memory... Maybe the work of
piece->pipe->changed |= DT_DEV_PIPE_REMOVE;
piece->pipe->cache_obsolete = TRUE;
needs to happen before the signal is raised, if it is necessary at all to do. I'm still a bit fuzzy on what work is actually getting done by this.
Can't check as only phone for the weekend but your point about re-creating all nodes seems critical here indeed and your analysis seems to be fully correct.
Have you tried without that signalling? Recreating nodes should not be required, avoiding bad piece data should be enough.
In eee79cd6b42a8baba71a3bf9d62b3a808aaa799d I made these changes to _iop_color_picker_pickerdata_ready_callback()
:
- dt_develop_t *dev = module->dev;
- if(!picker || !dev) return;
+ if(!picker) return;
- dev->preview_pipe->changed |= DT_DEV_PIPE_REMOVE;
- dev->preview_pipe->cache_obsolete = 1;
+ piece->pipe->changed |= DT_DEV_PIPE_REMOVE;
+ piece->pipe->cache_obsolete = 1;
With this note:
When applying an iop color picker, we should always be in the preview pipe, and hence piece->pipe == dev->preview_pipe. So don't need to figure out dev from module -- saves one reference/test.
It seems reasonable that my "optimization" is downright wrong, and @TurboGit's original code in ab795dd66a0c666b3b7db16f722165ea27727942 was the right way to do things. Checking out #6765 which added that code, and in particular the discussion in #6360, it looks like we've been down this path before...
If this reasoning is correct, then reverting the diffs above (but not the entire commit!) would be the way to go.
I do wish I could replicate this bug. I tried
modified src/gui/color_picker_proxy.c
@@ -282,6 +282,7 @@ static void _iop_color_picker_pickerdata_ready_callback(gpointer instance, dt_io
// modules between colorin & colorout may need the work_profile
// to work properly. This will force colorin to be run and it
// will set the work_profile if needed.
+ sleep(1);
piece->pipe->changed |= DT_DEV_PIPE_REMOVE;
piece->pipe->cache_obsolete = TRUE;
In the hopes that it would trigger a crash by giving the pixelpipe a chance to get ahead of the GUI thread. Alas, no crash. In darkroom.c expose()
, the call to dt_dev_process_preview()
triggers the pixelpipe thread, and hence even though that thread is running in the background, it means that the preview pipe never gets more than one run ahead of colorpicker. But even then the piece
may be deallocated?
I can confirm that each call to _iop_color_picker_pickerdata_ready_callback()
results in piece
having a different value, as dt_dev_pixelpipe_change()
is called in the interim with DT_DEV_PIXELPIPE_REMOVE
and recreates all nodes. Perhaps it's a just utterly random when a reference to free'd memory causes a segfault, exacerbated by memory pressure?
Recreating nodes should not be required, avoiding bad piece data should be enough.
I still can't entirely wrap my head around this code, despite the comment and reading the past discussion. [A key bit of explanation seems to be this from @dterrahe:
It makes sense that if we expect to receive a signal while the pipe is being processed then taking shortcuts and skipping part of the pipe breaks things. Colorpickers will randomly work/not work depending on the location in the pipe.
(https://github.com/darktable-org/darktable/issues/6360#issuecomment-698422212)
followed by this from @TurboGit:
The hash calculation seems ok but the pipe is not properly synchronized with the color picker. That is, when you click on the pipe, the area is drawn and the hash is calculated (using the module params + the 4 box corners) but using the not yet adjusted params after having received the sample from the picker.
So the hash is calculated and stored, it will be reused (the whole point of the cache) when clicking again on the picker.... But as this did not store into the cache the new adjusted pixels nothing happen the second time.
We would probably need to avoid adjusting the cache until the params have been received and used by the module.
(https://github.com/darktable-org/darktable/issues/6360#issuecomment-699463291)
@jenshannoschwalm: is there a better way to do this? I know this overlaps with some of your recent work.
// Invalidate the cache to ensure it will be fully recomputed.
// modules between colorin & colorout may need the work_profile
// to work properly. This will force colorin to be run and it
// will set the work_profile if needed.
piece->pipe->changed |= DT_DEV_PIPE_REMOVE;
piece->pipe->cache_obsolete = TRUE;
Thinking this out a bit more: If the reasoning above is correct, then the fix would be:
modified src/gui/color_picker_proxy.c
@@ -276,14 +276,15 @@ static void _iop_color_picker_pickerdata_ready_callback(gpointer instance, dt_io
{
// an iop colorpicker receives new data from the pixelpipe
dt_iop_color_picker_t *picker = darktable.lib->proxy.colorpicker.picker_proxy;
- if(!picker) return;
+ dt_develop_t *dev = module->dev;
+ if(!picker || !dev) return;
// Invalidate the cache to ensure it will be fully recomputed.
// modules between colorin & colorout may need the work_profile
// to work properly. This will force colorin to be run and it
// will set the work_profile if needed.
- piece->pipe->changed |= DT_DEV_PIPE_REMOVE;
- piece->pipe->cache_obsolete = TRUE;
+ dev->preview_pipe->changed |= DT_DEV_PIPE_REMOVE;
+ dev->preview_pipe->cache_obsolete = TRUE;
// iops only need new picker data if the pointer has moved
if(_record_point_area(picker))
But if piece
is no longer valid, then we get in trouble later in the function:
if(!module->blend_data || !blend_color_picker_apply(module, picker->colorpick, piece))
if(module->color_picker_apply)
module->color_picker_apply(module, picker->colorpick, piece);
Because here we are using a possibly free'd piece
yet again.
There's a lot of code, between blend_color_picker_apply()
and the various color_picker_apply()
functions which depend on a valid piece
! Generally it seems to be used to figure out the current pipe and profile. Is it reasonable (and possible) to rewrite all of this to not depend on piece
?
Below is a rundown of the uses of piece
.
In blend_color_picker_apply()
, we have:
dt_ioppr_get_pipe_current_profile_info(piece->module, piece->pipe)
And it is used in color_picker_apply
for rgbcurve:
const dt_iop_order_iccprofile_info_t *const work_profile = dt_ioppr_get_pipe_work_profile_info(piece->pipe);
for colorbalancergb:
pipe_RGB_to_Ych(self, piece, (const float *)self->picked_color, Ych);
pipe_RGB_to_Ych(self, piece, (const float *)self->picked_color_max, max_Ych);
for basicadj:
const dt_iop_order_iccprofile_info_t *const work_profile = dt_ioppr_get_pipe_current_profile_info(self, piece->pipe);
for exposure
_auto_set_exposure(self, piece->pipe);
for channelmixerrgb
_auto_set_illuminant(self, piece->pipe);
I read this with great interest and tried to follow on the phone screen.
We have to make sure valid cache data. For that we would only need invalidation cache data for modules later than the one questioning it. BTW there is a pending pr simplifying that.
The reason for REMOVE is not clear to me, neither the full implications. But, if that forces a re-creating of all pipe nodes and thus pieces via a signal, that would be asynchronous and non-deterministic thus the picker could run wild.
The reason for REMOVE is not clear to me, neither the full implications. But, if that forces a re-creating of all pipe nodes and thus pieces via a signal, that would be asynchronous and non-deterministic thus the picker could run wild.
Indeed! That would be a situation to avoid.
Another illuminating comment from https://github.com/darktable-org/darktable/issues/6360#issuecomment-698324513:
When re-enabling the picker, I could just use the cached previous result, which would make it pass simple tests, but of course it would just contain stale data.
Looking at the changes in #5576 suggested maybe adding a call to
dt_iop_refresh_center
in_iop_color_picker_callback_button_press
would work, but it didn't make a difference. Is there another way to invalidate/force recalculation of the pipe for a particular module?Otherwise maybe a simple extra check within the pipe whether a colorpicker is active on the module to be skipped might be sufficient to fix this. At potential the cost of subsequent unnecessary recalcs; we only need the first one, and after any picker selection changes (which seems to get triggered correctly).
Boldface/italics in 3rd paragraph is mine -- even if the logic of extant code were potentially suspect, the subsequent unnecessary recalcs are also a problem.
A big-picture suggestion (and I know this is a jump from a crash/backtrace which I can't repeat): The pixelpipe cache is great at providing buffers for and retrieving the output of steps of the pixelpipe. But colorpicker and histogram results for each step are not stored (though the colorpicker request is part of the hash). We store dt_iop_colorspace_type_t
for each hash line, but not profile.
So long as colorpicker/histogram/profile data are not stored in pixelpipe cache, we have to jump through hoops (like the DT_DEV_PIPE_REMOVE
hack) to make sure that this data is current. If this data is stored in the cache, then we can use the existing hash mechanism to guarantee it is valid, or re-run as much of the pixelpipe as needed to return that data.
What if the pixelpipe cache were expanded to store, for each iop that is run:
It wouldn't be many additional bytes per hash line. We already have cache->data
and cache->dsc
, this would create three more buffers attached to each line. Two of these would be fixed size per entry (colorpicker results and colorpicker). Histogram could be fixed to the maximum histogram data size, or could be alloc'ed and free'd when needed (which wouldn't be often, there's a maximum of one histogram (and one colorpicker) per pixelpipe run.
I haven't thought through clearly what the mechanism would be for the colorpicker/scopes to retrieve the data from the hash. Perhaps the results would need to end up accessible via the appropriate dt_dev_pixelpipe_t
-- which is sort of the current situation, except that we can't pass a pointer directly into a particular node, which might not be extant.
While and after reading the source it gets even more unclear to me why we do DT_DEV_PIPE_REMOVE
in this situation at all!
This should only be done if there are pieces/modules added or removed from the pipe. And this is not the case here.
A note about cache content, yes we provide output data but in all cases with expanded module/piece header we make sure the output of the preceding module is cached with high importance as we want to keep that for fast recalculation.
Let's wait for #14697 to be merged as it reduces complexity and come back to this. From my current understanding in this code
// Invalidate the cache to ensure it will be fully recomputed.
// modules between colorin & colorout may need the work_profile
// to work properly. This will force colorin to be run and it
// will set the work_profile if needed.
piece->pipe->changed |= DT_DEV_PIPE_REMOVE;
piece->pipe->cache_obsolete = TRUE;
it might be sufficient to just invalidate the cache for the module-in-question and later. If not, we would go for including color profile in the cache. ok?
@jenshannoschwalm: This makes sense. Also, after more thought:
piece
in color_picker_apply
is to get the pipe's work profile. We know that all pickers run in the preview pipe, hence we can retrieve this directly from dev->preview_pipe
.piece->pipe
in color_picker_apply
functions can be replace by dev->preview_pipe
.Let's wait for #14697 to be merged as it reduces complexity and come back to this.
Makes sense. I am curious to see if there's a simple way to entirely remove piece
entirely from the DT_SIGNAL_CONTROL_PICKERDATA_READY
data. If that works and it's orthogonal to #14697...
it might be sufficient to just invalidate the cache for the module-in-question and later.
Using piece->module->request_color_pick
in the pixelpipe cache hash calc was already present in ab795dd66a0c666b3b7db16f722165ea27727942 when the REMOVE
code went in. Shouldn't that have been enough to force the module to recompute? It seems like setting DT_DEV_PIPE_REMOVE
and cache_obsolete
is too big a hammer.
Incidentally, right now the color picker output is stored in dt_iop_module_t
as picked_color
... variables. As there well only be one per-iop colorpicker active at one time, this could jump up a level and be stored in dt_dev_pixelpipe_iop_t
. I'm not clear if there's a benefit for that.
I think I was too grandiose in suggesting adding histogram (maybe solved by your #14166) and colorpicker to pixelpipe cache lines. I realize that the purpose of the pixelpipe cache is to pass valid input to the next step in the pipe (ideally skipping reprocessing if valid data is in the cache). So the work is is simply for the cache to be aware when each step needs a recalculation (because input has changed, changed iop parameters, or a per-iop colorpicker/histogram request being changed).
It seems like setting DT_DEV_PIPE_REMOVE and cache_obsolete is too big a hammer.
I fully agree. There were issues with incorrect hashes now fixed. So maybe that extra hammer isn't needed any more. There are more code parts that wait for re-evaluation, likely more use of the invalidate-later strategy is possible.
piece
, and appears to work.Unfortunately, simply removing:
pipe->changed |= DT_DEV_PIPE_REMOVE;
pipe->cache_obsolete = TRUE;
doesn't work. In that case #6360 returns. The same reproduce steps from there still work:
OTOH the picker in iops like RGB curve and tone curve work noticeably more smoothly with the REMOVE
code gone. So I'd be eager to see if #14697 makes it easier to invalidate the cache entry and force reprocessing when necessary.
I have seen the PR and did a first sight. Your "unfortunately results" are to be expected as you well know.
The relevant function would be
void dt_dev_pixelpipe_cache_invalidate_later(
const struct dt_dev_pixelpipe_t *pipe,
const struct dt_iop_module_t *module)
where module
is only used to get the iop_order. Maybe that function should change to
void dt_dev_pixelpipe_cache_invalidate_later(
const struct dt_dev_pixelpipe_t *pipe,
const uint_32t order)
So we could easily implement a function invalidating cachelines for all pipes?
Apologies, I didn't write back earlier.
where
module
is only used to get the iop_order
My inclination would be to keep the signature
void dt_dev_pixelpipe_cache_invalidate_later(
const struct dt_dev_pixelpipe_t *pipe,
const struct dt_iop_module_t *module)
As is hides complexity from the caller, who in most cases will only need to write
dt_dev_pixelpipe_cache_invalidate_later(piece->pipe, self);
rather than
dt_dev_pixelpipe_cache_invalidate_later(piece->pipe, self->iop_order);
Few around here will describe themselves as fully understand the pixelpipe! (Though it's gotten easier to make sense of with recent changes.) Hiding internals a bit seems helpful for those interacting with it from other parts of dt...
So we could easily implement a function invalidating cachelines for all pipes?
That sounds great. Though we really only need to invalidate for preview pipe in the case of colorpicker.
I'm also curious if invalidating cache lines is really the right approach for making the picker work after reset. It seems like there are two sorts of pickers: One is the curve-style picker which gives a continuous readout, the other is the onr-timr action pickers (e.g. in filmic) which reads an area of the buffer in order to trigger "auto" changes to iop parameters. We don't want to sabotage performance of the pipe in the former case (by invalidating perfectly fine cache lines) to make sure that the "auto" pickers work.
I will look through the colorpicker code to try to understand this a bit better...
I really would prefer the iop_order
argument here, might be better/safer for invalidating all pipes.
might be better/safer for invalidating all pipes.
That sounds reasonable...
Describe the bug
When modifying an image with a module's color picker active, darktable crashed.
Steps to reproduce
As I recall:
I have attempted to follow these steps again, and have not yet been able to reproduce
Expected behavior
darktable should not crash
Logfile | Screenshot | Screencast
darktable_bt_TCMZ61.txt
The segfault occured here:
Commit
No response
Where did you install darktable from?
darktable.org
darktable version
4.3.0+2529~g94661332fe-dirty
What OS are you using?
Linux
What is the version of your OS?
Debian trixie/sid
Describe your system?
RAM: 32GB Wayland (!) GTK 2.24.33 Nvidia GeForce RTX 3080 Mobile, 8GB NVidia driver 525.105.17 but OpenCL was off during the crash (this system loses track of OpenCL when I suspend/resume the computer a few times, and requires reboot to re-enable OpenCL)
Are you using OpenCL GPU in darktable?
No
If yes, what is the GPU card and driver?
No response
Please provide additional context if applicable. You can attach files too, but might need to rename to .txt or .zip
The code I was using during the crash was compiled from the PR for #14711, which sits atop git master as of 14b1919c6836be4107204e9ef0e462f0e7964991. I believe that even though the version # identifies this as 4.3.0+2529~g94661332fe-dirty, that this is really 98b30927f0d67b14d2ec947ac584a2a1ff29cef8 but I compiled before committing that code.
I was editing a very large image (from GFX100S, 100 megapixel).
I don't have the XMP due to the infamous #7830.
I will keep an eye out for this to happen again and see if I can reproduce. But wanted to make the bug report regardless as at least I have a backtrace.