darktable-org / darktable

darktable is an open source photography workflow application and raw developer
https://www.darktable.org
GNU General Public License v3.0
9.82k stars 1.14k forks source link

darkroom: crash when using tone curve with color picker #14812

Closed dtorop closed 1 year ago

dtorop commented 1 year ago

Describe the bug

When modifying an image with a module's color picker active, darktable crashed.

Steps to reproduce

As I recall:

  1. Open unmodified image in darkroom view.
  2. Set to ISO 12646 mode.
  3. Adjust a few iops (exposure, sigmoid, color calibration, color balance rgb, local contrast)
  4. Activate tone curve iop and make initial adjustment
  5. Activate tone curve's color point picker and move its location in the center area
  6. Continue to adjust tone curve
  7. darktable crashes

I have attempted to follow these steps again, and have not yet been able to reproduce

Expected behavior

darktable should not crash

Logfile | Screenshot | Screencast

darktable_bt_TCMZ61.txt

The segfault occured here:

#4  _iop_color_picker_pickerdata_ready_callback (instance=0x5605a948d7a0, module=0x5605ad70f1e0, piece=0x7ff42c00a910, user_data=0x0) at /home/dtorop/src/darktable/src/gui/color_picker_proxy.c:285
        picker = 0x5605ae0f7ba0

Commit

No response

Where did you install darktable from?

darktable.org

darktable version

4.3.0+2529~g94661332fe-dirty

What OS are you using?

Linux

What is the version of your OS?

Debian trixie/sid

Describe your system?

RAM: 32GB Wayland (!) GTK 2.24.33 Nvidia GeForce RTX 3080 Mobile, 8GB NVidia driver 525.105.17 but OpenCL was off during the crash (this system loses track of OpenCL when I suspend/resume the computer a few times, and requires reboot to re-enable OpenCL)

Are you using OpenCL GPU in darktable?

No

If yes, what is the GPU card and driver?

No response

Please provide additional context if applicable. You can attach files too, but might need to rename to .txt or .zip

The code I was using during the crash was compiled from the PR for #14711, which sits atop git master as of 14b1919c6836be4107204e9ef0e462f0e7964991. I believe that even though the version # identifies this as 4.3.0+2529~g94661332fe-dirty, that this is really 98b30927f0d67b14d2ec947ac584a2a1ff29cef8 but I compiled before committing that code.

I was editing a very large image (from GFX100S, 100 megapixel).

I don't have the XMP due to the infamous #7830.

I will keep an eye out for this to happen again and see if I can reproduce. But wanted to make the bug report regardless as at least I have a backtrace.

dtorop commented 1 year ago

Question: the crash seems to be triggered by

https://github.com/darktable-org/darktable/blob/1c8908fcff46135849dc6f43d4728bddbc9c8108/src/gui/color_picker_proxy.c#L285

We know from the backtrace

#4  _iop_color_picker_pickerdata_ready_callback (instance=0x5605a948d7a0, module=0x5605ad70f1e0, piece=0x7ff42c00a910, user_data=0x0) at /home/dtorop/src/darktable/src/gui/color_picker_proxy.c:285

That piece is non-null. But do we know that piece->pipe is non-null?

Also, looking through backtrace, thread 10 appears to be busy in _hist_worker() of _collect_histogram_on_CPU() called from _pixelpipe_process_on_CPU(). It is working with piece=0x7ff42c02de10 while the colorpicker has received piece=0x7ff42c00a910. Has the latter somehow been already deallocated?

And... The main GUI thread is segfaulting due to a signal presumably produced by the prior run of the preview pipe. Is this strange?

I'm trying to wrap my head around why _iop_color_picker_pickerdata_ready_callback() is messing with the pixelpipe state, but it's not clear to me, despite the comment.

@jenshannoschwalm: if you have a moment, I'd be so curious if this is something you have insight.

jenshannoschwalm commented 1 year ago

I am likely of no big help on this but

That piece is non-null. But do we know that piece->pipe is non-null?

Yes. 1) from what i know we construct piece data only and always with a pipe defined pipe 2) I am pretty sure i would have run into problems if that would not hold :-)

thread 10 appears to be busy in _hist_worker() of _collect_histogram_on_CPU() called from _pixelpipe_process_on_CPU(). It is working with piece=0x7ff42c02de10 while the colorpicker has received piece=0x7ff42c00a910. Has the latter somehow been already deallocated?

I am also thinking about the PIPE_REMOVE here. Don't know either so far. The cache invalidation is clear to me, this just ensures a reprocessing of the pipe so we have valid data for the internal histogram. I am not sure here if the piece might be from full pipe and not from preview???

2 comments:

  1. Don't know how relevant this is but in _hist_worker we dt_alloc_align() a new hist buffer but don't set stats->bufsize to zero?
  2. Lately in 4.4 we had the Fix cacheline re-use for 4.4 PR, are you sure you have this in your codebase? Otherwise there could be very strange issues as the output buffer could be the same as input! VERY unlikely on GPU but could well be on CPU after invalidating a cacheline. Would also affect colorspace conversion of the module
dtorop commented 1 year ago

I am not sure here if the piece might be from full pipe and not from preview???

I'm pretty sure both are preview pipe:

So then why in thread 10 do we see:

#6  _collect_histogram_on_CPU (pipe=0x5605ab91f400, input=0x7ff30e71f040, roi_in=<optimized out>, module=0x5605ad70f1e0, piece=0x7ff42c02de10, pixelpipe_flow=0x7ff445fe2e30, dev=<optimized out>) at /home/dtorop/src/darktable/src/develop/pixelpipe_hb.c:1120

But in thread 1 we see:

#4  _iop_color_picker_pickerdata_ready_callback (instance=0x5605a948d7a0, module=0x5605ad70f1e0, piece=0x7ff42c00a910, user_data=0x0) at /home/dtorop/src/darktable/src/gui/color_picker_proxy.c:285
        picker = 0x5605ae0f7ba0

These have the same module=0x5605ad70f1e0 but a different piece address. The sequence seems to be:

  1. _pixelpipe_process_on_CPU() runs for preview pipe
  2. _pixelpipe_process_on_CPU() does color picking via _pixelpipe_picker()
  3. _pixelpipe_process_on_CPU() raises DT_SIGNAL_CONTROL_PICKERDATA_READY with piece=...0a910
  4. _pixelpipe_process_on_CPU() continues on its way, returns, pixelpipe completes
  5. As user continues to fiddle with iop, we do yet another run of the preview pipe
  6. In thread 10, flow goes back to _pixelpipe_process_on_CPU() which calls _collect_histogram_on_CPU() with piece=...2de10
  7. meantime the GUI thread gets around to handling the DT_SIGNAL_CONTROL_PICKERDATA_READY signal and calls _iop_color_picker_pickerdata_ready_callback() with module=piece=...0a910
  8. here something goes awry and segfault

So does something happen in step 5 which could potentially deallocate/reallocate the piece for the tonecurve iop, so that piece which _iop_color_picker_pickerdata_ready_callback() receives is no longer a valid pointer, and hence piece->pipe->changed segfaults?

  • Don't know how relevant this is but in _hist_worker we dt_alloc_align() a new hist buffer but don't set stats->bufsize to zero?

I think this is OK. In _hist_worker we're potentially growing a long-running buffer, but never clearing it. The buf_size should start at zero when it is calloc'd as part of dt_dev_pixelpipe_iop_t in dt_dev_pixelpipe_create_nodes(). I will grant you that this isn't so clear, and I could be wrong. Certainly other parts of the newly allocated piece are explicitly set to 0.

  • Lately in 4.4 we had the Fix cacheline re-use for 4.4 PR, are you sure you have this in your codebase?

It is, 5e15836e7a0a9f3ec008bf0b90bea6c4923c23de is included in my branch.

dtorop commented 1 year ago

I am also thinking about the PIPE_REMOVE here.

When dt_dev_pixelpipe_change() is called with DT_DEV_PIPE_REMOVE it will cleanup and recreate all the nodes in the pipe.

Is it possible that if there is a backlog of DT_SIGNAL_CONTROL_PICKERDATA_READY signals, that the first call to _iop_color_picker_pickerdata_ready_callback() causes the all the nodes to be recreated, but that then a second call queued up call to _iop_color_picker_pickerdata_ready_callback() is passed the now unallocated piece, hence segfault? I could imagine this happening if the GUI was lagging a bit... Now that I think about it, I was playing audio files while I was working, and darktable + music often causes things to stutter on this computer.

If so, I'm not totally sure if there is a way for the callback to guarantee that piece points to currently allocated memory... Maybe the work of

  piece->pipe->changed |= DT_DEV_PIPE_REMOVE;
  piece->pipe->cache_obsolete = TRUE;

needs to happen before the signal is raised, if it is necessary at all to do. I'm still a bit fuzzy on what work is actually getting done by this.

jenshannoschwalm commented 1 year ago

Can't check as only phone for the weekend but your point about re-creating all nodes seems critical here indeed and your analysis seems to be fully correct.

Have you tried without that signalling? Recreating nodes should not be required, avoiding bad piece data should be enough.

dtorop commented 1 year ago

In eee79cd6b42a8baba71a3bf9d62b3a808aaa799d I made these changes to _iop_color_picker_pickerdata_ready_callback():

-  dt_develop_t *dev = module->dev;
-  if(!picker || !dev) return;
+  if(!picker) return;
-  dev->preview_pipe->changed |= DT_DEV_PIPE_REMOVE;
-  dev->preview_pipe->cache_obsolete = 1;
+  piece->pipe->changed |= DT_DEV_PIPE_REMOVE;
+  piece->pipe->cache_obsolete = 1;

With this note:

When applying an iop color picker, we should always be in the preview pipe, and hence piece->pipe == dev->preview_pipe. So don't need to figure out dev from module -- saves one reference/test.

It seems reasonable that my "optimization" is downright wrong, and @TurboGit's original code in ab795dd66a0c666b3b7db16f722165ea27727942 was the right way to do things. Checking out #6765 which added that code, and in particular the discussion in #6360, it looks like we've been down this path before...

If this reasoning is correct, then reverting the diffs above (but not the entire commit!) would be the way to go.

I do wish I could replicate this bug. I tried

modified   src/gui/color_picker_proxy.c
@@ -282,6 +282,7 @@ static void _iop_color_picker_pickerdata_ready_callback(gpointer instance, dt_io
   // modules between colorin & colorout may need the work_profile
   // to work properly. This will force colorin to be run and it
   // will set the work_profile if needed.
+  sleep(1);
   piece->pipe->changed |= DT_DEV_PIPE_REMOVE;
   piece->pipe->cache_obsolete = TRUE;

In the hopes that it would trigger a crash by giving the pixelpipe a chance to get ahead of the GUI thread. Alas, no crash. In darkroom.c expose(), the call to dt_dev_process_preview() triggers the pixelpipe thread, and hence even though that thread is running in the background, it means that the preview pipe never gets more than one run ahead of colorpicker. But even then the piece may be deallocated?

I can confirm that each call to _iop_color_picker_pickerdata_ready_callback() results in piece having a different value, as dt_dev_pixelpipe_change() is called in the interim with DT_DEV_PIXELPIPE_REMOVE and recreates all nodes. Perhaps it's a just utterly random when a reference to free'd memory causes a segfault, exacerbated by memory pressure?

Recreating nodes should not be required, avoiding bad piece data should be enough.

I still can't entirely wrap my head around this code, despite the comment and reading the past discussion. [A key bit of explanation seems to be this from @dterrahe:

It makes sense that if we expect to receive a signal while the pipe is being processed then taking shortcuts and skipping part of the pipe breaks things. Colorpickers will randomly work/not work depending on the location in the pipe.

(https://github.com/darktable-org/darktable/issues/6360#issuecomment-698422212)

followed by this from @TurboGit:

The hash calculation seems ok but the pipe is not properly synchronized with the color picker. That is, when you click on the pipe, the area is drawn and the hash is calculated (using the module params + the 4 box corners) but using the not yet adjusted params after having received the sample from the picker.

So the hash is calculated and stored, it will be reused (the whole point of the cache) when clicking again on the picker.... But as this did not store into the cache the new adjusted pixels nothing happen the second time.

We would probably need to avoid adjusting the cache until the params have been received and used by the module.

(https://github.com/darktable-org/darktable/issues/6360#issuecomment-699463291)

@jenshannoschwalm: is there a better way to do this? I know this overlaps with some of your recent work.

  // Invalidate the cache to ensure it will be fully recomputed.
  // modules between colorin & colorout may need the work_profile
  // to work properly. This will force colorin to be run and it
  // will set the work_profile if needed.
  piece->pipe->changed |= DT_DEV_PIPE_REMOVE;
  piece->pipe->cache_obsolete = TRUE;
dtorop commented 1 year ago

Thinking this out a bit more: If the reasoning above is correct, then the fix would be:

modified   src/gui/color_picker_proxy.c
@@ -276,14 +276,15 @@ static void _iop_color_picker_pickerdata_ready_callback(gpointer instance, dt_io
 {
   // an iop colorpicker receives new data from the pixelpipe
   dt_iop_color_picker_t *picker = darktable.lib->proxy.colorpicker.picker_proxy;
-  if(!picker) return;
+  dt_develop_t *dev = module->dev;
+  if(!picker || !dev) return;

   // Invalidate the cache to ensure it will be fully recomputed.
   // modules between colorin & colorout may need the work_profile
   // to work properly. This will force colorin to be run and it
   // will set the work_profile if needed.
-  piece->pipe->changed |= DT_DEV_PIPE_REMOVE;
-  piece->pipe->cache_obsolete = TRUE;
+  dev->preview_pipe->changed |= DT_DEV_PIPE_REMOVE;
+  dev->preview_pipe->cache_obsolete = TRUE;

   // iops only need new picker data if the pointer has moved
   if(_record_point_area(picker))

But if piece is no longer valid, then we get in trouble later in the function:

    if(!module->blend_data || !blend_color_picker_apply(module, picker->colorpick, piece))
      if(module->color_picker_apply)
        module->color_picker_apply(module, picker->colorpick, piece);

Because here we are using a possibly free'd piece yet again.

There's a lot of code, between blend_color_picker_apply() and the various color_picker_apply() functions which depend on a valid piece! Generally it seems to be used to figure out the current pipe and profile. Is it reasonable (and possible) to rewrite all of this to not depend on piece?

Below is a rundown of the uses of piece.

In blend_color_picker_apply(), we have:

dt_ioppr_get_pipe_current_profile_info(piece->module, piece->pipe)

And it is used in color_picker_apply for rgbcurve:

    const dt_iop_order_iccprofile_info_t *const work_profile = dt_ioppr_get_pipe_work_profile_info(piece->pipe);

for colorbalancergb:

  pipe_RGB_to_Ych(self, piece, (const float *)self->picked_color, Ych);
  pipe_RGB_to_Ych(self, piece, (const float *)self->picked_color_max, max_Ych);

for basicadj:

  const dt_iop_order_iccprofile_info_t *const work_profile = dt_ioppr_get_pipe_current_profile_info(self, piece->pipe);

for exposure

  _auto_set_exposure(self, piece->pipe);

for channelmixerrgb

  _auto_set_illuminant(self, piece->pipe);
jenshannoschwalm commented 1 year ago

I read this with great interest and tried to follow on the phone screen.

We have to make sure valid cache data. For that we would only need invalidation cache data for modules later than the one questioning it. BTW there is a pending pr simplifying that.

The reason for REMOVE is not clear to me, neither the full implications. But, if that forces a re-creating of all pipe nodes and thus pieces via a signal, that would be asynchronous and non-deterministic thus the picker could run wild.

dtorop commented 1 year ago

The reason for REMOVE is not clear to me, neither the full implications. But, if that forces a re-creating of all pipe nodes and thus pieces via a signal, that would be asynchronous and non-deterministic thus the picker could run wild.

Indeed! That would be a situation to avoid.

Another illuminating comment from https://github.com/darktable-org/darktable/issues/6360#issuecomment-698324513:

When re-enabling the picker, I could just use the cached previous result, which would make it pass simple tests, but of course it would just contain stale data.

Looking at the changes in #5576 suggested maybe adding a call to dt_iop_refresh_center in _iop_color_picker_callback_button_press would work, but it didn't make a difference. Is there another way to invalidate/force recalculation of the pipe for a particular module?

Otherwise maybe a simple extra check within the pipe whether a colorpicker is active on the module to be skipped might be sufficient to fix this. At potential the cost of subsequent unnecessary recalcs; we only need the first one, and after any picker selection changes (which seems to get triggered correctly).

Boldface/italics in 3rd paragraph is mine -- even if the logic of extant code were potentially suspect, the subsequent unnecessary recalcs are also a problem.

A big-picture suggestion (and I know this is a jump from a crash/backtrace which I can't repeat): The pixelpipe cache is great at providing buffers for and retrieving the output of steps of the pixelpipe. But colorpicker and histogram results for each step are not stored (though the colorpicker request is part of the hash). We store dt_iop_colorspace_type_t for each hash line, but not profile.

So long as colorpicker/histogram/profile data are not stored in pixelpipe cache, we have to jump through hoops (like the DT_DEV_PIPE_REMOVE hack) to make sure that this data is current. If this data is stored in the cache, then we can use the existing hash mechanism to guarantee it is valid, or re-run as much of the pixelpipe as needed to return that data.

What if the pixelpipe cache were expanded to store, for each iop that is run:

It wouldn't be many additional bytes per hash line. We already have cache->data and cache->dsc, this would create three more buffers attached to each line. Two of these would be fixed size per entry (colorpicker results and colorpicker). Histogram could be fixed to the maximum histogram data size, or could be alloc'ed and free'd when needed (which wouldn't be often, there's a maximum of one histogram (and one colorpicker) per pixelpipe run.

I haven't thought through clearly what the mechanism would be for the colorpicker/scopes to retrieve the data from the hash. Perhaps the results would need to end up accessible via the appropriate dt_dev_pixelpipe_t -- which is sort of the current situation, except that we can't pass a pointer directly into a particular node, which might not be extant.

jenshannoschwalm commented 1 year ago

While and after reading the source it gets even more unclear to me why we do DT_DEV_PIPE_REMOVE in this situation at all!

This should only be done if there are pieces/modules added or removed from the pipe. And this is not the case here.

A note about cache content, yes we provide output data but in all cases with expanded module/piece header we make sure the output of the preceding module is cached with high importance as we want to keep that for fast recalculation.

Let's wait for #14697 to be merged as it reduces complexity and come back to this. From my current understanding in this code

// Invalidate the cache to ensure it will be fully recomputed.
   // modules between colorin & colorout may need the work_profile
   // to work properly. This will force colorin to be run and it
   // will set the work_profile if needed.
piece->pipe->changed |= DT_DEV_PIPE_REMOVE;
piece->pipe->cache_obsolete = TRUE;

it might be sufficient to just invalidate the cache for the module-in-question and later. If not, we would go for including color profile in the cache. ok?

dtorop commented 1 year ago

@jenshannoschwalm: This makes sense. Also, after more thought:

Let's wait for #14697 to be merged as it reduces complexity and come back to this.

Makes sense. I am curious to see if there's a simple way to entirely remove piece entirely from the DT_SIGNAL_CONTROL_PICKERDATA_READY data. If that works and it's orthogonal to #14697...

it might be sufficient to just invalidate the cache for the module-in-question and later.

Using piece->module->request_color_pick in the pixelpipe cache hash calc was already present in ab795dd66a0c666b3b7db16f722165ea27727942 when the REMOVE code went in. Shouldn't that have been enough to force the module to recompute? It seems like setting DT_DEV_PIPE_REMOVE and cache_obsolete is too big a hammer.

Incidentally, right now the color picker output is stored in dt_iop_module_t as picked_color... variables. As there well only be one per-iop colorpicker active at one time, this could jump up a level and be stored in dt_dev_pixelpipe_iop_t. I'm not clear if there's a benefit for that.

I think I was too grandiose in suggesting adding histogram (maybe solved by your #14166) and colorpicker to pixelpipe cache lines. I realize that the purpose of the pixelpipe cache is to pass valid input to the next step in the pipe (ideally skipping reprocessing if valid data is in the cache). So the work is is simply for the cache to be aware when each step needs a recalculation (because input has changed, changed iop parameters, or a per-iop colorpicker/histogram request being changed).

jenshannoschwalm commented 1 year ago

It seems like setting DT_DEV_PIPE_REMOVE and cache_obsolete is too big a hammer.

I fully agree. There were issues with incorrect hashes now fixed. So maybe that extra hammer isn't needed any more. There are more code parts that wait for re-evaluation, likely more use of the invalidate-later strategy is possible.

dtorop commented 1 year ago

14832 should be a reasonable solution to make color picker callback not rely on piece, and appears to work.

Unfortunately, simply removing:

  pipe->changed |= DT_DEV_PIPE_REMOVE;
  pipe->cache_obsolete = TRUE;

doesn't work. In that case #6360 returns. The same reproduce steps from there still work:

  1. Chose a picture, with slight over exposure (to see effect of color picker adjustments)
  2. Click on color picker for "whilte relative exposure" or "black relative exposure"
  3. Reset parameters of filmic (e.g. you are not satisfied with result)
  4. re-do step (2)
  5. see error filmic remains on the std. parameters unless you redraw the area for the eye dropper

OTOH the picker in iops like RGB curve and tone curve work noticeably more smoothly with the REMOVE code gone. So I'd be eager to see if #14697 makes it easier to invalidate the cache entry and force reprocessing when necessary.

jenshannoschwalm commented 1 year ago

I have seen the PR and did a first sight. Your "unfortunately results" are to be expected as you well know.

The relevant function would be

void dt_dev_pixelpipe_cache_invalidate_later(
        const struct dt_dev_pixelpipe_t *pipe,
        const struct dt_iop_module_t *module)

where module is only used to get the iop_order. Maybe that function should change to

void dt_dev_pixelpipe_cache_invalidate_later(
        const struct dt_dev_pixelpipe_t *pipe,
        const uint_32t order)

So we could easily implement a function invalidating cachelines for all pipes?

dtorop commented 1 year ago

Apologies, I didn't write back earlier.

where module is only used to get the iop_order

My inclination would be to keep the signature

void dt_dev_pixelpipe_cache_invalidate_later(
        const struct dt_dev_pixelpipe_t *pipe,
        const struct dt_iop_module_t *module)

As is hides complexity from the caller, who in most cases will only need to write

        dt_dev_pixelpipe_cache_invalidate_later(piece->pipe, self);

rather than

        dt_dev_pixelpipe_cache_invalidate_later(piece->pipe, self->iop_order);

Few around here will describe themselves as fully understand the pixelpipe! (Though it's gotten easier to make sense of with recent changes.) Hiding internals a bit seems helpful for those interacting with it from other parts of dt...

So we could easily implement a function invalidating cachelines for all pipes?

That sounds great. Though we really only need to invalidate for preview pipe in the case of colorpicker.

I'm also curious if invalidating cache lines is really the right approach for making the picker work after reset. It seems like there are two sorts of pickers: One is the curve-style picker which gives a continuous readout, the other is the onr-timr action pickers (e.g. in filmic) which reads an area of the buffer in order to trigger "auto" changes to iop parameters. We don't want to sabotage performance of the pipe in the former case (by invalidating perfectly fine cache lines) to make sure that the "auto" pickers work.

I will look through the colorpicker code to try to understand this a bit better...

jenshannoschwalm commented 1 year ago

I really would prefer the iop_order argument here, might be better/safer for invalidating all pipes.

dtorop commented 1 year ago

might be better/safer for invalidating all pipes.

That sounds reasonable...