Closed andy-hnq closed 2 years ago
Is this reproducible in a fresh darktable environment (cleaned \AppData\Local\darktable - don't forget to back it up first - and also the cache directory \AppData\Local\Microsoft\Windows\INetCache\darktable)? Can you provide the CR2 (or at least the camera model name, so that public CR2 files can be used) and the XMP files?
Thanks for the reply.
At the moment the issue is not reproducible even without any cleanup. It may be that my efforts to repeat what I was doing (both with the same images and with new ones) were missing something. I should be capturing and processing some similar images this week so I will see if it recurs and if so try to figure out a more precise set of steps and upload the relevant files.
The camera is a Canon EOS 70D.
Hardly 'reproduced' but I have spent several hours processing some TIF colour negative scans and had one more crash towards the end of the session. The better images had been given two stars and I had the film roll filtered to two stars (the same was likely true for the crashes on the 9th). I had the first image open and clicked the thumbnail strip to select the fifth. Unfortunately by that stage in the session the edit position was far from simple. All of the images had been edited plenty with negadoctor, crop (some crop and rotate), orientation for some, colour balance rgb (some with multiple instances), exposure for some, filmic rgb for some, haze removal for some and retouch for one to remove a dust spot.
DT did seem to start struggling in some cases to update the image when tweaking settings, for example in colour balance rgb before it crashed. This may or may not be relevant.
So this time TIF instead of CR2 and negative scans instead of DSLR images.
I started DT again and tried to repeat it without success.
I have been able to reproduce this fairly consistently with DT up to v3.6.1 on Gentoo (portage up to date) for the past couple of days, so it may be related to a recently updated dependency?
Steps to reproduce (regardless whether fresh profile or not):
I can not reproduce this with DT built from master as of yesterday, though. I haven't tracked the issue down to a certain commit but would be willing to check with some guidance.
Thanks for testing. Do you get a backtrace that looks like mine?
I've had no further crashes since my previous post. Nothing has changed AFAIK.
Reading your backtraces, this seems to come from the modulegroups module. But sadly those Windows backtraces are not really helpful...
@eternalinflation : to be sure I understand you right, can you confirm that you have the issue with 3.6.1 but not with current master ? If it's the case, then I guess that one of the recent changes in modulegroup has fixed the issue, and then upcoming 3.8 will be ok...
@AlicVB It turned out that any build from Git works OK incl. v3.6.1, so in my case the issue seems at least related to the ebuild, i.e. the package configuration, and the 'fix' is likely due to the build environment rather than a certain commit. I'm attaching logs as a diff between the output from the direct Git build (left) and the package/portage build (right): build-diffs.log as well as the bt from the package-based build: darktable_bt_AASUC1.txt (darktable run with -t 1), hopefully these provide some insight. Even if this issue may be related to the build environment I'd think it'd still be worth looking at why this is causing a segfault. Let me know what else I could test / provide.
Another crash - once again I hit the spacebar to move to the next image in darkroom. I'd been using DT for literally hours before this and had done the same thing many times without a crash.
This one is in libgtk... possibly a different issue, but like I said, the same action on my part.
One possible common factor - I had done a selective copy/selective paste on several modules to replicate colour balance rgb (two instances, one with a parametric mask), filmic rgb (off on this occasion) and exposure, and was then tweaking each image individually. I had changed the tint in the WB module on the image just before the crash.
I tried to repeat it with the same modules copied from the same source image onto some other target images after restarting but DT did not crash.
Please provide an XMP with a scenario that crashes. Without one, no one is able to reproduce this or get an idea of what the cause could be. Are you using plain standard raw files?
Hi, I completely understand the need to reproduce this but the first step in doing that is for me to be able to repeat the issue on my own machine. I have not been able to do that - apart from the three crashes in quick succession on the 9th Nov this has proved to be quite rare. I will keep striving to figure out a set of repeatable factors but with so few crashes it is difficult.
Four of the five crashes (three on the 9th and one yesterday) have been with CR2 raws from my Canon EOS 70D. One (Nov 15th) was with a TIFF produced by VueScan from a negative scanner.
I think I can reproducibly nail it down to the following:
Thread 8 "worker res 0" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe48be640 (LWP 242575)]
0x00007fffcd1aa077 in pixel_correction (exposure=<optimized out>, factors=factors@entry=0x5555578e2bf0, sigma=sigma@entry=1.41421354) at /usr/src/debug/media-gfx/darktable-3.6.1-r2/darktable-3.6.1/src/iop/toneequal.c:783
783 result += gaussian_func(expo - centers_ops[i], gauss_denom) * factors[i];
(gdb) bt
#0 0x00007fffcd1aa077 in pixel_correction (exposure=<optimized out>, factors=factors@entry=0x5555578e2bf0, sigma=sigma@entry=1.41421354)
at /usr/src/debug/media-gfx/darktable-3.6.1-r2/darktable-3.6.1/src/iop/toneequal.c:783
#1 0x00007fffcd1aa174 in compute_lut_correction._omp_fn.0 () at /usr/src/debug/media-gfx/darktable-3.6.1-r2/darktable-3.6.1/src/iop/toneequal.c:1430
#2 0x00007ffff7a0ad76 in GOMP_parallel () at /usr/lib/gcc/x86_64-pc-linux-gnu/11.2.0/libgomp.so.1
#3 0x00007fffcd189643 in compute_lut_correction (g=<optimized out>, offset=<optimized out>, scaling=<optimized out>)
at /usr/src/debug/media-gfx/darktable-3.6.1-r2/darktable-3.6.1/src/iop/toneequal.c:1421
#4 0x00007fffcd1cb50b in update_curve_lut (self=self@entry=0x5555577f9b30)
at /usr/src/debug/media-gfx/darktable-3.6.1-r2/darktable-3.6.1/src/iop/toneequal.c:1475
#5 0x00007fffcd1cc709 in commit_params (self=0x5555577f9b30, p1=0x555558e57ca0, pipe=<optimized out>, piece=<optimized out>)
at /usr/src/debug/media-gfx/darktable-3.6.1-r2/darktable-3.6.1/src/iop/toneequal.c:1537
#6 0x00007ffff7d6cbab in dt_iop_commit_params
(module=0x5555577f9b30, params=0x555558e57ca0, blendop_params=0x555559d0ae10, pipe=pipe@entry=0x5555566e0400, piece=0x7fffc803e9d0)
at /usr/src/debug/media-gfx/darktable-3.6.1-r2/darktable-3.6.1/src/develop/imageop.c:1768
#7 0x00007ffff7db50a6 in dt_dev_pixelpipe_synch_all (pipe=0x5555566e0400, dev=0x5555566b2000)
at /usr/src/debug/media-gfx/darktable-3.6.1-r2/darktable-3.6.1/src/develop/pixelpipe_hb.c:411
#8 0x00007ffff7db54d0 in dt_dev_pixelpipe_change (pipe=0x5555566e0400, dev=dev@entry=0x5555566b2000)
at /usr/src/debug/media-gfx/darktable-3.6.1-r2/darktable-3.6.1/src/develop/pixelpipe_hb.c:458
#9 0x00007ffff7d65ee5 in dt_dev_process_image_job (dev=0x5555566b2000)
at /usr/src/debug/media-gfx/darktable-3.6.1-r2/darktable-3.6.1/src/develop/develop.c:575
#10 0x00007ffff7d008c1 in dt_dev_process_image_job_run (job=<optimized out>)
at /usr/src/debug/media-gfx/darktable-3.6.1-r2/darktable-3.6.1/src/control/jobs/develop_jobs.c:55
#11 0x00007ffff7cf906a in dt_control_run_job_res (res=0, control=0x5555555985f0)
at /usr/src/debug/media-gfx/darktable-3.6.1-r2/darktable-3.6.1/src/control/jobs.c:215
#12 dt_control_work_res (ptr=<optimized out>) at /usr/src/debug/media-gfx/darktable-3.6.1-r2/darktable-3.6.1/src/control/jobs.c:519
#13 0x00007ffff7b73cfe in start_thread () at /lib64/libpthread.so.0
#14 0x00007ffff337024f in clone () at /lib64/libc.so.6
(gdb)
Given
#ifdef _OPENMP
#pragma omp simd aligned(centers_ops, factors:64) safelen(PIXEL_CHAN) reduction(+:result)
#endif
right before the line that triggers the SIGSEGV I rebuilt DT without _OPENMP - and the problem is gone for me.
CPU is:
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 94
model name : Intel(R) Xeon(R) CPU E3-1225 v5 @ 3.30GHz
stepping : 3
microcode : 0xea
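One detail in the backtrace above may be worth checking. The `aligned(centers_ops, factors:64)` clause is a promise to the compiler that both pointers are 64-byte aligned, and the `factors` address shown in frame #0 does not look 64-byte aligned. Whether that is the actual cause of the SIGSEGV is an assumption and is not confirmed anywhere in this thread. A minimal sketch of the check follows; the helper names `misalignment` and `is_aligned` are hypothetical and are not darktable functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bytes by which `addr` overshoots the previous `alignment`
 * boundary. 0 means the address is aligned.
 * `alignment` must be a non-zero power of two. */
size_t misalignment(uintptr_t addr, size_t alignment)
{
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (size_t)(addr & (alignment - 1));
}

/* Convenience wrapper for real pointers (hypothetical helper). */
int is_aligned(const void *ptr, size_t alignment)
{
  return misalignment((uintptr_t)ptr, alignment) == 0;
}
```

For the address from frame #0, `misalignment(0x5555578e2bf0, 64)` is 48 and `misalignment(0x5555578e2bf0, 32)` is 16, while the 16-byte case gives 0. If the compiler emitted aligned 32-byte or wider vector loads on the strength of the `aligned` clause, a load from that address would fault.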
Hmm... I thought I had reproduced it. I managed 3 libgtk crashes in quick succession on the same image and was trying to figure out a summarised set of steps and then... the crashes stopped.
I had decided to try a fresh profile and cache directory on the off-chance that this might make things worse instead of better... and it seemed to do exactly that, but I could not then repeat the same result.
What I did observe was that the history stack in the database (and UI) seems to be getting out of sync with the XMP for some images. I'm not certain how this happens but I have observed it a few times. It may be related to going backwards in the history stack and then applying new changes to replace those that I did not like.
I'm not sure whether the history sync symptom is related to the crashes but I am sure that I saw this issue on the image that triggered the three crashes today. With the benefit of hindsight what I should have done was to take a copy of the XMP files straight after the crash.
I do think that my fresh profile and cache prove that the problem is not being caused by some fossilised stuff hanging around from an older DT version.
I was looking for any issues related to XMP/History and found this one. Sounds pretty similar
Indeed, very similar. This happens into the modulegroups lib! Looks like we have some kind of memory corruption there. Would be nice to be able to reproduce!!!!
Yes, I've spotted modulegroups too. But what is surprising, and what I can't understand, is that according to @eternalinflation's test this doesn't happen when he compiles any git version. iiuc, that would point to some compilation parameters that are not handled well by dt... but in modulegroups, iirc, there's absolutely nothing fancy in terms of computation or whatever... no openmp, etc...
second problem, is that -of course- I've not managed to reproduce this issue, so I'll have a look at modulegroup code, but don't expect too much...
I think there may be multiple issues here, for certain I get backtraces both in modulegroups and in libgtk. Both happen when selecting a new image inside darkroom. Are they the same or different? I don't know. The same question can be asked of the Linux issues - one of these happens in lighttable rather than darkroom. I'm not suggesting that we raise multiple issues at this time - I don't think we know enough yet but we should keep an open mind and not assume that we are all talking about the same problem.
I've been trying to reproduce the XMP/database mismatch but so far without success. It definitely happened - several of my images showed up much brighter and some were majorly clipping when I imported them into the clean DT profile... which should not have happened if the XMP was correctly reflecting what I'd been doing before I reset the profile.
Although my issue may have a different reason I thought I'd still share my findings:
I built my (crashing) binaries with the -fomit-frame-pointer -ftree-vectorize gcc flags; especially the latter quite possibly doesn't play nice with the _OPENMP option enabled (see the SIMD pragma and the possibly resulting multiple attempts at vectorization). Removing that gcc flag generates a working binary, also with _OPENMP enabled.
The build script from Git is using its own, sane gcc options, which is why the direct Git build works as well.
I hope this saves time for anyone else who's running into the same issue.
EDIT: Set of 'custom' flags leading to a crash together with _OPENMP enabled are -march=native -O2. I believe that both -O2 and -ftree-vectorize enable -ftree-loop-vectorize and -ftree-slp-vectorize.
EDIT 2: I went back to gcc v10.3.0, rebuilt and the crash is no longer reproducible for me. Unfortunately I didn't get to the bottom of it but it looks like it's related to changes in gcc's optimizer between v10.3.0 and v11.2.0. I wouldn't call debugging optimized binaries one of my strengths, so let me know what else I could provide to shed more light on this.
You're right, I think that there are multiple issues. Hopefully one of them is now understood (fixed ?) reading the last @eternalinflation msg. So there's now the "modulegroup" one (if we consider they are not the same). For this one, are you able to test the latest dev version ? I know you are on Windows, but even if you are not in a position to compile it yourself, you may grab a very recent development version on the pixls.us forum... Thanks
@AlicVB : I don't have time for a PR tonight, but will do one tomorrow morning if no one beats me to it... This looks wrong:
in _dt_dev_image_changed_callback():
if(sqlite3_step(stmt) == SQLITE_ROW)
{
  const char *preset = (char *)sqlite3_column_blob(stmt, 0);
  dt_lib_presets_apply(preset, self->plugin_name, self->version());
}
The preset is not a blob but a name, as per the SELECT above.
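For context, the distinction is more than cosmetic. SQLite documents a terminating NUL only for values fetched with sqlite3_column_text(); bytes fetched as a blob come with a separate length and no such guarantee, so passing them to code that expects a C string is fragile. The sketch below does not use SQLite at all and is not the fix that was committed. It only illustrates, with a hypothetical helper `bytes_to_cstring`, what turning length-delimited bytes into a proper string involves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Copy `len` raw bytes into a freshly allocated, NUL-terminated string.
 * Returns NULL on allocation failure; the caller frees the result.
 * Hypothetical helper, not part of darktable or SQLite. */
char *bytes_to_cstring(const void *data, size_t len)
{
  assert(data != NULL || len == 0);
  if(len == SIZE_MAX) return NULL; /* len + 1 would overflow */
  char *out = malloc(len + 1);
  if(!out) return NULL;
  if(len > 0) memcpy(out, data, len);
  out[len] = '\0'; /* the terminator the raw bytes do not carry */
  return out;
}
```

When the column really holds a name, asking the database for text in the first place avoids the extra copy entirely.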
@AlicVB : I have fixed this part this morning.
@andy-hnq : What would help a lot here is to be able to get a backtrace from GDB. Sadly I cannot help on Windows.
If anyone knows how to do a darktable postmortem (on Windows) then I'd very much like to know. My C/C++ skills are very rusty indeed and I no longer have tools like Visual Studio available.
Following up on the DB/XMP mismatch. I queried the sqlite DB and wrote some perl to find any remaining mismatches. The ones that I found yesterday have been resolved by my attempts to reproduce the crash but there were still six remaining. In all cases the XMP had fewer steps than the DB/UI.
\2021-06/IMG_5412.CR2.xmp db ops = 17, xmp ops = 16 <== missing monochrome(off) at end. monochrome(on) is not in stack
\2021_05_30/IMG_5335.CR2.xmp db ops = 27, xmp ops = 0 <== xmp file is all nulls (!)
\2021_07_20/IMG_5704.CR2.xmp db ops = 24, xmp ops = 23 <== missing sharpen(off) at end. sharpen(on) is not in stack
\2021_08_02/IMG_5848.CR2.xmp db ops = 15, xmp ops = 14 <== missing filmic at end - prev step in UI is also filmic, both are enabled
\2021_11_16/IMG_6408.CR2.xmp db ops = 20, xmp ops = 19 <== missing WB at end, prev step in UI is WB, both enabled
\2018-02/IMG_3335.CR2.xmp db ops = 21, xmp ops = 15 <== missing default modules at beginning.
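The comparison step behind that list can be sketched as below. This is not the perl script mentioned above, which is not shown in the thread. It assumes the two step counts per image have already been extracted from the database and from the sidecar file, and the type and function names (`history_count_t`, `count_mismatches`, `largest_gap`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One row of the comparison: history steps in the library database
 * versus history steps found in the XMP sidecar. */
typedef struct
{
  const char *xmp_path;
  int db_ops;
  int xmp_ops;
} history_count_t;

/* The six counts reported in the comment above. */
const history_count_t reported[] = {
  { "\\2021-06/IMG_5412.CR2.xmp", 17, 16 },
  { "\\2021_05_30/IMG_5335.CR2.xmp", 27, 0 },
  { "\\2021_07_20/IMG_5704.CR2.xmp", 24, 23 },
  { "\\2021_08_02/IMG_5848.CR2.xmp", 15, 14 },
  { "\\2021_11_16/IMG_6408.CR2.xmp", 20, 19 },
  { "\\2018-02/IMG_3335.CR2.xmp", 21, 15 },
};

/* How many rows have differing step counts. */
size_t count_mismatches(const history_count_t *rows, size_t n)
{
  assert(rows != NULL || n == 0);
  size_t mismatches = 0;
  for(size_t i = 0; i < n; i++)
    if(rows[i].db_ops != rows[i].xmp_ops) mismatches++;
  return mismatches;
}

/* Largest absolute difference between the two counts. */
int largest_gap(const history_count_t *rows, size_t n)
{
  assert(rows != NULL || n == 0);
  int gap = 0;
  for(size_t i = 0; i < n; i++)
  {
    const int d = abs(rows[i].db_ops - rows[i].xmp_ops);
    if(d > gap) gap = d;
  }
  return gap;
}
```

On the reported data every row is a mismatch, and the largest gap belongs to the sidecar that was all nulls.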
I don't know if this is related to the crash but it was definitely impacting the image that caused my crashes yesterday. I'd also say there is a good chance that it impacted the other images that have crashed. On the other hand, I have not so far had a crash with any of the remaining images.
Some thoughts...
I just realized that Github did not generate a notification after I edited an earlier comment of mine https://github.com/darktable-org/darktable/issues/10384#issuecomment-977125968. For everybody who's interested please re-read the comment. Just to clarify my issue is triggered by switching between lighttable and darkroom as mentioned in this comment. I wasn't able to reproduce when staying in darkroom using either space bar or the film strip.
I was unable to find anything like a nightly Win32 build so I followed the instructions at https://github.com/darktable-org/darktable/blob/master/packaging/windows/BUILD.md and built my own installer from git.
I got the build to run after a few issues and have just installed it. I will report back if any further crashes happen.
A question... back when I got paid to write C/C++ we used a debug memory allocator that overwrote memory with known junk values just after allocation and just before free. It was pretty good at improving our chances of picking up bugs like using uninitialised values or accessing memory after it had been freed. Is this possible with a (custom!) darktable build?
You can build with libasan, see:
$ ./build.sh --help
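For readers unfamiliar with the kind of allocator described in the question, here is a minimal sketch of the idea. It is not part of darktable and is far cruder than AddressSanitizer, which tracks freed memory rather than just overwriting it. The names `debug_malloc`, `debug_free` and the poison values are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define POISON_ALLOC 0xCD /* written into fresh allocations */
#define POISON_FREE 0xDD  /* written just before release */

/* Header stored in front of each block so the size is known at free time.
 * The extra union members only pad it to a generous alignment. */
typedef union
{
  size_t size;
  long double ld;
  long long ll;
  void *p;
} debug_header_t;

/* Allocate `size` bytes filled with POISON_ALLOC. */
void *debug_malloc(size_t size)
{
  if(size > SIZE_MAX - sizeof(debug_header_t)) return NULL;
  debug_header_t *h = malloc(sizeof(debug_header_t) + size);
  if(!h) return NULL;
  h->size = size;
  unsigned char *user = (unsigned char *)(h + 1);
  memset(user, POISON_ALLOC, size);
  return user;
}

/* Overwrite the block with POISON_FREE, then release it. */
void debug_free(void *user)
{
  if(!user) return;
  debug_header_t *h = ((debug_header_t *)user) - 1;
  assert(h->size <= SIZE_MAX - sizeof(debug_header_t));
  /* volatile stores so the optimizer does not drop the "dead" writes */
  volatile unsigned char *bytes = user;
  for(size_t i = 0; i < h->size; i++) bytes[i] = POISON_FREE;
  free(h);
}
```

Code that reads an uninitialised field then sees 0xCDCD... instead of whatever happened to be in memory, which makes the bug far more likely to show up the same way on every run.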
Disclaimer: I haven't tried to reproduce or even read the backtraces provided (honestly, I'm getting the impression that two separate bugs are being discussed at the same time here?).
But just reading about modulegroups and changing images in darkroom reminded me of similar bugs where switching from an image with several instances of the same module to an image that only had one (the base) instance would cause the now deleted extra instances to still be referenced. This could be happening with widgets in QAP for example. If this were the cause, it might be somewhat easier to identify a reproducing set of xmps.
That is certainly true of the images involved in my crashes. In one case multiple instances of exposure, in another I have multiple colour balance rgbs. The second (and third sometimes) instance uses a mask. You may be correct but unfortunately it's still not reproducible - I've been back to those images several times trying to reproduce the issue but without success.
Agreed - I didn't mean to clutter this discussion with a different topic. I created https://github.com/darktable-org/darktable/issues/10498 instead.
I have a reproducible issue. It isn't exactly the original issue but I can reproduce it on an otherwise unmodified image (i.e. starting from just the default modules).
This was using my self built version from Friday's git repository... which does contain Pascal's fix. The installer name is 'darktable-3.7.0+1549~g5de65d1fe-win64'
It produces a libgtk access violation on address -1
This does not produce the crash if I select another image in darkroom so that issue is still not reproduced.
Should I raise this as another issue or keep the discussion here?
Do the steps above produce a crash for other people?
I can reproduce a crash using simplified steps:
I get a "g_object_ref: assertion 'G_IS_OBJECT (object)' failed" in gtk_widget_grab_focus(dt_ui_center(darktable.gui->ui));
It looks like possibly the enable toggle button still had the focus even after being deleted and then when gtk tries to switch focus to the central ui, it tries to access the deleted button.
Because if I don't follow these exact steps, for example, I switch to lighttable by double clicking the image instead of pressing the shortcut, there is no crash. In this case, the center already has the focus.
Also, if I create a shortcut to the enable button of the module and use that to toggle it off and on before deleting it, there is no crash.
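The suspected pattern, a focus reference that outlives the widget it points to, can be sketched in plain C as below. This is not GTK or darktable code; the types `widget_t` and `ui_t` and both functions are hypothetical. In GTK itself the comparable tool would be a weak pointer registered with g_object_add_weak_pointer(), which is cleared automatically when the object is finalized.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct widget_t
{
  int id;
} widget_t;

typedef struct ui_t
{
  widget_t *focus; /* widget that currently owns keyboard focus, or NULL */
} ui_t;

/* Destroy a widget, first dropping any focus reference to it so that a
 * later focus change never touches freed memory. */
void ui_destroy_widget(ui_t *ui, widget_t *w)
{
  assert(ui != NULL);
  if(ui->focus == w) ui->focus = NULL;
  free(w);
}

/* Move focus to `target` and return the previously focused widget,
 * which is NULL if nothing (still alive) had the focus. */
widget_t *ui_grab_focus(ui_t *ui, widget_t *target)
{
  assert(ui != NULL);
  widget_t *previous = ui->focus;
  ui->focus = target;
  return previous;
}
```

Without the check in `ui_destroy_widget`, the later focus change would receive a dangling pointer as the previous focus holder, which matches the shape of the crash described above.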
This may be a completely different bug; what was the error your crash log reported?
@andy-hnq @dterrahe : I can reproduce an issue with both of your step-by-step procedures and I can confirm that you both hit the same problem. I'll try to have a look...
Sounds like the same issue to me... along with some analysis of the reason. After I'd posted my message I also managed to reproduce it using a second instance of exposure (in addition to the instance created when the image is imported) so the first few steps are not essential and neither is the specific module (color balance rgb).
Sample backtrace attached - sorry should have done this with the original post.
I just had another look at my previous backtraces. I suspect from that the issue reproduced today is a different problem.
All of the other backtraces include _display_module_trouble_message_callback ... both the libmodulegroups and libgtk crashes.
What is a 'module trouble message' please?
Would selecting a new image always call this function or is it only called if there is some 'trouble' with the module?
@AlicVB I'm already submitting a bug for this latest focus issue. I can't/haven't yet reproduced the other one, but it might be easier to pinpoint once this unrelated focus issue is fixed.
Partially answering my own question I went looking for other issues mentioning this function and found a reference to dt_iop_set_module_trouble_message.
Using grep on the source revealed...
$ find . -type f |xargs grep -l dt_iop_set_module_trouble_message
./common/imagebuf.c
./develop/imageop.c
./develop/imageop.h
./develop/pixelpipe_hb.c
./iop/cacorrect.c
./iop/channelmixerrgb.c
./iop/lens.cc
./iop/temperature.c
I think that white balance (aka temperature) is the only one I was using on the images that triggered the crash. I've seen a warning banner in this module when messing with multiple illuminants but that was definitely not happening when the crashes occurred and did not happen on the images involved, either beforehand or afterwards.
As far as I am aware there were no banner messages in any modules when the crashes occurred... and none beforehand and none afterwards for those images.
I wondered... if it wasn't one of those modules then maybe...
./develop/pixelpipe_hb.c
if(piece->enabled != hist->enabled)
{
  if(piece->enabled)
    dt_iop_set_module_trouble_message(piece->module, _("enabled as required"), _("history had module disabled but it is required for this type of image.\nlikely introduced by applying a preset, style or history copy&paste"), NULL);
  else
    dt_iop_set_module_trouble_message(piece->module, _("disabled as not appropriate"), _("history had module enabled but it is not allowed for this type of image.\nlikely introduced by applying a preset, style or history copy&paste"), NULL);
  dt_print(DT_DEBUG_PARAMS, "[pixelpipe_synch] enabling mismatch for module %s in image %i\n", piece->module->op, imgid);
}
History copy/paste was definitely happening. I don't know how to cause either of the trouble conditions above so not sure how to test the theory.
@dterrahe : don't bother with a new issue, I have the fix almost ready (just need some tests to be sure that don't break anything ;)
Sorry, I meant I was submitting a fix (not a bug). Done now. But maybe you have a better/more fundamental solution. In which case you probably also removed the old workaround (without, as I did, explicitly assigning focus).
Hmm... I built again from source just now to pick up the fix to the focus issue.
...another crash I'm afraid :(
I was surprised that my installer had exactly the same name but 'git pull' had definitely done its stuff (I checked the source and the changes are there) and the build had... well... built.
So to make sure I tried the previously crashing (simpler) steps.
This crashes sooner and has a different looking backtrace but I guess that the change has an issue? ... or could it be my build?
Tried twice with the same outcome.
darktable_bt_Z8XRD1.txt darktable_bt_LB6PD1.txt
If I use 'duplicate' and don't touch the slider then it doesn't crash, even when going back to lighttable.
I then started to doubt whether I was moving the slider yesterday or not but I think I was.
So I tried the exact original steps using color balance rgb from my post. These steps also crash on delete... without me touching any sliders etc in the module, just selecting the preset.
EDIT: correct BT for the last crash... darktable_bt_8DTFD1.txt
With current master seems I can't reproduce using your steps.
I can't reproduce it either with current master (after @dterrahe's commits)...
Can you by any chance try to compile with only the first fix ?
Something like : git checkout c359de81de9d1a6bdc6b605850ed891e562ab577
+ all your compile steps
Thanks !
OK so (before @AlicVB's post) I thought I would try a clean build. The build notes don't say how to do this so in the end I wiped the build directory completely and started from the cmake -G "MSYS Makefiles" ... step
I also did another git pull - it said I was up to date.
The build took about an hour so was definitely doing more.
Uninstalled and then installed the new one.
Same issue still :(
Another thought (on repeatability)... The build notes say
NOTE: The package created will be optimized for the machine on which it has been built, but it could not run on other PCs with different hardware or different Windows version. If you want to create a "generic" package, change the first cmake command line as follows:
Not sure how it is optimised for my machine or whether this might make a difference. Should I do the 'generic' thing or do the optimised thing? I guess keep it optimised otherwise I will be changing two things at once.
It's getting late for me now. I will try the checkout and another build tomorrow.
Build done from 'git checkout c359de81de9d1a6bdc6b605850ed891e562ab577'.
No crashes from my original steps nor on delete :)
I used the original build steps, so still 'optimised' (whatever that means).
Thanks for all your work trying to reproduce the crash. It can be very tedious but is very helpful. The devs can't catch all bugs themselves (they tend to not show up in the workflows they've been tested on during development because then they would already have been fixed) so volunteers taking the trouble to file bugs are essential. Thanks again!
@dterrahe : so the simplest fix is to revert your second commit, imho... what do you think ? this one : https://github.com/darktable-org/darktable/commit/9d8d38674e1b57b9b741e07e260526679f3e3e7f
I got the impression from the latest message from @andy-hnq that he no longer is seeing the problem, so nothing to fix. Did I misread that? Do you still see crashes yourself?
well, I may very well be wrong , but I've read :
Build done from 'git checkout c359de8'. No crashes from my original steps nor on delete :)
So for me dt compiled at this commit doesn't crash, contrary to master... @andy-hnq : who is right ? :)
And no, I don't see any crash or suspect msg here with master...
The crashes only stopped by going back to that specific commit.
No idea why I see them and you folks don't. Sorry.
Ah, sorry, yes I did misread/not check which commit was being tested. So yes, reverting the second part would be the easy "fix" for now. @TurboGit you may want to do this now so we get more testing on that.
Still strange that it causes crashes for him but not us. Maybe a gtk version issue.
Introduction
Describe the bug/issue
To Reproduce
This seems to be intermittent but I've had several crashes over the past few days. I've had it when switching to a new (.CR2) image in darkroom mode, both when using the thumbnail strip at the bottom and when using the spacebar to move to the next image.
Expected behavior
Should move to the selected image without crashing
Which commit introduced the error
Installed 3.6.1 using darktable-3.6.1-win64.exe downloaded from darktable.org on 2nd Nov. Prior install was 3.4 which was uninstalled using control panel before installing 3.6.1
Platform
Additional context
darktable_bt_TTJ7B1.txt darktable_bt_QRMJC1.txt darktable_bt_H54IC1.txt