Closed yingziwu closed 1 year ago
Well this is a weird one! I can duplicate it. But only with this epub... and only with the repo version of Sigil. Works fine on Windows too. I'm going to have to go out on a limb and say Arch has changed their default cflags for compiling packages and broken something.
Nothing has been changed in our Gumbo implementation (which seems to be the culprit here) in two years.
Though to be fair, this epub might have broken earlier repo versions of Sigil as well. Can't be sure. I'll have to investigate more this weekend.
The trouble seems to come when trying to remove a Nav landmark from the landmarks section of the Nav that was associated with the just deleted xhtml file.
Perhaps, the Nav was not properly built in the first place?
I will try to recreate this "abort due to an assert failure" on macOS as well.
I'll have to make sure I was testing with other Epub3s, then. But as mentioned, even this same epub doesn't crash for me when using a version of Sigil I compiled myself on Arch (or when using the official version of Sigil on Windows). Only the Arch repo-supplied version of Sigil crashes when deleting xhtml from this particular epub (for me).
This wouldn't be another case like the inline static constants fiasco related to -flto would it? I need to find out if the default CFLAGS for building community packages on Arch has changed.
Unless asserts are disabled on other platforms, then probably a cflags change, compiler/optimization bug most likely.
Tested on macOS and no assert failure, no crash. Even ran it inside lldb and no warnings, no abort. Seems to work like a charm.
So Arch have either patched sigil, or used an external gumbo build instead of our internal one (which will always fail), or changed the cflags used for the gumbo or Sigil builds, or just are using a broken compiler/optimization combination.
It seems to be only reproducible with the Arch internal build. I am going to try with my own Manjaro VM builds just to confirm.
Tested my own builds on Manjaro Linux of Sigil 1.9.30 and no crashes at all. Could not access Sigil-1.9.30 on Manjaro as it is still in "testing" so grabbed their official Sigil-1.9.20 distribution build. It aborts with the same assert failure for this sample epub.
So whatever change Arch made to their builds happened in Sigil-1.9.20 or earlier as their official build shows this error by my own builds do not.
So almost definitely a compiler/optimization issue or cflags issue in Arch/Manjaro.
so grabbed their official Sigil-1.9.20 distribution build. It aborts with the same assert failure for this sample epub.
I wondered about that. Thanks for confirming.
Unfortunately, I can't find anything in the history of Arch's makepkg.conf that would indicate any relevant changes since the addition of -flto=auto a year or so ago.
Whatever it is may have been there for quite awhile. Just needs the right epub to expose it. shrug
The bug itself is a compiler bug as far as I can tell. The bug is not in gumbo. The bug is in ResourceObjects/NavProcessor.cpp in the code that tries to update the Nav's landmarks section after deleting a file that could have had a landmark set.
Please try this ...
When I try this on Manjaro (distribution build) I get no abort at all.
So something in the below code while loop is not properly detecting that the pos is -1 (that resource's href does not exist in landmarks) and causing the abort trying to remove something at pos = -1, but it will detect that the landmark's list is empty preventing the bug:
void NavProcessor::RemoveLandmarkForResource(const Resource * resource)
{
QList<NavLandmarkEntry> landlist = GetLandmarks();
QWriteLocker locker(&m_NavResource->GetLock());
int pos = GetResourceLandmarkPos(resource, landlist);
while((pos > -1) && !landlist.isEmpty()) {
landlist.removeAt(pos);
pos = GetResourceLandmarkPos(resource, landlist);
}
SetLandmarks(landlist);
}
int NavProcessor::GetResourceLandmarkPos(const Resource *resource, const QList<NavLandmarkEntry> & landlist)
{
QString resource_book_path = resource->GetRelativePath();
for (int i=0; i < landlist.count(); ++i) {
NavLandmarkEntry le = landlist.at(i);
QString href = ConvertHREFToBookPath(le.href);
QStringList parts = href.split('#', QT_ENUM_KEEPEMPTYPARTS);
if (parts.at(0) == resource_book_path) {
return i;
}
}
return -1;
}
My guess is a C++ compiler bug that happens because of some specific optimization level or optimization switch they have enabled that we do not.
Have you been able to reproduce with any other epubs yet?
No but that epub3 does in fact pass epubcheck so I do not think anything is technically wrong with it.
According to my /etc/makepkg.conf file on my just updated Manjaro build, here is what the builders are using:
CFLAGS="-march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions \
-Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security \
-fstack-clash-protection -fcf-protection"
CXXFLAGS="$CFLAGS -Wp,-D_GLIBCXX_ASSERTIONS"
LDFLAGS="-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now"
LTOFLAGS="-flto=auto"
I wonder if we try the same CFLAGS and CXXFLAGS and LTOFLAGS if we could recreate the bug in our own builds with this problem epub?
Might be worth a shot?
Yes, it's certainly possible. I was able to replicate the lto issue a while back by using the same flags from the makepkg.conf. You just have to remember that the LTO flags get appended to the cflags if I recall correctly.
On Fri, Mar 17, 2023, 12:20 PM Kevin Hendricks @.***> wrote:
No but that epub3 does in fact pass epubcheck so I do not think anything is technically wrong with it.
According to my /etc/makepkg.conf file on my just updated Manjaro build, here is what the builders are using:
CFLAGS="-march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions \ -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security \ -fstack-clash-protection -fcf-protection" CXXFLAGS="$CFLAGS -Wp,-D_GLIBCXX_ASSERTIONS" LDFLAGS="-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now" LTOFLAGS="-flto=auto"
I wonder if we try the same CFLAGS and CXXFLAGS and LTOFLAGS if we could recreate the bug in our own builds with this problem epub?
Might be worth a shot?
— Reply to this email directly, view it on GitHub https://github.com/Sigil-Ebook/Sigil/issues/711#issuecomment-1474082627, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACG3CXWAG7CEWBDY6KLHOITW4SFMPANCNFSM6AAAAAAV6GCVMI . You are receiving this because you commented.Message ID: @.***>
Can you try my test with a failing distro sigil version that deletes the nav landmarks completely first and see if the abort happens on not. In my tests it will not which in fact indicates some miscompilation of the code from the NavProcessor.
It will have to be a little later. But yes. I'll try it when I get back to my dev machine.
No problems. I bet some optimization is reordering the inside of the while loop. That should not happen but seems to be what I am seeing. I have no idea why this would happen only with this epub though.
It is interesting that if I open the nav.xhtml in the problem epub in CodeView and then run Mend on just its contents, all of the aborts go away when I try to delete any xhtml file.
So there may be something incorrect in the nav.xhtml file in the form of nesting in this problem epub that is causing the problem.
I'm just glad it wasn't something introduced with 1.9.30. Also that it wasn't related to Wayland, or Arch's zen kernel.
Is it just the fact that the entire nav is one line of code? Not that that explains why it only faults with the Arch Sigil package.
Not sure. I tried running Mend on the nav.xhtml and no more aborts. I tried running Mend and Prettify on the original nav.xhtml and no more aborts. I tried running Generate Table of Contents on that epub and no more aborts.
So whatever is causing the aborts has to do with something strange about the nav.xhtml file that regenerating it fixes.
Yes, I think the issue is that the nav.xhtml is just one big line. Still not sure why but I think the one huge line is interfering with extracting just the nav landmarks set properly. I have no idea why that is only happening with distro builds and not our own builds.
Strange.
Maybe some sort of reentrant limit that's only getting hit under certain optimization settings? It's definitely a weird one.
On Fri, Mar 17, 2023, 1:47 PM Kevin Hendricks @.***> wrote:
Yes, I think the issue is that the nav.xhtml is just one big line. Still not sure why but I think the one huge line is interfering with extracting just the nav landmarks set properly. I have no idea why that is only happening with distro builds and not our own builds.
Strange.
— Reply to this email directly, view it on GitHub https://github.com/Sigil-Ebook/Sigil/issues/711#issuecomment-1474196613, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACG3CXW6PLYGLCMFLRV2RYDW4SPTRANCNFSM6AAAAAAV6GCVMI . You are receiving this because you commented.Message ID: @.***>
Didn't we add some sort of performance fix that chopped monolithic xhtml into reasonable chunks that would perform better?
Actually, I think the bug is here and the issue is in optimized mode asserts are being ignored (they only fire in debug mode) but the new CXXFLAG on glibc asserts actually allows it to fire.
https://github.com/Sigil-Ebook/Sigil/blob/master/internal/gumbo/gumbo_edit.c#L170
The problem here is that we delete the last gumbo node and then want to replace it but when the nav is one single line there is no extra whitespace nodes that hide this test failure.
The most you can insert is one child so this assert should be changed to <= instead of just less than to be perfectly correct when no extra child whitespace nodes are around (the entire nav.xhtml is just a single line).
We should make that change for the next release, since most platforms only assert when in debug mode, we just do not see the misfiring assert.
So the original backtrace was correct.
I will push this fix to master and we can test it by adding this to our CXXFLAGS:
-Wp,-D_GLIBCXX_ASSERTIONS
and then trying this testcase which seems to be the only way to tickle this bug.
To test on macOS, I change the assert in gumbo_edit.c in line 170 to force a call to abort() if the test failed and sure enough, we can then see the abort on both our own Linux builds and macOS with this test case.
That assert is simply incorrect and I have fixed it in master and tested and we are now good to go.
We will NOT need a new release for MacOS and Windows as that bad assert will not fire in a release build.
As for Linux, we could create a Sigil-1.9.31 or just ask them to include a single commit or just wait until next release as the only way to tickle this bug is to have one single line nav.xhtml file with the landmarks section coming last. That should be pretty rare as this bug has existed in earlier releases.
I think we can close this one as "fixed in master".
I'm not going to worry overly about getting a new release out (or asking repos to cherry-pick). We have a workaround if it ever pops up again before the next release.
And agreed... "Fixed in Master" it is. Thanks for the bug report-it was a weird one!
Bug Description
sample: [苍穹爱丽丝]来自星空的铁翼天使-epub.zip
Platform (OS)
Linux
OS Version / Specifics
archlinux 6.2.6-zen
What version of Sigil are you using?
1.9.30
Any backtraces or crash reports