Sigil-Ebook / Sigil

Sigil is a multi-platform EPUB ebook editor
GNU General Public License v3.0

[Bug]: Sigil runs out of memory / takes too long when restructuring epubs with thousands of images #747

Closed KarlG1965 closed 5 months ago

KarlG1965 commented 6 months ago

Bug Description

When I try to restructure epubs using Sigil's built-in function, if the epub has a lot of images, Sigil either crashes or can take 30 to 40 minutes to complete.

Platform (OS)

Linux

OS Version / Specifics

Xubuntu 23.10

What version of Sigil are you using?

current from github

Any backtraces or crash reports

I can attach one of the epubs where I am having issues, but I'm not sure if this would be classed as piracy or not.

dougmassay commented 6 months ago

How many images are we talking here when you say 'a lot'?

KarlG1965 commented 6 months ago

In this particular instance, exactly 7336 images. All, apart from about 3, are png files about 3k in size. (It's a statistics book, hence the vast number of images of mathematical formulas)

dougmassay commented 6 months ago

Do all the images happen to be in one (or very few) xhtml file(s)? Does the epub behave sluggishly in Sigil before the attempt to restructure?
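One way to answer the first question without opening the book in Sigil is to count `<img>` tags per XHTML file directly from the epub's zip container. The following is only a rough sketch: the function name `count_img_tags` is made up for illustration, and a regular-expression count is an approximation (it does not parse the markup and ignores SVG `<image>` elements).

```python
import re
import zipfile
from collections import Counter

# Matches the start of an <img ...> tag; works on raw bytes so no decoding is needed.
IMG_TAG = re.compile(rb"<img\b", re.IGNORECASE)

def count_img_tags(epub_path):
    """Return a Counter mapping each (X)HTML file in the epub to its <img> tag count.

    epub_path may be a filesystem path or a binary file-like object.
    """
    counts = Counter()
    with zipfile.ZipFile(epub_path) as zf:
        for name in zf.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                counts[name] = len(IMG_TAG.findall(zf.read(name)))
    return counts

# Example use (the path is a placeholder):
# for name, n in count_img_tags("book.epub").most_common():
#     print(n, name)
```

Sorting with `most_common()` shows at a glance whether a handful of files carry most of the images.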

KarlG1965 commented 6 months ago

The epub is 48.7 MB in size. There are a total of 29 xhtml files, 20 of which are chapters (which I assume will be where most of the images are linked).

If I open the epub it reacts normally. I can open different chapters and scroll through them no problem. If I restructure the HTML it takes about 10 seconds.

Saving the epub takes about 50 seconds to 1 minute on my system (just with reformatted HTML, not restructured).

I notice that just about all linked images have inline CSS

e.g.

<img alt="$$\frac {1}{2}(yield{11}+yield{10})$$" src="../images/519209_1_En_12_Chapter/519209_1_En_12_Chapter_TeX_IEq294.png" style="width:8.68em"/>

Don't know if that's maybe causing issues as well.

dougmassay commented 6 months ago

The one minute save time is very excessive, but that's something entirely different.

That alt attribute string is pretty terrible. Are they actually trying to put a MathML formula in there? That's very ill-advised.

KarlG1965 commented 5 months ago

I don't know WHAT they are trying to do, to be honest! :-) The book is from Springer, and in general they are VERY badly coded.

This is one of the joys of Sigil. I can open the book and remove all the unnecessary garbage!

kevinhendricks commented 5 months ago

Could your machine be running out of memory? How much memory does it have?

Linux never seems to handle running out of memory in a reasonable way, normally willy-nilly killing processes left and right. Is your machine set to use virtual memory/swap? Lots of distribution installers never bother to create swap partitions (or swap files) and never properly use swapon.
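For anyone wanting to check this quickly on Linux: total RAM and swap are reported in `/proc/meminfo` as `MemTotal` and `SwapTotal` (in kB). Below is a small sketch that parses that text; the helper names `parse_meminfo` and `swap_summary` are illustrative only, and a `SwapTotal` of 0 is taken to mean that no swap is active.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value}.

    Values are the leading integer on each line (kB for most fields).
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info

def swap_summary(text):
    """Summarise total RAM and swap from /proc/meminfo-style text."""
    info = parse_meminfo(text)
    kib_per_gib = 1024 ** 2
    return {
        "mem_total_gib": info.get("MemTotal", 0) / kib_per_gib,
        "swap_total_gib": info.get("SwapTotal", 0) / kib_per_gib,
        "swap_enabled": info.get("SwapTotal", 0) > 0,
    }

# Example use on a Linux machine:
# with open("/proc/meminfo") as f:
#     print(swap_summary(f.read()))
```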

kevinhendricks commented 5 months ago

Will you please enable saving crash images and use gdb to get a backtrace from it (or alternatively run Sigil inside of gdb by tweaking the sigil launch script and generate a backtrace that way) and post it here?

There is also a Sigil plugin that will replace all the text of the epub with gibberish. You could try running it on a copy of that problem epub and then saving the result.

Then attach that gibberish version here by zipping it up and attaching it to this issue (or post a link to it for us to grab).

kevinhendricks commented 5 months ago

The plugin is called the Borkify Plugin and it is available in Sigil's Plugin Index on our Mobileread forum.

kevinhendricks commented 5 months ago

FWIW, even worse, that alt string is an attempt to use TeX/AsciiMath, not actual MathML.

kevinhendricks commented 5 months ago

Does your book have any JavaScript (.js) files or inline JavaScript being used?

If so, restructuring it to Sigil norms is very ill-advised. Sigil cannot update any links to resources inside a script, and most scripts are not designed to be relocated.

If so, that may be what is causing the crash. We may need to validate that no JavaScript exists in the epub before allowing Restructure to Sigil Norm to be run.

KarlG1965 commented 5 months ago

> Could your machine be running out of memory? How much memory does it have?
>
> Linux never seems to handle running out of memory in a reasonable way, normally willy-nilly killing processes left and right. Is your machine set to use virtual memory/swap? Lots of distribution installers never bother to create swap partitions (or swap files) and never properly use swapon.

My machine has 16GB of RAM. Not really something I considered, but I know what you mean about Linux and memory management.

> The plugin is called the Borkify Plugin and it is available in Sigil's Plugin Index on our Mobileread forum.

Just tried to install it, but it's giving me an error about missing Python 3.4 / 2.7. I need to deal with a few other things at the moment but I'll get back to this and figure out where it's setting the Python version and change it to 3.11 (which I'm using now).

kevinhendricks commented 5 months ago

No need to change anything in that plugin. Those are minimum required Python versions. Python 3.11 is fine.

Instead, have you installed the recommended Python modules Sigil needs to run properly?

Try running the latest version of the testplugin that will check for all the required pieces being available.

Grab it from here:

https://github.com/Sigil-Ebook/Sigil/tree/master/docs

You want testplugin_v020.zip

dougmassay commented 5 months ago

I also have some self-contained Python AppImages that have everything a Sigil plugin might require already included. You're welcome to use one if it simplifies things. They're a bit large (since they're self-contained), but they should work out of the box. Just download the one for the Qt version your Sigil uses (Qt5 or Qt6); put it somewhere safe; make sure it's executable; and use the Sigil plugin preferences dialog to select it as the Python interpreter to use for plugins.

https://github.com/dougmassay/appimage-sigil-python/releases/tag/2023.11.2-1

KarlG1965 commented 5 months ago

> No need to change anything in that plugin. Those are minimum required Python versions. Python 3.11 is fine.
>
> Instead, have you installed the recommended Python modules Sigil needs to run properly?
>
> Try running the latest version of the testplugin that will check for all the required pieces being available.
>
> Grab it from here:
>
> https://github.com/Sigil-Ebook/Sigil/tree/master/docs
>
> You want testplugin_v020.zip

It was even simpler than that! I didn't see the path variable at the top of the plugin window. The path was empty, which was why I was getting the 'no path to 3.4' error!

facepalm

KarlG1965 commented 5 months ago

> Does your book have any JavaScript (.js) files or inline JavaScript being used?
>
> If so, restructuring it to Sigil norms is very ill-advised. Sigil cannot update any links to resources inside a script, and most scripts are not designed to be relocated.
>
> If so, that may be what is causing the crash. We may need to validate that no JavaScript exists in the epub before allowing Restructure to Sigil Norm to be run.

No, no JavaScript

KarlG1965 commented 5 months ago

Looks like I'm even having issues uploading the split files. Any suggestions where I can upload this?

kevinhendricks commented 5 months ago

Unfortunately no. Most file sharing services are just excuses to place malware on your machine.

Are these 3 pieces all of it?

KarlG1965 commented 5 months ago

https://drive.proton.me /urls /279G4GQ4X8#dyOGojuypiF4

If you add these together, you can download from proton

kevinhendricks commented 5 months ago

Okay, I grabbed it from that link. I will test restructuring with it and get back to you.

KarlG1965 commented 5 months ago

Ok great, I'll remove the share then

kevinhendricks commented 5 months ago

Okay, my restructure is still running. It may eventually run out of memory on my machine too (I have 32 Gigs).

That said, this epub has 7330 PNG images, 4 JPEG images and 1 GIF image, which is absurd. They have images of each letter of the alphabet, sometimes repeated in different folders which represent different chapters.

So my guess is this started out as each chapter representing its own book and no one bothered to remove redundant images, they just threw the thing together. That does not even consider all of the images that represent more than one character or symbol (ie. an equation).
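The redundancy guess could be checked by hashing every image in the epub and grouping entries whose bytes are identical. This is just a sketch under assumptions: `find_duplicate_images` is a hypothetical helper, it only catches byte-for-byte duplicates (not visually identical images that were re-encoded), and the extension list covers only the image types mentioned above.

```python
import hashlib
import zipfile
from collections import defaultdict

IMAGE_EXTS = (".png", ".jpg", ".jpeg", ".gif")

def find_duplicate_images(epub_path):
    """Group image entries by the SHA-256 of their bytes.

    Returns a list of groups (each a sorted list of zip entry names)
    for every digest shared by more than one entry.
    epub_path may be a filesystem path or a binary file-like object.
    """
    by_digest = defaultdict(list)
    with zipfile.ZipFile(epub_path) as zf:
        for name in zf.namelist():
            if name.lower().endswith(IMAGE_EXTS):
                digest = hashlib.sha256(zf.read(name)).hexdigest()
                by_digest[digest].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]

# Example use (the path is a placeholder):
# for group in find_duplicate_images("book.epub"):
#     print(len(group), "copies:", group[0])
```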

I tried running a Reports Tool on it just to get some counts and it took forever to process, even though that uses up to 20 worker threads.

I have no idea what is taking all of the time so far so I will have to randomly interrupt it just to look at the most common backtraces.

kevinhendricks commented 5 months ago

And after each move we relaunch python3lib code to clean up any issues with the OPF being rebuilt. At one point it actually starts recursing, and that is what makes it run out of memory.

This one will be a doozy to fix and we will need some type of block "move" code that waits until all of the moves are done and then updates the opf. But that means that at some point the opf manifest will be incorrect (not match reality) and that will cause a big problem.

So not a crash, but a recursive out-of-memory state caused by the extreme number of image files (over 7330 of them), which caused the OPF to need to be updated 7330 times, which in turn caused all of the xhtml files to be updated 7330 times in order to search for and update any links.

This is not something we can address easily.

So I recommend not using Restructure to Sigil Norm on this particular epub until a more efficient kind of bulk update process can be figured out.

KarlG1965 commented 5 months ago

As a workaround, I tried to export all the images and then remove them with the idea of doing the restructure and then adding the images back, but Sigil crashed on me whilst deleting all the images.

It certainly sounds as if it will be a lot of work, but I'm glad you know about the issue now.

Regarding saving after reformatting the html, I think you mentioned that that shouldn't take as long either.

Maybe a separate issue?

kevinhendricks commented 5 months ago

Crashed, or again just ran out of memory? I am able to delete all 7330 images in one go.

KarlG1965 commented 5 months ago

Not sure, tbh. Just know it didn't work, then I gave up :-)

BeckyDTP commented 5 months ago

Perhaps I'll add my two cents. I downloaded the sample file and no crash occurred for me. But I did "Mend and Prettify" immediately after opening, and only then did I run "Restructure Epub to Sigil Norm". Sigil stopped responding ("Not Responding" message in the title bar), but I left it alone and after a while the job was done.

My system: Windows 10 Pro, 32GB of memory, i7-11700 processor.

kevinhendricks commented 5 months ago

If I wait long enough it finishes on mine too. The problem is that the 7330 images are moved one by one, which means we edit the OPF 7330 times. Each time, it launches a Python instance to parse and check the entire OPF, and then it must parse the OPF again to change a single line in the manifest and restructure it. Rinse and repeat.

So creating a bulk update to the opf makes more sense and should make it much more efficient.
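To make the cost difference concrete, here is a toy model of the two strategies. It is not Sigil's code: `ToyManifest`, `move_one`, `move_bulk` and `compare` are invented for illustration, the "parse" is just rebuilding a lookup table, and the model ignores the Python process launches and XHTML link updates described above. It only counts how many full passes over the manifest each approach needs.

```python
class ToyManifest:
    """Stand-in for an OPF manifest; counts how often the whole list is re-read."""

    def __init__(self, hrefs):
        self.hrefs = list(hrefs)
        self.parse_count = 0

    def _reparse(self):
        # Hypothetical placeholder for the expensive parse-and-check pass.
        self.parse_count += 1
        return {href: pos for pos, href in enumerate(self.hrefs)}

    def move_one(self, old, new):
        # One full pass per moved file.
        index = self._reparse()
        self.hrefs[index[old]] = new

    def move_bulk(self, moves):
        # One full pass, then every rename is applied against that single index.
        index = self._reparse()
        for old, new in moves.items():
            self.hrefs[index[old]] = new


def compare(n):
    """Move n images both ways; return (one-by-one passes, bulk passes, same result?)."""
    old = [f"images/ch{i % 20}/eq{i}.png" for i in range(n)]
    moves = {href: "Images/" + href.rsplit("/", 1)[1] for href in old}

    one_by_one = ToyManifest(old)
    for src, dst in moves.items():
        one_by_one.move_one(src, dst)

    bulk = ToyManifest(old)
    bulk.move_bulk(moves)
    return one_by_one.parse_count, bulk.parse_count, one_by_one.hrefs == bulk.hrefs
```

In this model, moving n files one by one costs n full passes over an n-entry manifest (quadratic work overall), while the bulk version reaches the same final manifest with a single pass.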

kevinhendricks commented 5 months ago

Okay, I have been working all evening on a Bulk Resources Update for the OPF and I have something working that reduces the time for a Restructure To Sigil Norm on that book to be about 40 seconds or less.

I then checked the time to delete all selected images (all 7330 of them) and it again suffered from one-by-one deletion causing repeated updates to the OPF. Luckily a BulkRemove for resources already existed and is used to speed up merging lots of files. I was able to change Sigil to use it any time more than 50 files are being deleted. That sped it up to about 30 seconds to select all 7330 images and then delete them (with a much lower memory footprint to boot).
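The threshold idea can be sketched as a simple dispatch. This is illustrative Python, not Sigil's actual C++ implementation: `delete_resources`, `remove_one` and `remove_bulk` are hypothetical names, and only the cut-off of "more than 50 files" comes from the description above.

```python
BULK_THRESHOLD = 50  # per the description: bulk path when more than 50 files are deleted

def delete_resources(paths, remove_one, remove_bulk, threshold=BULK_THRESHOLD):
    """Delete resources, switching to a single bulk call for large selections.

    remove_one(path) and remove_bulk(list_of_paths) are caller-supplied
    callables (hypothetical stand-ins for the real removal routines).
    Returns which strategy was used.
    """
    paths = list(paths)
    if len(paths) > threshold:
        remove_bulk(paths)  # one manifest update for the whole batch
        return "bulk"
    for path in paths:
        remove_one(path)    # one manifest update per file
    return "one-by-one"
```

Keeping the per-file path for small selections leaves the common case unchanged, while large selections pay for only one manifest update.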

I then checked the time to save the epub, and it does not take a minute on my machine at all. But I have all ssd drives and a fast machine. I do not think there is any viable way to copy almost 7500 files and over 48 meg (compressed) including compressing them all in any less time.

I will wait until things settle down and we are sure a follow-up 2.1.1 is not needed, and then I will commit the changes to speed things up (and greatly reduce memory consumption) for both Restructure to Sigil Norm and for deleting thousands of files at a time.

It is an interesting test case to be sure.

kevinhendricks commented 5 months ago

Okay, I did some polishing this morning and timed a full restructure to sigil norm at 12 seconds. Then timed a full save-as and it took 6 seconds.

Similar improvement in deleting thousands of images at a time. I consider this one done.

I will keep this open to remind me to push this to master when we are sure this new release is complete.

KarlG1965 commented 5 months ago

> Okay, I did some polishing this morning and timed a full restructure to sigil norm at 12 seconds. Then timed a full save-as and it took 6 seconds.
>
> Similar improvement in deleting thousands of images at a time. I consider this one done.
>
> I will keep this open to remind me to push this to master when we are sure this new release is complete.

12 seconds? SOO LONG?? :-p

That sounds fantastic. Really looking forward to trying this out.

Thanks Kevin :-)

kevinhendricks commented 5 months ago

Since Sigil 2.1.0 appears quite stable with no showstoppers, we are reopening the tree for the next release. I have just now pushed fixes for your issue to Sigil master.

So I am now closing this as fixed.

Thank you for your bug report.