edrlab / thorium-reader

A cross platform desktop reading app, based on the Readium Desktop toolkit
https://www.edrlab.org/software/thorium-reader/
BSD 3-Clause "New" or "Revised" License

Missing option to opt-out of bookshelf feature #1119

Closed PhilippeBruno closed 2 years ago

PhilippeBruno commented 4 years ago

For those who manage their eBooks themselves in a dedicated folder structure on their computer or in the cloud, there should be an option to opt out of the bookshelf feature. As I opened a large number of ePubs to test the application, I saw the available space on my hard disk decrease considerably. When I inspected the "C:\Users\%username%\AppData\Roaming\EDRLab.ThoriumReader" folder, I realised it had nearly reached 1 GB! Fair enough, as I deleted the ePubs from the bookshelf, the size of the above folder decreased, but no one could seriously keep deleting every opened ePub when done reading it!

Why is it that so many ePub readers want to manage ebooks in users' stead? I think it all started with Calibre, but this is not the way to go. I believe it has to be optional.

danielweck commented 4 years ago

Is this with the official Windows Store release, or the official v1.3 installer, or the automated build from the develop branch?

Irrespective of the answer to the above question, could you please check the size of the following folders within the root Thorium folder:

1) publications
2) db
3) db-dev
4) db-dev-sqlite

The reason I am asking is because we identified a bug with an older version of Thorium where the size of the db-dev folder was growing indefinitely. We fixed this by using a different database backend for the automated builds, and for the local developer builds. This “out of control” database size problem does not occur with the official releases, as far as we know.
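One way to collect these numbers is a short script that sums file sizes under each subfolder. This is a minimal sketch in Python, not part of Thorium: the helper names `dir_size` and `folder_sizes` are hypothetical, and the root path is assumed from the Windows location quoted earlier in this thread.

```python
import os
from pathlib import Path
from typing import Dict, List, Optional


def dir_size(path: Path) -> int:
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            file_path = Path(dirpath) / name
            if file_path.is_file():  # skips broken symlinks
                total += file_path.stat().st_size
    return total


def folder_sizes(root: Path, names: List[str]) -> Dict[str, Optional[int]]:
    """Map each subfolder name to its size in bytes, or None if it is absent."""
    sizes: Dict[str, Optional[int]] = {}
    for name in names:
        sub = root / name
        sizes[name] = dir_size(sub) if sub.is_dir() else None
    return sizes


if __name__ == "__main__":
    # Assumed location, based on the path quoted earlier in this thread.
    root = Path(os.path.expandvars(r"%APPDATA%\EDRLab.ThoriumReader"))
    wanted = ["publications", "db", "db-dev", "db-dev-sqlite"]
    for name, size in folder_sizes(root, wanted).items():
        print(name, "missing" if size is None else f"{size:,} bytes")
```

Folders that do not exist are reported as missing rather than as zero bytes, which helps tell a production install apart from a development build.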

Thank you!

PhilippeBruno commented 4 years ago

Wow, I feel I am part of the family now with all this quick back-and-forth! ;)

Ok, basically, I had the Thorium from the Windows Store installed for a while, but several months ago, I ditched it because it was lacking the search feature (I am doing my Ph.D. and this is an absolute must for me). This morning, I decided to give a try to what appeared to be the latest version available (Thorium_Setup_1.3.1-alpha.1.138.exe) from https://github.com/edrlab/thorium-reader/releases/tag/latest-windows and uploaded by you less than a day ago.

I will check the size of the folders as soon as I have a minute and report back to you.

danielweck commented 4 years ago

> uploaded by you less than a day ago

Just a clarification: successful build artefacts are automatically uploaded by continuous integration servers (Travis and Appveyor), whenever there is a commit/push to the develop branch. As it happens I recently updated code in this Git branch, thus the new build :)

PhilippeBruno commented 4 years ago

Hi Daniel,

Oddly enough, I could not find the folders publications, db, or db-dev on my computer. I only found the db-dev-sqlite and publications-dev folders (could this be related to the fact that I am using Thorium_Setup_1.3.1-alpha.1.138.exe?).

Anyway, here are the sizes of

Keep in mind that I keep deleting ebooks from the main screen of Thorium as I simply do not want Thorium to manage ebooks for me.

Also, even with all the ebooks deleted from Thorium, the entire EDRLab.ThoriumReader folder is still 56 MB (80 files, 28 folders).

Going back to my original question, could someone explain to me the rationale behind the concept of "managing" ebooks on behalf of the users? I never liked Calibre precisely for that reason and also because it kept modifying the EPUBs by adding some lines to content.opf, a cover of its own and bookmarks among other things. When I buy an EPUB or a PDF, I expect it to stay unchanged in its original state.

Anyway, I am very curious about this design principle (managing the ebooks on behalf of the users, that is) and would be immensely grateful to anyone who would care to explain to me the rationale behind this.

Finally, is there anything other than a philosophical perspective that would prevent Thorium from offering the option of working as either an integrated solution or simply as a reader for those like me who prefer to manage their ebooks themselves?

danielweck commented 4 years ago

Yes, publications-dev is indeed the correct folder when db-dev-sqlite is in use. Otherwise (i.e. production builds), the folder names are publications and db.

The SQLite3 database is indeed larger than we would like, but it is not unusual for DB backends to preserve some historical information about DB transactions in order to restore a previous valid state in case of fatal error / data corruption. Thorium actually instructs the database component to clear its backup copies, and I personally checked the SQLite3 records to make sure that the data does indeed not grow indefinitely. However I did notice some residual data, so there is still some overhead despite flushing snapshots when Thorium closes.

danielweck commented 4 years ago

Thorium, like several other ebook readers / reading systems (e.g. MacOS Books.app / iBooks) stores imported publications in its own filesystem space, as this provides a guarantee that per-publication state / external metadata (e.g. bookmarks, settings, annotations, DRM status, etc.) can be reliably attached.

Conversely, the Readium2 “test app” is a basic reader-only app (i.e. no library / bookshelf) which provides rudimentary support for ebook features (bookmarks, highlights), with a crude user interface designed for testing the SDK. You may give it a try using the automated builds from GitHub’s releases page: https://github.com/readium/r2-testapp-js

danielweck commented 4 years ago

...and just to be clear: Thorium does not alter the contents of imported publications, except for the very particular case of updated DRM licenses (LCP).

PhilippeBruno commented 4 years ago

Hi Daniel,

Thanks for your detailed replies. I really appreciate it.

> ...and just to be clear: Thorium does not alter the contents of imported publications, except for the very particular case of updated DRM licenses (LCP).

One of the first things I verified with Thorium was making sure it did not modify files by doing byte level comparisons between the original and the stored files. Thanks for confirming.
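A byte-level check of this kind can be scripted. The following is a minimal Python sketch, assuming you know the path of the original file and of the copy inside Thorium's folder; the function name `same_bytes` is hypothetical.

```python
import filecmp


def same_bytes(original: str, stored: str) -> bool:
    """Return True if both files have identical contents, byte for byte."""
    # shallow=False forces a content comparison instead of trusting
    # os.stat() signatures (file type, size, modification time).
    return filecmp.cmp(original, stored, shallow=False)
```

With the default `shallow=True`, two files with matching size and modification time would be reported equal without their contents being read, which is why the flag is set explicitly here.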

PhilippeBruno commented 4 years ago

> Conversely, the Readium2 “test app” is a basic reader-only app (i.e. no library / bookshelf) which provides rudimentary support for ebook features (bookmarks, highlights), with a crude user interface designed for testing the SDK. You may give it a try using the automated builds from GitHub’s releases page: ...

Wow, that's not bad at all for a crude GUI! I kind of like it! Lacking some basic features, but seriously, it is not bad at all. Thanks for sharing this.

PhilippeBruno commented 4 years ago

@danielweck I did some more tests with Thorium (not the Readium2 test app) and I found that the database keeps growing even after I delete all the EPUBs in Thorium.

Initial size of C:\Users\%username%\AppData\Roaming\EDRLab.ThoriumReader folder prior to launching this test with 0 EPUB in Thorium database: 56.0 MB (58,823,312 bytes)

Size of C:\Users\%username%\AppData\Roaming\EDRLab.ThoriumReader folder after opening 10 different EPUBs (combined size of 122 MB or 128,732,932 bytes) with Thorium: 182 MB (190,920,062 bytes)

We can see that Thorium adds an overhead of (190,920,062 - 58,823,312) - 128,732,932 = 3,363,818 bytes or roughly 2.6%.

Size of C:\Users\%username%\AppData\Roaming\EDRLab.ThoriumReader folder after launching this test with all EPUBs deleted from Thorium database: 56.1 MB (58,868,976 bytes)

We can see that there are some residual data left in the database after the deletion in the order of 58,868,976 - 58,823,312 = 45,664 bytes.
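The two figures above can be double-checked with a few lines of Python. This is only a restatement of the arithmetic, using the byte counts reported in this comment; the variable names are illustrative.

```python
# Byte counts reported in the measurements above.
initial_folder = 58_823_312      # folder size with 0 EPUBs imported
after_opening = 190_920_062      # folder size after opening 10 EPUBs
epubs_combined = 128_732_932     # combined size of the 10 EPUBs
after_deleting = 58_868_976      # folder size after deleting all EPUBs

# Extra space used beyond the EPUB files themselves.
overhead = (after_opening - initial_folder) - epubs_combined
overhead_percent = 100 * overhead / epubs_combined

# Space not reclaimed after deletion.
residual = after_deleting - initial_folder

print(f"overhead: {overhead:,} bytes ({overhead_percent:.1f}%)")
print(f"residual: {residual:,} bytes")
```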

I then repeated the experiment 2 more times and got odd results for the third round:

[image: results of the three test rounds, not preserved]

A residual data size of 40 to 45 kB (rounds 1 and 2), although not correct, is not critical. A residual data size of 3.5 MB (round 3) is alarming as this could be the size of an EPUB.

One would expect the size of EDRLab.ThoriumReader folder to return to its initial size after deleting all the EPUBs consulted during a work session.

Given the fact that I easily open and close one hundred EPUBs a day for my thesis, we could be looking at considerable wasted space after a little while. Moreover, with SSDs becoming more prevalent, this also adds considerable wear and tear to the drive.

Whilst some people will be perfectly content with Thorium managing the EPUBs for them with all the very valid hereinbefore mentioned explanations, some people like me who need to sort and store EPUBs in various folders on their hard drive, along with PDFs as they pertain to different topics of a research, need a way to use Thorium as a reader only. In my view, the simplest way to implement this is to disable automatic import of EPUBs when double clicking one in File Explorer (or at least give the option of turning that off). This way, Thorium could manage imported books for the vast majority of people who need that feature and allow researchers and other users who need to use Thorium as a reader only to enjoy beautiful rendering of EPUBs without the trouble of deleting EPUBs from Thorium regularly and the unnecessary wear on their SSDs.

danielweck commented 4 years ago

Thank you for your thorough analysis :)

The SQLite3 database backend for PouchDB is not directly interacted with in Thorium. Instead, we invoke DB functionality via a layer of abstraction. Notably, we call compact() when the application exits, which is supposed to remove unnecessary “backups” in the DB records:

https://github.com/edrlab/thorium-reader/blob/952d899955d7e4450ef63f9286bff07fc91081b2/src/main/di.ts#L160-L174

I checked the actual contents of the SQLite files in db-dev-sqlite (using a GUI utility) and I could indeed confirm the data removal, but I was also disappointed to observe remains of history-related information, which probably accounts for the DB growth over time.

PS: the LevelDown DB backend is used instead of SQLite3 for production builds (i.e. development-mode npm run start and the official Thorium releases). I didn’t check the actual contents, but what is notable about LevelDown is that several DB files are created (unlike SQLite which records data into a fixed number of files)

PhilippeBruno commented 4 years ago

@danielweck

> Thorium, like several other ebook readers / reading systems (e.g. MacOS Books.app / iBooks) stores imported publications in its own filesystem space, as this provides a guarantee that per-publication state / external metadata (e.g. bookmarks, settings, annotations, DRM status, etc.) can be reliably attached.

I have been giving this some further thought, reflecting on your input to my suggestion of allowing users to opt out of the bookshelf feature, and I came to the conclusion that Thorium should have two modes of operation built in, à la Adobe Digital Editions (ADE). Yes, I fully agree with you regarding the need to guarantee publication integrity and all, in a way somewhat similar to what ADE is doing with DRM EPUBs, annotations, etc. However, instead of automatically importing an EPUB when one is double-clicked, Thorium should simply open it, provided there is no DRM prohibiting this operation, without of course the additional benefits of bookmarking, annotating, etc., just like what ADE is doing. When closing the application, a prompt could offer the user a chance to import the book or not. Whether this prompt is displayed should be a user-configurable option.

Don't get me wrong, I am not suggesting turning Thorium into a glorified bug-free ADE. I am merely suggesting including some useful features of one product in Thorium. Figuratively speaking, just because petrol cars had a radio and a heater did not mean electric cars could not share the same useful basic features!

Lastly, I conducted a very informal and rather limited survey last night, talking to some fellow graduate students regarding scientific literature in the form of electronic books (scientific articles are still almost exclusively in PDF format): the fixed layout EPUB format is a real nightmare. Traditionally, mostly novels were offered in EPUB format whilst scientific and academic books were in PDF format. With the advent of the fixed layout EPUB format, a growing number of publishers are turning to EPUB 3 for books other than novels, but the lack of reliable readers in the Windows world is causing some commotion.

For as long as Microsoft sold EPUBs, the giant offered a very capable EPUB reader in the form of Edge. Unfortunately, Microsoft stopped selling EPUBs, and when they decided to switch the engine of the new iteration of their Edge browser to Chromium, they dropped support for EPUBs. With no more EPUBs to sell, Microsoft had no incentive to implement the EPUB reader feature in the new Edge.

Some of us have switched to Apple for the very reason that it supports fixed layout EPUBs natively. However, many students cannot afford switching to Apple in the middle of graduate studies, and others simply prefer to stick to Windows for other reasons. One guy was telling me last night that he has both a Mac and a Windows machine: he reads his EPUBs on his Mac but writes his thesis on Windows, because his reference manager software (Endnote) does not work reliably on his Mac and support is practically unavailable for that platform. I, myself, tried dozens of EPUB readers when Edge stopped supporting EPUBs. Believe it or not, for fixed layout, I had no choice but to use Lithium on my Android tablet to open the growing number of EPUB 3 ebooks I needed to consult, because on my Surface Pro I was out of acceptable options until I revisited Thorium. Like I mentioned in another thread, I had tried an early version of Thorium but the lack of a search feature was a turn-off for me.

In my opinion, Thorium is the best option for reliably reading EPUBs on Windows computers, provided much needed features like "search" are implemented soon. There is a need to manage DRM, annotations, bookmarks, etc., but there is also a desperate need for simply quickly opening EPUBs without importing them into a database... and this is not luxury for all those graduate students out there.

danielweck commented 4 years ago

We received another feature request (via private email) to load EPUBs directly in the reader window, not adding them in the library window / local bookshelf.

PhilippeBruno commented 4 years ago

This is indeed probably the single biggest pain with Thorium right now. If I could upvote that feature request I would, so consider this message as an "upvote" ;)

danielweck commented 4 years ago

Very useful for temporary "proof reading" / testing of ebooks during the production workflow. A little hard to implement due to architectural constraints (as mentioned in previous comments), but we can probably figure something out (I am thinking about storing the EPUB in the database as usual, but flagged as "temporary", invisible in the bookshelf, and resulting in the publication being automatically removed from the DB once the application closes).
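The "temporary" flag idea can be illustrated with a small in-memory model. This is a hedged sketch only: `Publication`, `Library`, and every method name are hypothetical and do not reflect Thorium's actual data model or database layer.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Publication:
    identifier: str
    title: str
    temporary: bool = False  # hypothetical flag, as proposed above


@dataclass
class Library:
    """In-memory stand-in for the publication database."""

    publications: Dict[str, Publication] = field(default_factory=dict)

    def import_publication(
        self, identifier: str, title: str, temporary: bool = False
    ) -> Publication:
        # Stored as usual, so the reader window can rely on the normal
        # import path; only the flag differs.
        pub = Publication(identifier, title, temporary)
        self.publications[identifier] = pub
        return pub

    def bookshelf(self) -> List[Publication]:
        """Publications shown in the library window (temporary ones hidden)."""
        return [p for p in self.publications.values() if not p.temporary]

    def purge_temporary(self) -> List[str]:
        """Meant to run when the application closes: drop temporary entries."""
        removed = [k for k, p in self.publications.items() if p.temporary]
        for key in removed:
            del self.publications[key]
        return removed
```

The appeal of this design is that the import path stays unchanged; the cost is that the file is still copied for the duration of the session.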

danielweck commented 4 years ago

@panaC what do you think about my "temporary" storage idea (comment above)?

PhilippeBruno commented 4 years ago

@danielweck That's pretty much the workaround I currently implemented using AutoHotkey. When I close the application, I send keystrokes to delete the book in the bookshelf so that it does not get cluttered.

Also, to add to the use cases or justifications you mentioned above, a growing number of documents is distributed in EPUB format. Let's say I want to keep the received EPUB in a folder on my computer along with other documents (just like I can freely do with PDFs), why would Thorium take the liberty of making a copy of that document in its own database?

danielweck commented 4 years ago

> Also, to add to the use cases or justifications you mentioned above, a growing number of documents is distributed in EPUB format. Let's say I want to keep the received EPUB in a folder on my computer along with other documents (just like I can freely do with PDFs), why would Thorium take the liberty of making a copy of that document in its own database?

Many (most?) reading systems store documents/publications in their own "database" (the term is used loosely to reference a persistence method for the publications artefacts themselves, not for the associated settings). I believe the primary reason is that the user's filesystem is not a reliable location (i.e. potentially unstable, files get moved around, network shares get disconnected), for the purpose of creating robust associations between publications and their bookmarks/annotations/DRM status/user settings/etc. Publication identifiers / ISBNs / etc. are unfortunately not reliable either (especially with in-progress publications during the production workflow, but even sometimes with public domain ebooks). This is why Thorium computes CRC checksums to create reliable identity markers in the database.

However I totally agree that the use case consisting in "temporarily" reading a publication is very much a valid one. In fact, in the r2-testapp-js there is no library/bookshelf window, only a reader view (leaner app => faster launch). Settings are stored in a map keyed by filesystem location, so if publications move / disappear, the settings become orphaned. But that's an acceptable tradeoff for this "test app".
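To make the tradeoff concrete, here is a minimal Python sketch of a settings map keyed by filesystem location. It is an illustration under assumptions, not the r2-testapp-js implementation: the class `PathKeyedSettings` and its methods are hypothetical.

```python
class PathKeyedSettings:
    """Per-publication settings stored under the file's path."""

    def __init__(self):
        self._by_path = {}

    def save(self, path, settings):
        self._by_path[path] = dict(settings)

    def load(self, path):
        # Returns None when nothing was ever saved under this exact path.
        return self._by_path.get(path)

    def orphaned(self, existing_paths):
        """Keys whose file is no longer at the recorded location."""
        return sorted(set(self._by_path) - set(existing_paths))


store = PathKeyedSettings()
store.save("/books/thesis.epub", {"font_size": 18})

# The user moves the file: the new path has no settings,
# and the old entry no longer points at anything.
moved_to = "/archive/thesis.epub"
print(store.load(moved_to))
print(store.orphaned([moved_to]))
```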

In terms of UX, perhaps drag and drop with a modifier key (like keyboard SHIFT) could inform Thorium that the user wishes not to permanently store the imported publication(s), and that they should not appear in the bookshelf (only load the reader window). This "transient" mode may be confusing to some users, but this would be an opt-in feature for advanced usage anyway. Besides drag and drop, there is also "open with" or double-click from the file finder / explorer. And of course the built-in file chooser from the Thorium UI. Thoughts?

SimonPRH commented 4 years ago

Hi guys, I'm the one who sent the 'private email' on this, just thought I'd weigh in here! I am the Ebook Tech Coordinator for Penguin Random House in the UK, and one of the aspects that I look after is tools for our teams. I very much want to recommend Thorium be rolled out to everyone here, and it's just a few small features away from that (sent these to Laurent including this one).

To Daniel's question, the most straightforward model for us would be a dedicated option (with persistence) in the Thorium UI to switch on Transient Mode. The shift-modifier is, I feel, a bit too confusing for some of our users, who are not really 'advanced', they are just people working in Production etc who usually deal with print books and are not always very technical.

I note Philippe's comment regarding database size growth. Whilst an element of that was clearly just a bug, in general this is definitely a factor for us, as we have the largest catalogue in the publishing industry and thus have a huge throughput of files every day. You can imagine just how quickly a persistent library and associated directory grows. This is something we have to deal with in Apple Books which is a small but persistent irritation.

I note some of the discussion of how this would work relates to bookmarks and notes, but it would be perfectly reasonable for none of that to be available in this mode, as we really don't require it for proofing.

Sorry for the length of this post, but I am honoured to contribute here and am a huge fan of what you guys are doing!

Many thanks.

PhilippeBruno commented 4 years ago

Hi Simon, if your arguments don't convince the Thorium team on the necessity of that feature, nothing will.

@danielweck Instead of the "shift" acrobatics, how about an EPUB gets included in Thorium's database ONLY if the user annotates or adds bookmarks to it? Because quite frankly, there is no need to even waste resources by copying an EPUB if there are no annotations or bookmarks.

danielweck commented 4 years ago

Thank you for your feedback and suggestions, Simon. Much appreciated. A persistent switch in Thorium's settings may indeed work for advanced users who only use Thorium for proof reading. I think that the "shift modifier" (or other method) can be useful to negate/toggle the default user-configured setting as well. For example, an end user can be mainly a proof-reader, whilst occasionally wanting to preserve their own personal bookshelf.
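Combining a persistent setting with a modifier that negates it amounts to an exclusive-or of two booleans. A minimal Python sketch, with a hypothetical function name and parameters:

```python
def is_transient(default_transient: bool, modifier_held: bool) -> bool:
    """Decide whether a publication is opened without being kept.

    default_transient: the user's persistent setting.
    modifier_held: whether the toggle key (e.g. SHIFT) was held on open.
    Holding the modifier flips whatever the setting says.
    """
    return default_transient != modifier_held
```

A proof-reader with the setting switched on would hold the modifier to keep a book in the bookshelf, while a user with the setting switched off would hold it to open a book transiently.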

SimonPRH commented 4 years ago

That's even better! I didn't want to suggest anything that would be more UI work for you guys, haha. If it had both this would be a really universal tool.

PhilippeBruno commented 3 years ago

Hi Daniel, any progress on this much needed feature of NOT automatically storing EPUBs in Thorium's own repository? Lately, I have been searching through a lot of different EPUBs for my thesis and I keep deleting between 100 MB and 200 MB of EPUBs from Thorium every day. You cannot imagine how frustrating it is becoming for someone who simply opens an EPUB, searches for some keywords, reads a few paragraphs or chapters, closes the EPUB and deletes it from the Downloads folder on the computer only to remember it is stored in Thorium and that at the end of the day, all those "transient" EPUBs will need to be deleted to prevent infobesity!

If you calculate a CRC for each EPUB we open, why don't you simply store the annotations in the DB along with the CRC as a key? When the user opens the EPUB again, regardless of where it is now saved on the computer, look up that key in the DB and load the annotations if a match is found. KISS (keep it stupid simple).
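The lookup described above could be sketched as follows in Python. This is an illustration under assumptions, not how Thorium computes or stores identity: `file_crc32` and `AnnotationStore` are hypothetical names, and the store is a plain in-memory dictionary. CRC-32 is used to match the suggestion; it is only 32 bits wide, so two different files can collide, and a longer digest would reduce that risk.

```python
import zlib


def file_crc32(path, chunk_size=65536):
    """Compute the CRC-32 of a file's contents, reading it in chunks."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc


class AnnotationStore:
    """Annotations keyed by content checksum instead of file location."""

    def __init__(self):
        self._by_checksum = {}

    def add(self, epub_path, note):
        key = file_crc32(epub_path)
        self._by_checksum.setdefault(key, []).append(note)

    def load(self, epub_path):
        # The path is only used to read the bytes; the key is the checksum,
        # so a moved or renamed file still finds its annotations.
        return list(self._by_checksum.get(file_crc32(epub_path), []))
```

Note that any change to the file's bytes, such as an updated DRM license, would change the checksum and detach the annotations.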

I think I will start a protest movement running down all the major cities of the world with banners that read: "Save the planet! Don't copy EPUBs in Thorium DB wasting valuable space on SSDs! Just save the annotations!" ;)

That model of copying EPUBs to a local repository is an archaic concept; think outside the box and come up with something lighter, faster and more efficient like my idea of only saving the annotations and using the CRC as a key.

PhilippeBruno commented 3 years ago

Another idea could be to simply store the annotations as an attachment to the EPUB (perhaps a new amended EPUB format could be proposed with the addition of an XML file containing annotations, a bit like some programs add a layer of annotations to PDFs). I have no statistics on hand, but I presume that although annotating EPUBs is a nice-to-have feature, probably only a small percentage of users use it, on a small percentage of their EPUBs. And yet, those annotations are not even portable or shareable with other people. Anyway, my point is that storing those EPUBs in Thorium's repository is a lot of effort for something that is probably not used that much...

Saving the annotations as an attachment to the EPUB would probably prove to be more useful as people would be able to share annotations. Take me as an example: I would be able to send an EPUB with the circled paragraphs and notes to my thesis director for her review... This is something we have been doing with PDFs forever... why not with EPUBs?

danielweck commented 2 years ago

This is a really worthy "discussion" but not yet ready for an actionable issue (development task). Moving...