Several improvements around handling CSS and others.

miguelitoelgrande commented 5 years ago

This is my fork, which provides the following improvements (mainly in the extractHtml.js logic):

support for additional CSS attributes (display, colspan, border-collapse,...)
support for fancy article headers {background-image, position, z-index, background-*}
retain original image filenames where applicable (make editing of epub easier)
better naming for resulting CSS rules (make editing of epub easier)
"Dedup" of CSS rules - only store relevant changes compared to parent elements (smaller CSS files)
additional tags and attributes
some bug fixes (e.g. syntax highlighting in pre and code environments...)

Related:

leverage the ModHeader extension to ensure that images in the ePub are in accepted formats, not WebP
pushed "page cleanups" via CSS and JavaScript to userscripts for Tampermonkey
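
To illustrate the "retain original image filenames" item above, here is a minimal sketch of how a file name could be derived from an image URL. The helper name `imageFileName`, the fallback naming scheme and the placeholder base URL are assumptions made for illustration only; this is not the code from the fork.

```javascript
// Hypothetical helper: derive a file name for an image stored inside the EPUB.
// `index` is only used for the fallback name; `baseUrl` resolves relative URLs.
function imageFileName(src, index, baseUrl = 'https://example.invalid/') {
  let name = '';
  try {
    // The pathname excludes the query string and the fragment.
    const pathname = new URL(src, baseUrl).pathname;
    name = decodeURIComponent(pathname.split('/').pop() || '');
  } catch (err) {
    // Malformed URL or malformed percent-encoding: use the fallback below.
    name = '';
  }
  // Keep only characters that are safe inside a zip-based archive.
  name = name.replace(/[^A-Za-z0-9._-]/g, '_');
  // Without a recognisable extension the original name is not worth keeping.
  if (!/\.[A-Za-z0-9]+$/.test(name)) {
    return `img-${index}`;
  }
  return name;
}
```

In this sketch the fallback name has no extension; a caller would have to add one based on the image's MIME type.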

Thank you for your great work. save-as-ebook is really a great extension.

alexadam commented 5 years ago

Wow, you made a lot of changes. I need a few days to review them... I think I'll allocate my next week to work on this project because there are a lot of issues waiting to be solved. Thanks for your help, I might come back with questions about your changes.

miguelitoelgrande commented 5 years ago

Hi Adam, anytime. Your extension is a great help in reading tech stuff on the Kindle instead of printing etc. My changes focus on producing more accurate output and helping to edit the resulting epub afterwards. The most relevant changes are in the extractHtml.js. The userscripts do a great job in preprocessing.

Looking forward to hearing from you, Michael

alexadam commented 5 years ago

Sorry for the late reply. I’ll address your changes in 2 parts: first the additional css/js files, and then the extractHtml.js changes.

There is a problem with cleaning the page before generating the ebook. I don’t like the current feature of inserting custom CSS to remove unwanted elements… It’s not user friendly because of the UI/UX and because a lot of people don’t know CSS. I’m thinking about removing it and finding a better solution.

At the time I was working on it, there was a bug in FF: you couldn’t access the reader mode from a web extension (https://bugzilla.mozilla.org/show_bug.cgi?id=1286387). It doesn’t look like they fixed it, but maybe there is a workaround. I’ll investigate more.

I’ve never used Tampermonkey but it seems you have to add a script for each page you want to clean, and… I don’t want to add thousands of scripts, for every possible site that can be saved as ebook, to the main repo. Those scripts should be stored locally, if you need them.

I think that a ‘save as ebook’ app should do (only) what it says: save as ebook. I would remove everything related to ‘cleaning’ a web page, because cleaning is a non-mandatory, unrelated preliminary step. And mixing them causes a lot of problems because you cannot make everybody happy.

There should be something like what uBlock is for ads: a universal ‘reader mode’ extension, with a database of scripts and styles for as many sites as possible. So you don’t have to maintain anything or write code or waste time trying to identify which elements should be removed. I’ll take a look to see what’s available…

I didn’t have time to look at the extractHtml.js changes; I’ll do it later.

miguelitoelgrande commented 5 years ago

Hi Adam,

absolutely on the same page.

I would also focus save-as-ebook on the part of extracting given pages to epub and leave the preprocessing to different tooling.

My user scripts only serve as an example for this. As I am using several PCs throughout the week, I was looking for an easy way to keep my CSS definitions in sync. Tampermonkey can do this (even though its main focus is JS, not CSS). But this way I can also simplify the document structure where needed.

I had not thought about the "reader mode" so far. I thought about switching to the print layout where applicable, but anyway, this would be preprocessing. And some page owners probably do not really care about use cases like printing or conversion to ebooks.

Another reason for the user scripts: I started using an extra script for hyphenation in Chrome (I think FF has this built in?). Very handy, as I do not need an extra iteration through Calibre for this (Kindle KF8 does not perform hyphenation unless there are soft hyphens in the document).

So, let's focus on "extractHtml.js". I am pretty sure you will like some of the fixes. I can provide some URLs of sample pages if some improvements are not clear. As you might imagine, I am primarily converting technical pages including tables and source code highlighting, so there is a lot more markup in the text than in novels and such.

Kind regards,

Michael

alexadam commented 5 years ago

sorry for taking so long... you did a lot of changes and I don't have enough time :) I'm trying to think of a way for automatically testing the web ext. before a release, maybe with Puppeteer...

alexadam commented 5 years ago

ok, so I created a 'tests' folder with a small puppeteer app that starts a chrome instance + the extension. I want to add some test pages & epub references and find a way to compare them with what is being generated. I don't know if this is the best way to do it, but it is the quickest for now... In the next few days I'll add as many references as possible and pages that didn't work or have issues.
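
For readers unfamiliar with this kind of setup, a minimal sketch of starting Chrome with an unpacked extension through Puppeteer might look like the following. The function names and the directory argument are hypothetical, and the actual app in the 'tests' folder may be structured differently; the two Chrome flags are the ones commonly used to load an extension, which has historically required a non-headless browser.

```javascript
const path = require('path');

// Build Puppeteer launch options that load exactly one unpacked extension.
// `extensionDir` is a hypothetical path to the extension's source folder.
function buildLaunchOptions(extensionDir) {
  const dir = path.resolve(extensionDir);
  return {
    headless: false, // assumption: run with a visible browser window
    args: [
      `--disable-extensions-except=${dir}`,
      `--load-extension=${dir}`,
    ],
  };
}

// Start Chrome with the extension loaded and return the browser handle.
async function launchWithExtension(extensionDir) {
  // Required lazily so the option builder works without Puppeteer installed.
  const puppeteer = require('puppeteer');
  return puppeteer.launch(buildLaunchOptions(extensionDir));
}
```

A test run would then open a test page in the returned browser, trigger the extension, and compare the generated epub with a stored reference.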

poire-z commented 4 years ago

Hello, and thanks for that extension!

Just letting you know of a few fixes and improvements I've made for my own reading of the generated EPUBs with KOReader on eInk devices (some context here). I've taken many bits from @miguelitoelgrande's work (and from #19), so it feels a bit awkward opening a PR with my changes :) Also, I'm using it with an older version of Firefox, and can't (don't really have time to) check how it would work with newer Firefox or Chrome. But feel free to pick any of my fixes that make sense.

@miguelitoelgrande : regarding your commit:

handle styling of links and other exceptions better (a, strong, em - tags)

This is a bit wrong. These are not exceptions, and what you did to these should be done to all tags, for all CSS properties that are inherited (per specs) and only them. I followed up on your huge improvements commit in https://github.com/poire-z/save-as-ebook/commit/8daadb52c00f4ee14fe0f755b72e80617b75f5d6 if you're interested.
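
The rule described above can be sketched roughly as follows: a property is dropped only when it is inherited and has the same value as on the parent, and this applies to every tag alike. This is an illustration under assumptions, not the code from the linked commit; the `INHERITED` set is a small subset of the inherited properties, and `stylesToKeep` is a hypothetical name.

```javascript
// Illustrative subset of CSS properties that are inherited per the CSS specs.
const INHERITED = new Set([
  'color', 'font-family', 'font-size', 'font-style', 'font-weight',
  'line-height', 'text-align', 'text-indent', 'letter-spacing',
  'white-space', 'visibility', 'list-style-type',
]);

// styles / parentStyles: plain objects of computed values,
// e.g. { color: 'rgb(0, 0, 0)', 'font-weight': '700' }.
function stylesToKeep(styles, parentStyles = {}) {
  const kept = {};
  for (const [prop, value] of Object.entries(styles)) {
    // An inherited property equal to the parent's value is redundant...
    if (INHERITED.has(prop) && parentStyles[prop] === value) continue;
    // ...but a non-inherited one is kept even if the parent has the same value,
    // because the child would not receive it from the parent.
    kept[prop] = value;
  }
  return kept;
}
```

With this rule there is no need for a list of special-cased tags such as a, strong or em.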

Note: the choice of which styles to include or not is quite use-case dependent :| @miguelitoelgrande added background-image, but I don't want them (as well as letter-spacing and others). But I want "float", which I understand many other EPUB reading software will not want. So, the choice of what styles to save or not may require manual tweaking to the code (until there is some UI configuration for that :)
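
As a sketch of what such a configuration could look like, the saved styles could be driven by an allow-list that each user adjusts. All names here (`DEFAULT_PROPS`, `buildAllowList`, `filterStyles`) and the default list are hypothetical and do not come from the extension's code.

```javascript
// Hypothetical default list of CSS properties carried into the EPUB.
const DEFAULT_PROPS = [
  'color', 'font-weight', 'font-style', 'text-align',
  'background-image', 'letter-spacing',
];

// Combine a base list with per-user additions and removals.
function buildAllowList(base, { add = [], remove = [] } = {}) {
  const allowed = new Set(base);
  add.forEach((prop) => allowed.add(prop));
  remove.forEach((prop) => allowed.delete(prop));
  return allowed;
}

// Keep only the properties that are on the allow-list.
function filterStyles(styles, allowed) {
  return Object.fromEntries(
    Object.entries(styles).filter(([prop]) => allowed.has(prop))
  );
}

// Example preference: keep "float", drop background-image and letter-spacing.
const allowed = buildAllowList(DEFAULT_PROPS, {
  add: ['float'],
  remove: ['background-image', 'letter-spacing'],
});
```

A UI setting would then only need to store the `add` and `remove` lists.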

alexadam / save-as-ebook

Several improvements around handling CSS and others. #33