chazzam / wordpress-epub

Download articles from wordpress, convert to epub
GNU Lesser General Public License v3.0
23 stars, 10 forks

better cleanup functionality #11

Open chazzam opened 7 years ago

chazzam commented 7 years ago

Instead of trying Readability, mayhap follow FeedIron's example and have 'cleanup' XPath selections that pick out the content to be removed from the scraped pages.

A list of tags to remove could then be specified in the config.
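A rough sketch of what that cleanup could look like, assuming the rules are plain XPath strings read from the config. The rule list, the sample markup, and the function names here are made up for illustration. It uses the standard library's ElementTree, which only understands a limited XPath subset and needs well-formed markup; real scraped HTML would probably want a proper HTML parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical cleanup rules, as they might be listed in the config.
CLEANUP_XPATHS = [".//div[@class='sharedaddy']", ".//script"]


def remove_keep_tail(parent, node):
    """Detach node from parent but keep the text that followed it (its tail)."""
    tail = node.tail or ""
    index = list(parent).index(node)
    if index > 0:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    else:
        parent.text = (parent.text or "") + tail
    parent.remove(node)


def cleanup(root, xpaths):
    """Remove every element matched by any of the XPath expressions."""
    for xpath in xpaths:
        # Rebuild the child -> parent map on each pass, since the tree changes.
        parents = {child: parent for parent in root.iter() for child in parent}
        for node in root.findall(xpath):
            remove_keep_tail(parents[node], node)
    return root


# Made-up, well-formed sample page.
page = ET.fromstring(
    "<html><body>"
    "<div class='entry'><p>Chapter text.</p><script>track()</script> More text.</div>"
    "<div class='sharedaddy'>Share this</div>"
    "</body></html>"
)
cleaned = ET.tostring(cleanup(page, CLEANUP_XPATHS), encoding="unicode")
print(cleaned)
```

Keeping the tail text matters because ElementTree stores the text after an element on that element, so a plain remove() would also drop chapter text that follows a removed tag.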

chazzam commented 6 years ago

yeah, probably rewrite the downloader section to just download pages through the scraper and save them as-is, with no modifications.

Consider having the epub generator do XPath filtering on a per novel/epub basis when adding the chapter to the epub, instead of saving those edits on disk. Optionally, add a variant that writes the filtered files to a new folder instead of adding them to the epub, for troubleshooting. That would be a three-step process instead of the current two: download, filter, epub. Currently it's download+filter, epub. I would at least make it download, filter+epub, potentially with a flag to split filter and epub, or otherwise make the operation more debuggable.

This method would also allow something like a plain wget -m to be used to download the pages, and then the tool could operate on those files and handle updates more intelligently.
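One way the download, filter, epub split might look, as a sketch only. The function names, the dump_dir option, and the placeholder filter are all hypothetical; the point is that the downloaded files stay untouched and the filtered output can optionally be written to a separate folder for troubleshooting.

```python
from pathlib import Path


def filter_chapter(raw_html):
    """Placeholder filter step; a real one would apply the configured XPath rules."""
    return raw_html.replace("<!-- ad -->", "")


def filter_pages(raw_dir, dump_dir=None):
    """Filter saved pages in memory, never modifying the downloaded files.

    If dump_dir is given, the filtered pages are also written there so the
    filter step can be inspected separately from epub generation.
    """
    chapters = []
    for path in sorted(Path(raw_dir).glob("*.html")):
        filtered = filter_chapter(path.read_text(encoding="utf-8"))
        if dump_dir is not None:
            out_dir = Path(dump_dir)
            out_dir.mkdir(parents=True, exist_ok=True)
            (out_dir / path.name).write_text(filtered, encoding="utf-8")
        chapters.append((path.name, filtered))
    return chapters
```

The epub step would then consume the returned (name, html) pairs, and raw_dir could just as well be a folder filled by some other downloader.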

Also, probably add support for connecting with a login to the site for downloading.

Update so that a conf.d/ or a single *.conf can be used for configuring, and the xpath rules and login info can be shared within a config file. One config file can still define multiple epubs, to allow multiple serializations hosted on a single site to be downloaded and converted each into their own epub.
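A sketch of how such a config might be laid out and read, assuming INI-style files handled with Python's configparser. Every section and key name below (login_user, cleanup_xpaths, title, novel-a, novel-b) is hypothetical. Values in [DEFAULT] are shared by all epubs in the file, and a section can override them.

```python
import configparser

# Hypothetical config: one file, shared rules, two epubs.
SAMPLE = """
[DEFAULT]
login_user = reader
cleanup_xpaths =
    .//div[@class='sharedaddy']
    .//script

[novel-a]
title = Novel A

[novel-b]
title = Novel B
cleanup_xpaths =
    .//div[@class='ads']
"""


def load_epubs(text):
    """Return one settings dict per epub section, with shared defaults applied."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    epubs = {}
    for name in parser.sections():
        section = parser[name]
        rules = [
            line.strip()
            for line in section["cleanup_xpaths"].splitlines()
            if line.strip()
        ]
        epubs[name] = {
            "title": section["title"],
            "login_user": section["login_user"],
            "cleanup_xpaths": rules,
        }
    return epubs


epubs = load_epubs(SAMPLE)
print(epubs["novel-b"]["cleanup_xpaths"])
```

For a conf.d/ directory, ConfigParser.read() accepts a list of filenames, so passing the sorted *.conf paths would merge them into one parser.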

Zurandis commented 5 years ago

Being able to remove the editor comments in chapters like this one http://www.translationnations.com/translations/stellar-transformations/st-book-16-chapter-44/ would be nice. They're just out of place in the middle of the chapters, and if it's just an opinion, it's out of place in the book, period.

Zurandis commented 5 years ago

The translator comments could probably be removed as well, just for consistency, since they also appear in the middle of chapters. It might've been better to handle them on a case-by-case basis, but that'd be too much of a hassle, so just blast them all.

chazzam commented 5 years ago

For reference, in the example, it's stuff like s/\[TL:[^]]*\]//g; s/\[Robin:[^]]*\]//g; but there would need to be a way to configure those as part of chapter cleanup. Allow for other common spelling/name changes as well, for example s/Lou Fang/Lou Feng/;
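A minimal sketch of applying such rules during chapter cleanup, assuming they are stored as ordered (pattern, replacement) pairs. The rule list and function name are hypothetical, and the patterns use Python's re syntax rather than sed's. The bracket patterns match a whole [TL: ...] or [Robin: ...] note up to its closing bracket.

```python
import re

# Hypothetical per-epub rules, applied in order.
SUBSTITUTIONS = [
    (r"\[TL:[^\]]*\]", ""),      # translator notes
    (r"\[Robin:[^\]]*\]", ""),   # editor notes
    (r"Lou Fang", "Lou Feng"),   # name/spelling fix
]


def apply_substitutions(text, rules):
    """Run each regex substitution over the chapter text, in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


# Made-up sample text.
sample = "Lou Fang drew his sword. [TL: nice!] He waited. [Robin: typo fixed]"
result = apply_substitutions(sample, SUBSTITUTIONS)
print(result)
```

Removing a note this way can leave doubled spaces behind, so a whitespace-tidying rule at the end of the list may be worth adding.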