Open chazzam opened 7 years ago
yeah, probably rewrite the downloader section to just download pages through the scraper and save them as-is, with no modifications.
Consider having the epub generator do xpath filtering on a per novel/epub basis when adding the chapter to the epub, instead of saving those edits on disk. Optionally, add a variant that writes the filtered files to a new folder instead of adding them to the epub, for troubleshooting. This would be a three-step process instead of the current two: download, filter, epub. Currently it's download+filter, epub. I would at least make it download, filter+epub, with potentially a flag to split filter and epub, or at least make the operation more debuggable.
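A rough sketch of the download, filter+epub split with an optional debug folder, to make the idea concrete. All names here (download, filter_chapter, build, debug_dir) are hypothetical, the `pages` dict stands in for whatever the scraper returns, and the script-stripping filter is only a placeholder for the per-novel xpath rules.

```python
from pathlib import Path
import re


def download(pages, raw_dir):
    """Stage 1: save every page exactly as fetched, with no modifications.

    `pages` maps file name -> raw HTML (a stand-in for the scraper output).
    """
    raw_dir.mkdir(parents=True, exist_ok=True)
    for name, html in pages.items():
        (raw_dir / name).write_text(html, encoding="utf-8")


def filter_chapter(html):
    """Stage 2: placeholder filter; the real one would apply per-novel rules."""
    return re.sub(r"<script\b.*?</script>", "", html, flags=re.S)


def build(raw_dir, debug_dir=None):
    """Stages 2+3: filter each saved page in memory, then hand it on.

    When `debug_dir` is given, the filtered pages are also written there so
    they can be inspected. The files in `raw_dir` are never modified.
    """
    chapters = []
    for path in sorted(raw_dir.glob("*.html")):
        cleaned = filter_chapter(path.read_text(encoding="utf-8"))
        if debug_dir is not None:
            debug_dir.mkdir(parents=True, exist_ok=True)
            (debug_dir / path.name).write_text(cleaned, encoding="utf-8")
        chapters.append((path.stem, cleaned))  # would be added to the epub here
    return chapters
```

Because the raw pages stay untouched, changing a filter rule only needs a rebuild, not a re-download.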
this method would also allow something like wget -m to be used to download the pages, and then operate on those and handle updates more intelligently.
Also, probably add support for connecting with a login to the site for downloading.
Update so that a conf.d/ or a single *.conf can be used for configuring, and the xpath rules and login info can be shared within a config file. One config file can still define multiple epubs, to allow multiple serializations hosted on a single site to be downloaded and converted each into their own epub.
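One possible shape for this, sketched with the standard-library configparser. The section and key names ([site], [epub:NAME], cleanup_xpath, and so on) are made up for illustration and are not an existing format; values in [site] act as shared defaults that each epub section can override.

```python
import configparser
from pathlib import Path


def load_config(path):
    """Read a single *.conf file, or every *.conf inside a conf.d/ directory."""
    path = Path(path)
    files = sorted(path.glob("*.conf")) if path.is_dir() else [path]
    # interpolation=None so '%' in URLs or xpath rules is taken literally
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(files, encoding="utf-8")
    return parser


def epub_settings(parser):
    """Return {epub name: settings}; [site] values are shared fallbacks."""
    shared = dict(parser["site"]) if parser.has_section("site") else {}
    result = {}
    for section in parser.sections():
        if section.startswith("epub:"):
            name = section[len("epub:"):]
            result[name] = {**shared, **parser[section]}
    return result
```

A config using this layout might put the login and common xpath rules under [site], then list one [epub:...] section per serialization hosted on that site.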
Being able to remove the editor comments in chapters like this one http://www.translationnations.com/translations/stellar-transformations/st-book-16-chapter-44/ would be nice. They're out of place in the middle of the chapters, and if a comment is just an opinion, it's out of place in the book, period.
The translator comments could probably be removed as well, just for consistency, since they also appear in the middle of chapters. It might've been better to handle them on a case-by-case basis, but that'd be too much of a hassle, so just blast them all.
For reference, in the example, it's stuff like s/\[TL:[^]]*\]//g; s/\[Robin:[^]]*\]//g; but there would need to be a way to configure those as part of chapter cleanup. Allow for other common spelling/name changes as well, for example s/Lou Fang/Lou Feng/;
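A minimal sketch of configurable substitutions, assuming the rules arrive as (pattern, replacement) pairs from the config. The patterns below are my guess at the intent (drop bracketed notes starting with TL: or Robin:, then fix a name) and the helper names are hypothetical.

```python
import re


def compile_rules(rules):
    """Turn (pattern, replacement) string pairs into compiled regex rules."""
    return [(re.compile(pattern), repl) for pattern, repl in rules]


def clean_text(text, rules):
    """Apply each substitution in order, like a chain of sed s///g commands."""
    for pattern, repl in rules:
        text = pattern.sub(repl, text)
    return text


# Example rules; in practice these would be read from the per-novel config.
RULES = compile_rules([
    # Bracketed translator/editor notes, plus any whitespace just before them.
    (r"\s*\[(?:TL|Robin):[^\]]*\]", ""),
    # Spelling/name normalisation.
    (r"\bLou Fang\b", "Lou Feng"),
])

print(clean_text("Lou Fang smiled. [TL: nice one] He left.", RULES))
```

Rule order matters: later rules see the output of earlier ones, so name fixes run on text that already has the notes stripped.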
instead of trying readability, mayhap follow feediron's example and have 'cleanup' xpath selections of content to be removed from the web scraped content.
Can specify a list of tags in the config to remove this way
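Something like the following could do the 'cleanup' removal. This is only a sketch using the standard-library ElementTree, so it assumes the page is well-formed XHTML and the rules fit ElementTree's limited XPath subset; real scraped HTML would probably need a more forgiving parser. The function name and the class names in the usage note are invented.

```python
import xml.etree.ElementTree as ET


def remove_matches(root, xpaths):
    """Remove every element matched by any of the given XPath expressions."""
    for xpath in xpaths:
        # ElementTree keeps no parent pointers, so build a child -> parent map.
        parents = {child: parent for parent in root.iter() for child in parent}
        for node in root.findall(xpath):
            parent = parents.get(node)
            if parent is None:
                continue  # the root itself cannot be removed
            # Keep the text that follows the removed element (its "tail").
            if node.tail:
                index = list(parent).index(node)
                if index == 0:
                    parent.text = (parent.text or "") + node.tail
                else:
                    previous = parent[index - 1]
                    previous.tail = (previous.tail or "") + node.tail
            parent.remove(node)
    return root
```

Usage would be along the lines of remove_matches(ET.fromstring(page), [".//span[@class='tl-note']", ".//div[@class='share']"]), with the list of expressions coming from the config.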