Open myndzi opened 1 year ago
(p.s. - I don't quite follow the whole epub construction, but later chapters of this web novel include a lot of fan art at the bottom. It's not easy to cleanly distinguish from the text content, so I left it in. But I noticed that even the early volumes that don't do this are ~80 MB files. Perchance is the epub constructor including all downloaded images, even those not referenced by the particular volume? I'm not sure how to avoid that, but it seems worthwhile)
Edit: yep, that's exactly what was happening. Updated my branch with a fix for that, too: https://github.com/myndzi/lightnovel-crawler/commit/29f3239e5e79297bb717d5fef7d69c90b4712b5a
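For illustration, here is a minimal sketch of the idea of packaging only the images a volume actually references. It is not the code from the linked commit: the helper name referenced_images, its arguments, and the assumption that images are matched by file name are all hypothetical.

```python
from html.parser import HTMLParser
from posixpath import basename


class _ImgSrcCollector(HTMLParser):
    """Collects the file names of every <img src="..."> seen in the fed HTML."""

    def __init__(self):
        super().__init__()
        self.sources = set()

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.add(basename(src))


def referenced_images(chapter_bodies, downloaded_names):
    """Hypothetical helper: keep only the downloaded image names that
    appear in at least one chapter body of the volume."""
    collector = _ImgSrcCollector()
    for body in chapter_bodies:
        collector.feed(body)
    collector.close()
    return [name for name in downloaded_names if name in collector.sources]
```

Under these assumptions, an epub builder would pass only the returned names to the packaging step for each volume, instead of every downloaded image.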
Describe the bug
HTML sources with non-BMP contents (such as emojis) cause a failure in etree.fromstring. It returns None instead of a parsed HTML tree.

I was unable to find if there is an alternate invocation or parser that would make this work correctly, however, converting non-BMP characters to their HTML entity equivalents works. It is, however, thwarted by the fact that for some reason the default implementation of extract_chapter_images modifies chapter.body by running it through Beautiful Soup's decode_contents.

You can see the issue with the 561st chapter of The Wandering Inn, which contains "🐀" and one more mouse (different color) that is no longer easy for me to find ;)
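The entity workaround described above can be sketched roughly as follows. This is only an illustration using the standard library. The function name escape_non_bmp is hypothetical and is not taken from lncrawl or from the linked branch.

```python
import re

# Matches any code point outside the Basic Multilingual Plane (U+10000 and above).
_NON_BMP = re.compile("[\U00010000-\U0010FFFF]")


def escape_non_bmp(html: str) -> str:
    """Hypothetical helper: replace each non-BMP character with a
    hexadecimal numeric character reference, e.g. U+1F400 -> &#x1F400;."""
    return _NON_BMP.sub(lambda m: f"&#x{ord(m.group()):X};", html)
```

Characters inside the BMP, including accented letters and CJK text, are left untouched. Per the report, the escaping would have to happen after any step that re-serializes the body through decode_contents, since that step undoes it.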
command line similar to this should reproduce:
python lncrawl --format epub -s 'https://wanderinginn.com/' --multi --range 561 561
- except TWI source is also broken. You can use my branch here: https://github.com/dipu-bd/lightnovel-crawler/compare/master...myndzi:lightnovel-crawler:myndzi/fix-wanderinginn and comment out this method to get a full reproduction.

Stack trace:
Notes
My branch contains some changes to how lncrawl names the volumes, for my own preference - and it also post-processes the individual chapters to make certain quirks of the author's style more recognizable in an e-reader format (they use different colored text for some individuals speaking). I am happy to contribute the changes if you'd like them, but I'm not expecting them to be mergeable as-is. Just thought I'd report the underlying bug since it was a huge hassle to track down.
There's also a minor bug with this line: https://github.com/dipu-bd/lightnovel-crawler/compare/master...myndzi:lightnovel-crawler:myndzi/fix-wanderinginn#diff-b8196171235450b44c650f8bddb0c17c94f3e848bdea34f4b1b293a75efc5269L243
where the logger complains about not all arguments being used (added %s).

Let us know
App source: git source
App version: current master (https://github.com/dipu-bd/lightnovel-crawler/commit/49f2007b26ad7327302c5cda1a04011603afbd8d)
Your OS: Mac
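Regarding the logger complaint mentioned in the notes: Python's logging module formats the message with the % operator, so an extra argument without a matching placeholder produces a "not all arguments converted" error when the record is emitted. A minimal sketch of the pattern follows. The logger name, function name, and message text are made up for illustration and are not the actual line from the diff.

```python
import logging

logger = logging.getLogger("lncrawl.example")  # hypothetical logger name


def log_failure(url: str) -> None:
    """Hypothetical example of the fixed call."""
    # Broken form (no placeholder for the argument):
    #     logger.error("Failed to download", url)
    # Fixed form: one %s per argument, formatted lazily by logging.
    logger.error("Failed to download %s", url)
```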