dipu-bd / lightnovel-crawler

Generate and download e-books from online sources.
https://pypi.org/project/lightnovel-crawler/
GNU General Public License v3.0

Error processing HTML sources with non-BMP characters #1964

Open myndzi opened 1 year ago

myndzi commented 1 year ago

Describe the bug

HTML sources with non-BMP contents (such as emojis) cause a failure in etree.fromstring. It returns None instead of a parsed HTML tree.

I could not find an alternate invocation or parser that handles this correctly. Converting non-BMP characters to their HTML entity equivalents does work, but that workaround is thwarted because the default implementation of extract_chapter_images modifies chapter.body by running it through Beautiful Soup's decode_contents.
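For illustration, the entity workaround can be sketched as follows. This is a minimal sketch, not the code from my branch: the helper name escape_non_bmp is hypothetical, and it assumes the chapter body is already a Python str. It would also need to run after anything that re-serializes the body (such as the decode_contents call mentioned above), since that step presumably turns the entities back into literal characters.

```python
import re

# Matches any code point outside the Basic Multilingual Plane (U+10000 and up).
_NON_BMP = re.compile(r"[\U00010000-\U0010FFFF]")


def escape_non_bmp(html: str) -> str:
    """Replace each non-BMP character with a hexadecimal numeric character reference.

    Hypothetical helper: the name and placement are assumptions for illustration.
    """
    return _NON_BMP.sub(lambda m: f"&#x{ord(m.group()):X};", html)


if __name__ == "__main__":
    # U+1F400 is the rat emoji mentioned in this report.
    print(escape_non_bmp("<p>A rat: \U0001F400</p>"))
```

Characters inside the BMP (accented letters, CJK, and so on) are left untouched, so only the characters that trigger the failure are rewritten.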

You can see the issue with the 561st chapter of The Wandering Inn, which contains "🐀" and one more mouse (different color) that is no longer easy for me to find ;)

A command line similar to this should reproduce it: python lncrawl --format epub -s 'https://wanderinginn.com/' --multi --range 561 561. However, the TWI source is also broken, so you can use my branch here: https://github.com/dipu-bd/lightnovel-crawler/compare/master...myndzi:lightnovel-crawler:myndzi/fix-wanderinginn and comment out this method to get a full reproduction.

Stack trace:

Failed to generate "epub": Document is empty
Traceback (most recent call last):
  File "/Users/myndzi/local/python/lightnovel-crawler/lncrawl/binders/__init__.py", line 64, in generate_books
    outputs[fmt] = make_epubs(app, data)
                   ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/myndzi/local/python/lightnovel-crawler/lncrawl/binders/epub.py", line 229, in make_epubs
    output = bind_epub_book(
             ^^^^^^^^^^^^^^^
  File "/Users/myndzi/local/python/lightnovel-crawler/lncrawl/binders/epub.py", line 198, in bind_epub_book
    epub.write_epub(file_path, book, {})
  File "/Users/myndzi/local/python/lightnovel-crawler/venv/lib/python3.11/site-packages/ebooklib/epub.py", line 1746, in write_epub
    epub.write()
  File "/Users/myndzi/local/python/lightnovel-crawler/venv/lib/python3.11/site-packages/ebooklib/epub.py", line 1369, in write
    self._write_items()
  File "/Users/myndzi/local/python/lightnovel-crawler/venv/lib/python3.11/site-packages/ebooklib/epub.py", line 1356, in _write_items
    self.out.writestr('%s/%s' % (self.book.FOLDER_NAME, item.file_name), self._get_nav(item))
                                                                         ^^^^^^^^^^^^^^^^^^^
  File "/Users/myndzi/local/python/lightnovel-crawler/venv/lib/python3.11/site-packages/ebooklib/epub.py", line 1212, in _get_nav
    inserted_pages = get_pages_for_items([item for item in self.book.get_items_of_type(ebooklib.ITEM_DOCUMENT) \
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/myndzi/local/python/lightnovel-crawler/venv/lib/python3.11/site-packages/ebooklib/utils.py", line 119, in get_pages_for_items
    pages_from_docs = [get_pages(item) for item in items]
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/myndzi/local/python/lightnovel-crawler/venv/lib/python3.11/site-packages/ebooklib/utils.py", line 119, in <listcomp>
    pages_from_docs = [get_pages(item) for item in items]
                       ^^^^^^^^^^^^^^^
  File "/Users/myndzi/local/python/lightnovel-crawler/venv/lib/python3.11/site-packages/ebooklib/utils.py", line 96, in get_pages
    body = parse_html_string(item.get_body_content())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/myndzi/local/python/lightnovel-crawler/venv/lib/python3.11/site-packages/ebooklib/utils.py", line 48, in parse_html_string
    html_tree = html.document_fromstring(s, parser=utf8_parser)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/myndzi/local/python/lightnovel-crawler/venv/lib/python3.11/site-packages/lxml/html/__init__.py", line 761, in document_fromstring
    raise etree.ParserError(
lxml.etree.ParserError: Document is empty

Notes

My branch contains some changes to how lncrawl names the volumes, for my own preference, and it also post-processes the individual chapters to make certain quirks of the author's style more recognizable in an e-reader format (they use different colored text for some individuals speaking). I am happy to contribute the changes if you'd like them, but I'm not expecting them to be mergeable as-is. I just thought I'd report the underlying bug since it was a huge hassle to track down.

There's also a minor bug with this line, where the logger complains about not all arguments being used (I added %s to fix it): https://github.com/dipu-bd/lightnovel-crawler/compare/master...myndzi:lightnovel-crawler:myndzi/fix-wanderinginn#diff-b8196171235450b44c650f8bddb0c17c94f3e848bdea34f4b1b293a75efc5269L243
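To show the kind of logging problem meant here, this is a small sketch using only the standard library. The message text and the render helper are made up for illustration and are not the actual lncrawl line. Python's logging applies msg % args lazily, so passing an argument without a matching %s placeholder fails at formatting time.

```python
import logging


def render(msg, *args):
    """Format a log message the way the logging module does (msg % args).

    Hypothetical helper for demonstration; it builds a LogRecord directly
    so the formatting error can be observed instead of being swallowed
    by the handler.
    """
    record = logging.LogRecord("demo", logging.INFO, "demo.py", 0, msg, args, None)
    return record.getMessage()


if __name__ == "__main__":
    # With a placeholder, the argument is interpolated.
    print(render("Created: %s", "book.epub"))
    # Without a placeholder, formatting raises TypeError
    # ("not all arguments converted during string formatting").
    try:
        render("Created", "book.epub")
    except TypeError as exc:
        print("formatting failed:", exc)
```

In normal use the logger catches this error and prints a traceback to stderr rather than raising, which is why it shows up as a complaint in the output instead of a crash.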

Let us know

App source: git source
App version: current master (https://github.com/dipu-bd/lightnovel-crawler/commit/49f2007b26ad7327302c5cda1a04011603afbd8d)
Your OS: Mac

myndzi commented 1 year ago

(p.s. - I don't quite follow the whole epub construction, but later chapters of this web novel include a lot of fan art at the bottom. It isn't easy to separate cleanly from the text content, so I left it in. But I noticed that even the early volumes that don't do this are ~80 MB files. Is the epub constructor perhaps including all downloaded images, even those not referenced by the particular volume? I'm not sure how to avoid that, but it seems worthwhile.)

Edit: yep, that's exactly what was happening. Updated my branch with a fix for that, too: https://github.com/myndzi/lightnovel-crawler/commit/29f3239e5e79297bb717d5fef7d69c90b4712b5a
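The general idea of limiting a volume to the images it actually references can be sketched like this. It is only an illustration under assumptions, not the code in the linked commit: the function names are hypothetical, chapter bodies are assumed to be HTML strings, and images are assumed to be matched by file name.

```python
from html.parser import HTMLParser
import posixpath


class _ImgCollector(HTMLParser):
    """Collects the file names found in <img src="..."> attributes."""

    def __init__(self):
        super().__init__()
        self.sources = set()

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.add(posixpath.basename(src))


def referenced_images(chapter_bodies):
    """Return the set of image file names referenced by the given chapter bodies."""
    found = set()
    for body in chapter_bodies:
        collector = _ImgCollector()
        collector.feed(body)
        collector.close()
        found |= collector.sources
    return found


def images_for_volume(chapter_bodies, downloaded):
    """Keep only the downloaded image paths that some chapter in the volume uses."""
    wanted = referenced_images(chapter_bodies)
    return [path for path in downloaded if posixpath.basename(path) in wanted]
```

For example, given two chapter bodies that reference a.jpg and b.png, and a download folder that also holds c.jpg, images_for_volume returns only the first two paths, so the unreferenced image is left out of that volume's epub.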