dipu-bd / lightnovel-crawler

Generate and download e-books from online sources.
https://pypi.org/project/lightnovel-crawler/
GNU General Public License v3.0

Fix this source: wuxiaworld.com #1708

Closed dastrdly6585 closed 1 year ago

dastrdly6585 commented 1 year ago

Let us know

Novel URL: https://www.wuxiaworld.com/novel/rankers-return
App Location: EXE
App Version: 3.0.1

Describe this issue

There are two different issues encountered when crawling wuxiaworld with the new webdriver:

? Enter novel page url or query novel: https://www.wuxiaworld.com/novel/rankers-return
? Do you want to log in? No
Retrieving novel info...
https://www.wuxiaworld.com/novel/rankers-return
Volumes:  58%|█████████████████████████████████████████▋                              | 11/19 [00:19<00:12,  1.60s/vol]
Exception in thread Thread-1 (read_novel_info):
cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 Captcha challenge, This feature is not available in the opensource (free) version.

During handling of the above exception, another exception occurred:

selenium.common.exceptions.ElementClickInterceptedException: Message: element click intercepted: Element <div class="MuiPaper-root MuiPaper-elevation MuiPaper-rounded MuiPaper-elevation1 MuiAccordion-root MuiAccordion-rounded my-0 overflow-hidden overflow-clip rounded-none border-b-0 shadow-none first:rounded-none last:rounded-none dark:bg-[#202020] md:mb-[12px] md:rounded-[12px] md:bg-white md:shadow-ww-text-container md:first:rounded-[12px] md:last:rounded-[12px] md:dark:bg-gray-850 ww-idoywx">...</div> is not clickable at point (497, 52). Other element would receive the click: <div class="mr-8 flex flex-1 items-center pt-[8px] sm:pt-0 sm:pl-[24px]">...</div>
  (Session info: chrome=106.0.5249.103)
Stacktrace:
Backtrace:
        Ordinal0 [0x00451ED3+2236115]
        Ordinal0 [0x003E92F1+1807089]
        Ordinal0 [0x002F66FD+812797]
        Ordinal0 [0x0032BEC7+1031879]
        Ordinal0 [0x00329E6C+1023596]
        Ordinal0 [0x00327A5B+1014363]
        Ordinal0 [0x003266E7+1009383]
        Ordinal0 [0x0031C416+967702]
        Ordinal0 [0x00341A8C+1120908]
        Ordinal0 [0x0031BD84+966020]
        Ordinal0 [0x00341CA4+1121444]
        Ordinal0 [0x003559E2+1202658]
        Ordinal0 [0x003418A6+1120422]
        Ordinal0 [0x0031A73D+960317]
        Ordinal0 [0x0031B71F+964383]
        GetHandleVerifier [0x006FE7E2+2743074]
        GetHandleVerifier [0x006F08D4+2685972]
        GetHandleVerifier [0x004E2BAA+532202]
        GetHandleVerifier [0x004E1990+527568]
        Ordinal0 [0x003F080C+1837068]
        Ordinal0 [0x003F4CD8+1854680]
        Ordinal0 [0x003F4DC5+1854917]
        Ordinal0 [0x003FED64+1895780]
        BaseThreadInitThunk [0x776FFA29+25]
        RtlGetAppContainerNamedObjectPath [0x77B37B5E+286]
        RtlGetAppContainerNamedObjectPath [0x77B37B2E+238]

NOVEL: Ranker'S Return
11 volumes and 550 chapters found

budikesuma commented 1 year ago

@dipu-bd

Can't install minify-html on termux

No matching distribution found for minify-html

[Screenshot: Screenshot_20221010-191031_Termux]

ShakeBake commented 1 year ago

For me, Termux just doesn't upgrade to 3.0. It goes through the motions, but the installed version is still 2.34. No errors are shown.

Requirement already satisfied: lightnovel-crawler in /data/data/com.termux/files/usr/lib/python3.10/site-packages (2.34.0)
Collecting lightnovel-crawler
Using cached lightnovel_crawler-3.0.1-py3-none-any.whl (500 kB)
Requirement already satisfied: ascii in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from lightnovel-crawler) (3.6)
Requirement already satisfied: requests<3.0.0,>=2.20.0 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from lightnovel-crawler) (2.28.1)
Using cached lightnovel_crawler-3.0.0-py3-none-any.whl (499 kB)
Requirement already satisfied: js2py==0.71 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from lightnovel-crawler) (0.71)
Requirement already satisfied: ebooklib<1.0.0,>=0.17.0 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from lightnovel-crawler) (0.17.1)
Requirement already satisfied: python-slugify<7.0.0,>=4.0.0 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from lightnovel-crawler) (6.1.2)
Requirement already satisfied: base58~=2.1.1 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from lightnovel-crawler) (2.1.1)
Requirement already satisfied: cloudscraper>=1.2.60 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from lightnovel-crawler) (1.2.64)
Requirement already satisfied: regex in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from lightnovel-crawler) (2022.9.13)
Requirement already satisfied: colorama<0.5.0,>=0.4.0 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from lightnovel-crawler) (0.4.5)
Requirement already satisfied: packaging in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from lightnovel-crawler) (21.3)
Requirement already satisfied: prompt-toolkit~=3.0 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from lightnovel-crawler) (3.0.31)
Requirement already satisfied: beautifulsoup4<5.0.0,>=4.8.0 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from lightnovel-crawler) (4.11.1)
Requirement already satisfied: questionary>=1.6.0 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from lightnovel-crawler) (1.10.0)
Requirement already satisfied: pillow>=6.0.0 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from lightnovel-crawler) (9.2.0)
Requirement already satisfied: html5lib~=1.1 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from lightnovel-crawler) (1.1)
Requirement already satisfied: tqdm<5.0,>=4.60 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from lightnovel-crawler) (4.64.1)
Requirement already satisfied: python-dotenv<1.0.0,>=0.15.0 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from lightnovel-crawler) (0.21.0)
Requirement already satisfied: lxml<5.0.0,>=4.0.0 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from lightnovel-crawler) (4.9.1)
Requirement already satisfied: pyease-grpc>=1.3.0 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from lightnovel-crawler) (1.3.0)
Requirement already satisfied: tzlocal>=1.2 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from js2py==0.71->lightnovel-crawler) (4.2)
Requirement already satisfied: pyjsparser>=2.5.1 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from js2py==0.71->lightnovel-crawler) (2.7.1)
Requirement already satisfied: six>=1.10 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from js2py==0.71->lightnovel-crawler) (1.16.0)
Requirement already satisfied: soupsieve>1.2 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from beautifulsoup4<5.0.0,>=4.8.0->lightnovel-crawler) (2.3.2.post1)
Requirement already satisfied: pyparsing>=2.4.7 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from cloudscraper>=1.2.60->lightnovel-crawler) (3.0.9)
Requirement already satisfied: requests-toolbelt>=0.9.1 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from cloudscraper>=1.2.60->lightnovel-crawler) (0.10.0)
Requirement already satisfied: webencodings in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from html5lib~=1.1->lightnovel-crawler) (0.5.1)
Requirement already satisfied: wcwidth in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from prompt-toolkit~=3.0->lightnovel-crawler) (0.2.5)
Requirement already satisfied: protobuf>=3.19.0 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from pyease-grpc>=1.3.0->lightnovel-crawler) (4.21.7)
Requirement already satisfied: text-unidecode>=1.3 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from python-slugify<7.0.0,>=4.0.0->lightnovel-crawler) (1.3)
Requirement already satisfied: charset-normalizer<3,>=2 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from requests<3.0.0,>=2.20.0->lightnovel-crawler) (2.1.1)
Requirement already satisfied: certifi>=2017.4.17 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from requests<3.0.0,>=2.20.0->lightnovel-crawler) (2022.9.24)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from requests<3.0.0,>=2.20.0->lightnovel-crawler) (1.26.12)
Requirement already satisfied: idna<4,>=2.5 in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from requests<3.0.0,>=2.20.0->lightnovel-crawler) (3.4)
Requirement already satisfied: pytz-deprecation-shim in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from tzlocal>=1.2->js2py==0.71->lightnovel-crawler) (0.1.0.post0)
Requirement already satisfied: tzdata in /data/data/com.termux/files/usr/lib/python3.10/site-packages (from pytz-deprecation-shim->tzlocal>=1.2->js2py==0.71->lightnovel-crawler) (2022.4)

dipu-bd commented 1 year ago

Use: pip uninstall lightnovel-crawler && pip install -U lightnovel-crawler

Please report one problem per issue. If you want to discuss casually, there is a discussion page: https://github.com/dipu-bd/lightnovel-crawler/discussions
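
After the uninstall and reinstall suggested above, it can help to confirm which version is actually installed instead of trusting pip's output. The following is only a sketch using the standard library's `importlib.metadata`. The helper names (`parse_version`, `installed_version`, `is_upgraded`) are made up for illustration and are not part of lightnovel-crawler, and the version parsing is deliberately naive (plain dotted numbers only).

```python
from importlib import metadata


def parse_version(text):
    """Turn '3.0.1' into (3, 0, 1); non-numeric parts are ignored (naive on purpose)."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_upgraded(package, minimum):
    """True when `package` is installed at version `minimum` or newer."""
    current = installed_version(package)
    return current is not None and parse_version(current) >= parse_version(minimum)


if __name__ == "__main__":
    name = "lightnovel-crawler"
    print(name, installed_version(name), is_upgraded(name, "3.0.0"))
```

Run it with the same Python interpreter that pip installs into; otherwise it may report a different site-packages directory than the one shown in the log above.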

dipu-bd commented 1 year ago

  • Logging in doesn't appear to do anything, as after getting 10 chapters you get a notice saying you have viewed the guest chapter limit. The headless chrome instances opened by the crawler don't indicate you are logged in.

The browser is a relatively new feature, and I have not added login support for it yet. I will figure out how to do it sometime in the future.

dipu-bd commented 1 year ago

  • For novels where there are multiple volumes, you have to make sure to continue scrolling up in the browser while the chapters are being populated, otherwise it fails to get all the chapters. If you don't do anything, the crawler just stalls until you get an error and it stops crawling the remaining chapters. If it stalls, you can get it to continue collecting chapters by scrolling up and manually clicking the next volume's drop-down, but if you don't continue scrolling up for the remaining volumes it immediately errors out instead of stalling. The error seems to suggest that because the volume drop-down is outside of the visible window bounds it can't be clicked, which might explain why continuously scrolling up avoids the problem:

This is a strange issue indeed. When I checked on my PC with a novel of 12+ volumes, it worked without any issue. I will check everything again.
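
The `ElementClickInterceptedException` quoted in the report says another element would receive the click. A common Selenium workaround for this is to scroll the target into the viewport before clicking, and to fall back to a JavaScript click if something still covers it. The sketch below illustrates that general pattern only; it is not the fix that was applied to the crawler. `click_in_view` is a hypothetical helper name, and `driver` and `element` are assumed to be a Selenium WebDriver and WebElement.

```python
try:
    from selenium.common.exceptions import ElementClickInterceptedException
except ImportError:  # fallback so the sketch can be read and tested without Selenium
    class ElementClickInterceptedException(Exception):
        pass


def click_in_view(driver, element):
    """Scroll `element` to the middle of the viewport, then click it.

    Falls back to a JavaScript click if another element still intercepts.
    Returns "native" or "script" to show which path was used.
    """
    # Centering keeps the target clear of sticky headers at the top edge.
    driver.execute_script(
        "arguments[0].scrollIntoView({block: 'center'});", element
    )
    try:
        element.click()
        return "native"
    except ElementClickInterceptedException:
        # A script click bypasses the hit-testing that raised the error.
        driver.execute_script("arguments[0].click();", element)
        return "script"
```

The script click skips the browser's check that the element is actually reachable by a user, so it is best kept as a fallback and not used as the default.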

avggeek commented 1 year ago

Hopefully this is the right issue to include my findings as previous issues like #1579 and #1701 have been closed.

The error I get is slightly different after upgrading to 3.0.1.

I tried to download (this novel) using the following command:

lncrawl --login Bearer ey.. -s https://www.wuxiaworld.com/novel/overgeared --format epub --filename "Overgeared - rainbowturtle" --filename-only --output . --single

The download errors out with the following message:

Failed to get chapter: Message: no such element: Unable to locate element: {"method":"css selector","selector":".chapter-content"}
  (Session info: headless chrome=106.0.5249.103)
Stacktrace:
#0 0x56271decf2c3 <unknown>
#1 0x56271dcd883a <unknown>
#2 0x56271dd11985 <unknown>
#3 0x56271dd11b61 <unknown>
#4 0x56271dd49d14 <unknown>
#5 0x56271dd2ff6d <unknown>
#6 0x56271dd47a50 <unknown>
#7 0x56271dd2fd63 <unknown>
#8 0x56271dd047e3 <unknown>
#9 0x56271dd05a21 <unknown>
#10 0x56271df1d18e <unknown>
#11 0x56271df20622 <unknown>
#12 0x56271df03aae <unknown>
#13 0x56271df212a3 <unknown>
#14 0x56271def7ecf <unknown>
#15 0x56271df41588 <unknown>
#16 0x56271df41706 <unknown>
#17 0x56271df5b8b2 <unknown>
#18 0x7f9da4bc5ea7 <unknown>

Chapters:   0%|                           | 2/1705 [01:38<19:20:13, 40.88s/item]
Failed to get chapter: Message: no such element: Unable to locate element: {"method":"css selector","selector":".chapter-content"}
  (Session info: headless chrome=106.0.5249.103)
Stacktrace:
#0 0x56271decf2c3 <unknown>
#1 0x56271dcd883a <unknown>
#2 0x56271dd11985 <unknown>
#3 0x56271dd11b61 <unknown>
#4 0x56271dd49d14 <unknown>
#5 0x56271dd2ff6d <unknown>
#6 0x56271dd47a50 <unknown>
#7 0x56271dd2fd63 <unknown>
#8 0x56271dd047e3 <unknown>
#9 0x56271dd05a21 <unknown>
#10 0x56271df1d18e <unknown>
#11 0x56271df20622 <unknown>
#12 0x56271df03aae <unknown>
#13 0x56271df212a3 <unknown>
#14 0x56271def7ecf <unknown>
#15 0x56271df41588 <unknown>
#16 0x56271df41706 <unknown>
#17 0x56271df5b8b2 <unknown>
#18 0x7f9da4bc5ea7 <unknown>

Chapters:   0%|                           | 3/1705 [03:09<30:13:48, 63.94s/item]
Chapters:   0%|                           | 3/1705 [04:21<41:10:25, 87.09s/item]
Traceback (most recent call last):
  File "/home/avggeek/.local/bin/lncrawl", line 8, in <module>
    sys.exit(main())
  File "/home/avggeek/.local/lib/python3.9/site-packages/lncrawl/__init__.py", line 14, in main
    start_app()
  File "/home/avggeek/.local/lib/python3.9/site-packages/lncrawl/core/__init__.py", line 68, in start_app
    run_bot(bot)
  File "/home/avggeek/.local/lib/python3.9/site-packages/lncrawl/bots/__init__.py", line 16, in run_bot
    ConsoleBot().start()
  File "/home/avggeek/.local/lib/python3.9/site-packages/lncrawl/bots/console/integration.py", line 92, in start
    self.app.start_download()
  File "/home/avggeek/.local/lib/python3.9/site-packages/lncrawl/core/app.py", line 155, in start_download
    fetch_chapter_body(self)
  File "/home/avggeek/.local/lib/python3.9/site-packages/lncrawl/core/downloader.py", line 88, in fetch_chapter_body
    for progress in app.crawler.download_chapters(app.chapters):
  File "/home/avggeek/.local/lib/python3.9/site-packages/lncrawl/templates/browser/basic.py", line 111, in download_chapters
    chapter.body = self.download_chapter_body_in_browser(chapter)
  File "/home/avggeek/.lncrawl/sources/en/w/wuxiacom.py", line 221, in download_chapter_body_in_browser
    content = self.browser.find("chapter-content", By.CLASS_NAME).as_tag()
  File "/home/avggeek/.local/lib/python3.9/site-packages/lncrawl/core/browser.py", line 162, in find
    return self._driver.find_element(by, selector)
  File "/home/avggeek/.local/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 856, in find_element
    return self.execute(Command.FIND_ELEMENT, {
  File "/home/avggeek/.local/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 427, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/home/avggeek/.local/lib/python3.9/site-packages/selenium/webdriver/remote/remote_connection.py", line 344, in execute
    return self._request(command_info[0], url, body=data)
  File "/home/avggeek/.local/lib/python3.9/site-packages/selenium/webdriver/remote/remote_connection.py", line 366, in _request
    response = self._conn.request(method, url, body=body, headers=headers)
  File "/usr/lib/python3/dist-packages/urllib3/request.py", line 78, in request
    return self.request_encode_body(
  File "/usr/lib/python3/dist-packages/urllib3/request.py", line 170, in request_encode_body
    return self.urlopen(method, url, **extra_kw)
  File "/usr/lib/python3/dist-packages/urllib3/poolmanager.py", line 375, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 445, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 440, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.9/http/client.py", line 1347, in getresponse
    response.begin()
  File "/usr/lib/python3.9/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.9/http/client.py", line 268, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.9/socket.py", line 704, in readinto
    return self._sock.recv_into(b)
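
The traceback above shows `find("chapter-content", By.CLASS_NAME)` being called once and failing with `no such element`. One possible cause, not confirmed anywhere in this thread, is that the chapter body has not rendered yet when the lookup runs; a login or guest-limit page would produce the same error. For the first case, the usual remedy is to retry the lookup until a timeout, which Selenium offers natively through `WebDriverWait` and `expected_conditions`. The sketch below shows the same idea with only the standard library; `wait_for` is a hypothetical helper, not part of lncrawl.

```python
import time


def wait_for(find, timeout=10.0, poll=0.5, retry_on=(Exception,)):
    """Call `find()` until it stops raising, or re-raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return find()
        except retry_on:
            # Give up with the last error once the deadline has passed.
            if time.monotonic() >= deadline:
                raise
            time.sleep(poll)
```

With the call from the traceback, usage would look like `wait_for(lambda: self.browser.find("chapter-content", By.CLASS_NAME))`, ideally with `retry_on` narrowed to Selenium's `NoSuchElementException` so that unrelated errors are not retried.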
dipu-bd commented 1 year ago

fixed