dipu-bd / lightnovel-crawler

Generate and download e-books from online sources.
https://pypi.org/project/lightnovel-crawler/
GNU General Public License v3.0

Does anybody else have this problem when using lncrawl in 69xinshu #2269

Closed itsyahma closed 8 months ago

itsyahma commented 9 months ago

Let us know

Novel URL: https://www.69xinshu.com/book/9969673.htm
App Location: PIP | EXE | Discord | Telegram
App Version: x.y.z

Describe this issue

itsyahma commented 9 months ago

[Screenshot: image_2024-02-13_201534450]

itsyahma commented 9 months ago

this is what appears

camp00000 commented 9 months ago

I fixed 69xinshu in #2256, which is not released yet. That branch fixes the first error; I'm not sure about the 403 you got. You can try running lncrawl with the --auto-proxy argument to see if that helps with the 403, but downloading won't work unless you use the dev branch or wait for the next release.

ncuxie commented 9 months ago

I used the dev branch to download from this website and found that there is a problem when downloading more than 650 chapters at once.

[two screenshots]

The website denies access. It seems that the anti-crawler protection has been upgraded. (:з」∠)

camp00000 commented 9 months ago

Darn, that's not good. I'll see if there's anything that can be done there. For now, you can try downloading in batches: you can select by chapter range, so you can enter 1-500, then 501-1000, and so on. That will probably bypass this if it's just a simple check.

Maybe also combine that with --auto-proxy to get new source IPs for each download batch.

Let me know if that works. @ncuxie
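The batch ranges suggested above can be generated rather than typed by hand. This is a minimal sketch, not part of lncrawl: the function name `chapter_batches` and the default batch size of 500 are assumptions for illustration, and the printed ranges still have to be entered manually when the crawler asks which chapters to download.

```python
def chapter_batches(total, batch_size=500):
    """Yield inclusive (first, last) chapter ranges that cover 1..total."""
    if total < 0 or batch_size < 1:
        raise ValueError("total must be >= 0 and batch_size must be >= 1")
    for first in range(1, total + 1, batch_size):
        # The last batch is cut off at the final chapter.
        yield first, min(first + batch_size - 1, total)


if __name__ == "__main__":
    # 1357 is the chapter count reported for the novel in this thread.
    for first, last in chapter_batches(1357):
        print(f"{first}-{last}")
```

A smaller batch size, for example 250, may be safer if the site starts blocking earlier than expected.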

ncuxie commented 9 months ago

[screenshot]

I found that it started getting errors from chapter 251, so I tried downloading only chapters 1-250 and didn't encounter any problems. However, when I tried to download a second time, it couldn't even get the chapter directory.

[screenshot]

So I used a browser to access the website.

[screenshot]

This looks a bit troublesome (:з」∠)

After a period of time without completing the verification, the website becomes inaccessible.

But if you are reading novels normally, you will not encounter the verification even if you read more than 250 chapters a day, so I guess the cause may be that the downloads are too frequent. 🤔

I will try --auto-proxy later.

@camp00000

camp00000 commented 8 months ago

There's rate-limiting that can be done on the downloader-side but no way to enforce downloading only X amount of chapters.
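As an illustration of what downloader-side rate limiting means in general, here is a minimal sketch of a throttle that enforces a minimum interval between requests. It is not lncrawl's implementation; the class name `Throttle` and its parameters are hypothetical, and the clock and sleep functions are injectable only so the behavior can be checked without waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `wait()` before each chapter request slows the download down, but, as noted above, it does not cap how many chapters are fetched in total.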

My hopes are currently on the --auto-proxy approach. IP reputation may or may not break that, but we'll see.

To note: if I understood correctly, --auto-proxy makes the crawler cycle through proxies when downloading, so it may be possible to download an entire novel with lots of chapters at once, provided that this is actually what it does and the IPs aren't all or mostly banned already.

Let me know how it goes.

ncuxie commented 8 months ago

I can get chapters without --auto-proxy but not with --auto-proxy.

$ lncrawl -s https://www.69xinshu.com/book/40107.htm

===================================================
[#] Lightnovel Crawler v3.4.2
https://github.com/dipu-bd/lightnovel-crawler
---------------------------------------------------------------------------------------
-> Press Ctrl + C to exit
Retrieving novel info...
[#] 从时间停止开始纵横诸天
14 volumes and 1357 chapters found.
- https://www.69xinshu.com/book/40107.htm
? Enter output directory: C:\Users\XIE\Lightnovels\www-69xinshu-com\Cong Shi Jian Ting Zhi Kai Shi Zong Heng Zhu Tian

$ lncrawl -s https://www.69xinshu.com/book/40107.htm --auto-proxy

===================================================
[#] Lightnovel Crawler v3.4.2
https://github.com/dipu-bd/lightnovel-crawler
---------------------------------------------------------------------------------------
Sources: 100%|█████████████████████| 24/24 [00:03<00:00, 6.20file/s]
-> Press Ctrl + C to exit
Retrieving novel info...
Exception in thread Thread-4:
Traceback (most recent call last):
  File "D:\anaconda3\lib\threading.py", line 980, in _bootstrap_inner
    self.run()
  File "D:\anaconda3\lib\threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\XIE\.lncrawl\sources\zh\69shuba.py", line 70, in read_novel_info
    soup = self.get_soup(self.novel_url, encoding="gbk")
  File "D:\anaconda3\lib\site-packages\lncrawl\core\scraper.py", line 304, in get_soup
    response = self.get_response(url, **kwargs)
  File "D:\anaconda3\lib\site-packages\lncrawl\core\scraper.py", line 201, in get_response
    return self.__process_request(
  File "D:\anaconda3\lib\site-packages\lncrawl\core\scraper.py", line 130, in __process_request
    raise e
  File "D:\anaconda3\lib\site-packages\lncrawl\core\scraper.py", line 123, in __process_request
    response.raise_for_status()
  File "D:\anaconda3\lib\site-packages\requests\models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://www.69xinshu.com/book/40107.htm
! Error: No chapters found
  File "D:\anaconda3\lib\site-packages\lncrawl\bots\console\integration.py", line 107, in start
    raise e
  File "D:\anaconda3\lib\site-packages\lncrawl\bots\console\integration.py", line 101, in start
    _download_novel()
  File "D:\anaconda3\lib\site-packages\lncrawl\bots\console\integration.py", line 85, in _download_novel
    self.app.get_novel_info()
  File "D:\anaconda3\lib\site-packages\lncrawl\core\app.py", line 137, in get_novel_info
    raise Exception("No chapters found")
----------------------------------------------------------------------
- https://github.com/dipu-bd/lightnovel-crawler/issues
======================================================================

camp00000 commented 8 months ago

It looks like some of the proxies are likely already on a blacklist or have a very bad IP reputation.

So the other somewhat simple way forward would be to find working proxies for 69xinshu and test them. Once you have a few suitable ones, you could make a custom proxies file and use it as described in the lncrawl help section, hopefully to download everything at once:

--proxy-file FILE  Proxies as SCHEME://HOST:PORT@USER:PASSWORD format in each line. All except HOST are optional
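Testing candidate proxies and writing the survivors to a file could be sketched as follows. This is only an illustration under stated assumptions: the helper names `proxy_works`, `select_proxies` and `write_proxy_file` are hypothetical and not part of lncrawl, the check uses Python's standard `urllib` rather than the crawler's own request logic, and the sketch handles only proxies without credentials, written as SCHEME://HOST:PORT.

```python
import http.client
import urllib.request


def proxy_works(proxy_url, test_url, timeout=10.0):
    """Return True if test_url answers with HTTP 200 through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers HTTP errors such as 401/403, timeouts and refused connections.
        return False


def select_proxies(candidates, check):
    """Keep the candidates for which check(proxy_url) returns True."""
    return [proxy for proxy in candidates if check(proxy)]


def write_proxy_file(path, proxies):
    """Write one proxy per line, the layout the --proxy-file help text describes."""
    with open(path, "w", encoding="utf-8") as handle:
        for proxy in proxies:
            handle.write(proxy + "\n")
```

For example, `write_proxy_file("proxies.txt", select_proxies(candidates, lambda p: proxy_works(p, novel_url)))` would produce a file that could then be passed with `--proxy-file proxies.txt`; `candidates` and `novel_url` are placeholders you supply yourself.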

Otherwise you can slowly download part-by-part with your own IP and that might work given enough time and only selecting a few hundred chaps per day max. I suggest this way if you're fine waiting a bit and downloading in parts. The EPUB can always be concatenated into one big thing with some tool at a later time if you prefer it that way.

To make --auto-proxy viable as is for this source, I think the whole proxy handling would need to be reworked to treat certain status codes (like 401 Unauthorized) as potential proxy issues instead of server or request issues. So that's not very feasible.
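To make the idea concrete, treating certain status codes as a proxy problem could look roughly like the sketch below. It does not reflect how lncrawl's scraper is structured: the function `fetch_with_rotation`, the injected `fetch(url, proxy)` callable and the set of suspect status codes are all assumptions for illustration.

```python
# Assumption: statuses that more likely indicate a blocked proxy than a bad request.
PROXY_SUSPECT_STATUSES = frozenset({401, 403, 429})


class ProxiesExhausted(Exception):
    """Raised when every proxy returned a suspect status."""


def fetch_with_rotation(url, proxies, fetch, suspect=PROXY_SUSPECT_STATUSES):
    """Try each proxy in order and return (proxy, status, body) for the first
    response whose status is not in `suspect`.

    `fetch(url, proxy)` is a hypothetical callable returning (status, body).
    """
    last_status = None
    for proxy in proxies:
        status, body = fetch(url, proxy)
        if status in suspect:
            # Blame the proxy, not the request, and move on to the next one.
            last_status = status
            continue
        return proxy, status, body
    raise ProxiesExhausted(f"all proxies rejected for {url} (last status: {last_status})")
```

The design choice here is that a 401 or 403 no longer aborts the download immediately; it only fails once every available proxy has been rejected.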

wizerdo37 commented 8 months ago

This source of raws does not rate-limit downloads: https://www.ddxsss.com/

camp00000 commented 8 months ago

I checked, and lncrawl doesn't support this source yet. If it indeed has no rate-limiting like 69xinshu does, it would be a viable alternative. The site structure looks relatively similar as well, so adding it shouldn't be too big of an issue.

I even found a novel with the same title as mentioned in the above logs https://www.ddxsss.com/book/46000/ so they seem to overlap in that part as well.

If someone wants to create an issue to add this source I'll look into doing that later this week.

camp00000 commented 8 months ago

I actually went ahead and added the crawler already. It's currently a pull request, so once it's merged into dev you can test it out by installing the newest dev version locally. https://github.com/dipu-bd/lightnovel-crawler/pull/2287

I was able to download 1.3k chapters at once without any significant issues. The chapters reported with HTTP 503 did have their content available, so the request seems to have failed on just one of the few retries each chapter gets. There was no blocking from Cloudflare, CAPTCHAs or the like.

Retrieving novel info...

📒 从时间停止开始纵横诸天
14 volumes and 1357 chapters found.
🔗 https://www.ddxsss.com/book/46000

? Enter output directory: /home/.../lightnovel-crawler/Lightnovels/www-ddxsss-com/Cong Shi Jian Ting Zhi Kai Shi Zong Heng Zhu Tian
? Which chapters to download? Everything! (1357 chapters)
? 1357 chapters selected Continue
? Which output formats to create? [epub]
? How many files to generate? Pack everything into a single file
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/148.html
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/433.html
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/451.html
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/457.html
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/927.html
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/1135.html
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/1150.html
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/1225.html
Chapters: 100%|██████████████████████████████| 1357/1357 [00:42<00:00, 32.19item/s]
  Images: 100%|██████████████████████████████| 1/1 [00:00<00:00,  8.48item/s]
Created: Cong Shi Jian Ting Zhi Kai Shi Zong Heng Zhu Tian c1-1357.epub
✨ Task completed
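The log above shows isolated 503 responses that did not prevent the chapters from being downloaded. A generic way to absorb such transient failures is to retry with a growing delay. The sketch below is an assumption-laden illustration, not lncrawl's actual retry logic: `fetch_with_retries`, the injected `fetch(url)` callable, the attempt count and the delays are all hypothetical.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) -> (status, body), retrying on HTTP 503.

    The delay doubles after each failed attempt (exponential backoff).
    If every attempt returns 503, the last (status, body) is returned.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        status, body = fetch(url)
        if status != 503:
            return status, body
        if attempt < attempts - 1:
            # Wait before the next attempt: base_delay, 2*base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return status, body
```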