Closed itsyahma closed 8 months ago
this is what appears
I fixed 69xinshu in #2256, which has not been released yet. That branch fixes the first error; I'm not sure about the 403 you got.
You can try running lncrawl with the --auto-proxy argument to see if that helps with the 403, but downloading won't work unless you use the dev branch or wait for the next release.
I used the dev branch to download from this website and found that there was a problem with downloading more than 650 chapters at once. The website will deny access. It seems that the anti-crawler has been upgraded. (:з」∠)
Darn, that's not good. I'll see if anything can be done there. For now you can try downloading in batches: you can select by chapter range, so entering 1-500, then 501-1000, and so on will probably bypass this if it's just a simple check.
You could also combine that with --auto-proxy to get new source IPs for each download batch.
Let me know if that works. @ncuxie
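For anyone wanting to work out the ranges up front, here is a minimal sketch that splits a chapter count into inclusive batches like the 1-500, 501-1000 example above. The function name `chapter_batches` is made up for illustration and is not part of lncrawl; the ranges still have to be entered in lncrawl's chapter selection by hand.

```python
from typing import Iterator, Tuple


def chapter_batches(total: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (first, last) chapter ranges covering 1..total.

    Hypothetical helper, not an lncrawl API.
    """
    if total < 0:
        raise ValueError("total must not be negative")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total + 1, batch_size):
        # The last batch is clipped so it never runs past the final chapter.
        yield first, min(first + batch_size - 1, total)


if __name__ == "__main__":
    # 1357 is the chapter count reported for the novel in this thread.
    for first, last in chapter_batches(1357, 500):
        print(f"{first}-{last}")
```

A smaller batch size can be passed if the site starts blocking earlier than expected.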
I found that it started getting errors from chapter 251, so I tried downloading only chapters 1-250 and didn't encounter any problems. However, when I tried to download a second time, it couldn't even get the chapter directory, so I opened the website in a browser.
This looks a bit troublesome (:з」∠)
After a period of time without completing the verification, the website becomes inaccessible.
But if you are reading novels normally, you will not encounter verification even if you read more than 250 chapters a day, so I guess the cause may be downloading too frequently. 🤔
I will try --auto-proxy later.
@camp00000
Rate limiting can be done on the downloader side, but there's no way to enforce downloading only X chapters.
My hopes are currently on the --auto-proxy approach. IP reputation may or may not break that, but we'll see.
To note: if I understood correctly, auto-proxy makes the crawler cycle through proxies when downloading, so it may be possible to download an entire novel with lots of chapters at once using the auto-proxy option, provided that this is actually what it does and the IPs aren't all or mostly banned already.
Let me know how it goes.
I can get chapters without --auto-proxy but not with --auto-proxy.
$ lncrawl -s https://www.69xinshu.com/book/40107.htm
===================================================
[#] Lightnovel Crawler v3.4.2
https://github.com/dipu-bd/lightnovel-crawler
\---------------------------------------------------------------------------------------
-> Press Ctrl + C to exit
Retrieving novel info...
[#] 从时间停止开始纵横诸天
14 volumes and 1357 chapters found.
\- https://www.69xinshu.com/book/40107.htm
? Enter output directory: C:\Users\XIE\Lightnovels\www-69xinshu-com\Cong Shi Jian Ting Zhi Kai Shi Zong Heng Zhu Tian
**_$ lncrawl -s https://www.69xinshu.com/book/40107.htm --auto-proxy_**
===================================================
[#] Lightnovel Crawler v3.4.2
https://github.com/dipu-bd/lightnovel-crawler
\---------------------------------------------------------------------------------------
Sources: 100%|█████████████████████| 24/24 [00:03<00:00, 6.20file/s]
-> Press Ctrl + C to exit
Retrieving novel info...
Exception in thread Thread-4:
Traceback (most recent call last):
File "D:\anaconda3\lib\threading.py", line 980, in _bootstrap_inner
self.run()
File "D:\anaconda3\lib\threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\XIE\.lncrawl\sources\zh\69shuba.py", line 70, in read_novel_info
soup = self.get_soup(self.novel_url, encoding="gbk")
File "D:\anaconda3\lib\site-packages\lncrawl\core\scraper.py", line 304, in get_soup
response = self.get_response(url, **kwargs)
File "D:\anaconda3\lib\site-packages\lncrawl\core\scraper.py", line 201, in get_response
return self.__process_request(
File "D:\anaconda3\lib\site-packages\lncrawl\core\scraper.py", line 130, in __process_request
raise e
File "D:\anaconda3\lib\site-packages\lncrawl\core\scraper.py", line 123, in __process_request
response.raise_for_status()
File "D:\anaconda3\lib\site-packages\requests\models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
**requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://www.69xinshu.com/book/40107.htm**
! Error: No chapters found
It looks like some of the proxies are already on a blacklist or have a very bad IP reputation.
So the other somewhat simple way forward would be to find working proxies for 69xinshu and test them. Once you have a few suitable ones, you could put them in a custom proxies file and pass it as described in the lncrawl help section, to hopefully download everything at once:
--proxy-file FILE Proxies as SCHEME://HOST:PORT@USER:PASSWORD format in each line. All except HOST are optional
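To sanity-check a hand-written proxies file before using it, a small parser for the SCHEME://HOST:PORT@USER:PASSWORD line format from the help text could look like the sketch below. This is only my reading of that one-line description; the `Proxy` type and `parse_proxy_line` are hypothetical names, lncrawl's own parsing may differ, and IPv6 hosts are not handled.

```python
from typing import NamedTuple, Optional


class Proxy(NamedTuple):
    """One parsed proxy line (hypothetical type, not from lncrawl)."""
    scheme: Optional[str]
    host: str
    port: Optional[int]
    user: Optional[str]
    password: Optional[str]


def parse_proxy_line(line: str) -> Proxy:
    """Parse SCHEME://HOST:PORT@USER:PASSWORD, where only HOST is required."""
    text = line.strip()
    if not text:
        raise ValueError("empty proxy line")

    scheme: Optional[str] = None
    if "://" in text:
        scheme, text = text.split("://", 1)

    user: Optional[str] = None
    password: Optional[str] = None
    if "@" in text:
        # In this format the credentials come AFTER the host part.
        text, credentials = text.split("@", 1)
        user_text, _, password_text = credentials.partition(":")
        user = user_text or None
        password = password_text or None

    host, _, port_text = text.partition(":")
    if not host:
        raise ValueError(f"missing host in proxy line: {line!r}")
    port = int(port_text) if port_text else None
    return Proxy(scheme or None, host, port, user, password)


if __name__ == "__main__":
    print(parse_proxy_line("http://10.0.0.1:8080@alice:secret"))
    print(parse_proxy_line("example.org"))
```

A line that fails to parse raises ValueError, which makes typos in the file easy to spot before a long download.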
Otherwise you can slowly download part-by-part with your own IP and that might work given enough time and only selecting a few hundred chaps per day max. I suggest this way if you're fine waiting a bit and downloading in parts. The EPUB can always be concatenated into one big thing with some tool at a later time if you prefer it that way.
To make --auto-proxy viable as is for this source, I think the whole proxy handling would need to be reworked to treat certain status codes (like 401 Unauthorized) as potential proxy issues instead of server/request issues. So that's not very feasible.
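To illustrate what "treating certain status codes as proxy issues" would mean, here is a rough sketch that moves on to the next proxy whenever a response status suggests the proxy itself was rejected. It is not how lncrawl's scraper works today; `fetch_via_proxies`, the `fetch` callable, and the set of suspect status codes are all assumptions made for this example.

```python
from typing import Callable, Iterable, Optional, Tuple

# Assumed set of statuses that may mean "this proxy is blocked" rather than
# "this request is wrong". Which codes belong here is a guess.
PROXY_SUSPECT_STATUSES = frozenset({401, 403, 407, 429})


def fetch_via_proxies(
    url: str,
    proxies: Iterable[str],
    fetch: Callable[[str, str], int],
) -> Tuple[str, int]:
    """Try each proxy in turn and return (proxy, status) for the first one
    whose status is not in PROXY_SUSPECT_STATUSES.

    `fetch(url, proxy)` is a caller-supplied function returning an HTTP
    status code; it stands in for the real request logic.
    """
    last_status: Optional[int] = None
    for proxy in proxies:
        status = fetch(url, proxy)
        if status not in PROXY_SUSPECT_STATUSES:
            return proxy, status
        # Suspect status: blame the proxy and move on to the next one.
        last_status = status
    raise RuntimeError(
        f"every proxy was rejected for {url} (last status: {last_status})"
    )


if __name__ == "__main__":
    # Fake fetcher for demonstration: only proxy "c" is accepted.
    fake_statuses = {"a": 401, "b": 403, "c": 200}
    print(fetch_via_proxies("https://example.org/", ["a", "b", "c"],
                            lambda url, proxy: fake_statuses[proxy]))
```

The hard part in practice is that a 401 or 403 can also be a genuine answer from the site, so a real implementation would need some limit on how many proxies it burns through per request.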
This raws site does not rate-limit downloads: https://www.ddxsss.com/
I checked and lncrawl doesn't support this source yet, but if it really has no rate limiting like 69xinshu does, it would be a viable alternative. The site structure looks relatively similar as well, so adding it shouldn't be too big of an issue.
I even found a novel with the same title as mentioned in the above logs https://www.ddxsss.com/book/46000/ so they seem to overlap in that part as well.
If someone wants to create an issue to add this source I'll look into doing that later this week.
I actually went ahead and added the crawler already, it's currently a pull request so once it's merged into dev you can test it out by installing the newest dev version locally. https://github.com/dipu-bd/lightnovel-crawler/pull/2287
I was able to download 1.3k chaps at once without any significant issues. The chapters reported with HTTP 503 did have their content available, so each of them seems to have failed only once out of the few retries it gets per chapter, and there was no blocking from Cloudflare, captchas, or the like.
Retrieving novel info...
📒 从时间停止开始纵横诸天
14 volumes and 1357 chapters found.
🔗 https://www.ddxsss.com/book/46000
? Enter output directory: /home/.../lightnovel-crawler/Lightnovels/www-ddxsss-com/Cong Shi Jian Ting Zhi Kai Shi Zong Heng Zhu Tian
? Which chapters to download? Everything! (1357 chapters)
? 1357 chapters selected Continue
? Which output formats to create? [epub]
? How many files to generate? Pack everything into a single file
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/148.html
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/433.html
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/451.html
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/457.html
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/927.html
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/1135.html
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/1150.html
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/1225.html
Chapters: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1357/1357 [00:42<00:00, 32.19item/s]
Images: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 8.48item/s]
Created: Cong Shi Jian Ting Zhi Kai Shi Zong Heng Zhu Tian c1-1357.epub
✨ Task completed