Fix this source: www.wuxiaworld.com not working.

dipu-bd / lightnovel-crawler

Generate and download e-books from online sources.

https://pypi.org/project/lightnovel-crawler/

GNU General Public License v3.0


Fix this source: www.wuxiaworld.com not working. #1579

Closed budikesuma closed 1 year ago

budikesuma commented 1 year ago

@dipu-bd Hello, dev,

I don't know what happened to the www.wuxiaworld.com website, but at the moment lncrawler can't find the novel chapters. It seems it can't reach the address: https://api.wuxiaworld.com/wuxiaworld.api.v2.Novels/GetNovel

I tried Termux, the Discord bot, and Google Colab, but none of them worked.

Test link: https://www.wuxiaworld.com/novel/first-immortal-of-the-sword

(Screenshots attached: one from Termux, three from Chrome.)

dipu-bd commented 1 year ago

@idMysteries #1580 does not solve this issue. This may be a Cloudflare issue, or their internal security was updated recently.

idMysteries commented 1 year ago

When I checked, the error was in the API URL, and the fix was changing api to api2. But if it doesn't work now, then apparently it's Cloudflare. Damn, I'm starting to hate Cloudflare.

idMysteries commented 1 year ago

I think it's not cloudflare. @dipu-bd They don't have cookies in the request.

dipu-bd commented 1 year ago

> Damn, I'm starting to hate Cloudflare

welcome to the club

idMysteries commented 1 year ago

I sent the request again in the browser and it was not blocked.

dipu-bd commented 1 year ago

> I sent the request again in the browser and it was not blocked.

This is gonna be difficult to find

idMysteries commented 1 year ago

But the program doesn't work for some reason. Maybe there is an error inside the request? I'm 70% sure it's not Cloudflare.

idMysteries commented 1 year ago

POST /wuxiaworld.api.v2.Novels/GetNovel HTTP/2
Host: api2.wuxiaworld.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:104.0) Gecko/20100101 Firefox/104.0
Accept: */*
Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate, br
Referer: https://www.wuxiaworld.com/novel/the-max-level-hero-strikes-back
Content-Length: 38
Origin: https://www.wuxiaworld.com
DNT: 1
Connection: keep-alive
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: no-cors
Sec-Fetch-Site: same-site
TE: trailers
authorization: Bearer 1FB7C5BC5F3369F06062EDB20E25C8884E1E07A1E7DFD727F4D873C666FF4E07-1
content-type: application/grpc-web+proto
x-grpc-web: 1
Pragma: no-cache
Cache-Control: no-cache

idMysteries commented 1 year ago

@dipu-bd maybe wrong slug?
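
One way to check the slug theory is to rebuild the request body by hand and compare its size with the `Content-Length: 38` in the capture above. The following is a minimal sketch, not lncrawler's actual code. It assumes the body is a single gRPC-web frame (1 flag byte plus a 4-byte big-endian length) wrapping a protobuf message that holds only the slug as a string field. The field number `SLUG_FIELD = 2` is an assumption and is not taken from this thread; any field number from 1 to 15 gives the same length.

```python
import struct


def encode_string_field(field_number: int, value: str) -> bytes:
    """Encode one length-delimited protobuf string field.

    Sketch only: handles field numbers up to 15 and payloads up to 127 bytes,
    so that the tag and the length each fit in a single byte.
    """
    data = value.encode("utf-8")
    if not 1 <= field_number <= 15 or len(data) > 127:
        raise ValueError("sketch only handles single-byte tag and length")
    tag = (field_number << 3) | 2  # wire type 2 = length-delimited
    return bytes([tag, len(data)]) + data


def grpc_web_frame(message: bytes) -> bytes:
    """Prefix a message with the 5-byte gRPC-web header (flag 0 = data frame)."""
    return struct.pack(">BI", 0, len(message)) + message


SLUG_FIELD = 2  # assumption: the real field number comes from the site's proto file
slug = "the-max-level-hero-strikes-back"  # slug from the Referer header above

body = grpc_web_frame(encode_string_field(SLUG_FIELD, slug))
print(len(body))  # 38
```

The 38 bytes match the captured Content-Length, which is consistent with the browser sending nothing but the slug in the body. If the crawler's body has the same size and bytes, the slug encoding is probably not what gets the request rejected.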

idMysteries commented 1 year ago

Okay, I'm too dumb for that. Maybe they have learned to detect fake requests, or the browser is doing something deceptive. Maybe there are some secret cookies. I can easily send requests in the browser, yet the browser shows that there are no cookies.

Perhaps the people at wuxiaworld are tens (of millions) of times smarter than me.

idMysteries commented 1 year ago

@dipu-bd But are they using a certificate from cloudflare?

idMysteries commented 1 year ago

I tried to fix it, but now I'm 90% sure it's cloudflare. T_T

budikesuma commented 1 year ago

@dipu-bd

I just used Google Colab and forced an install of lncrawler version 2.29.4. It must be a fresh install as the first step in the browser (you'll see the red IPython warning at the bottom after installation); if the install overwrites another version, it also ends with an error at the URL https://api.). With that, I still managed to download the early chapters, although the last few chapters ended up empty or with errors. Maybe you need to check the files from your old version?

Tested link: https://www.wuxiaworld.com/novel/law-of-space-and-time

(Screenshots attached: four from Chrome.)

idMysteries commented 1 year ago

@dipu-bd what is this dark magic? Is the problem in the proto file (string)? I copied the code from version 2.29.4 and there is nothing interesting.

I'm not good at python's dark magic.

budikesuma commented 1 year ago

@idMysteries

Maybe the problem isn't in wuxiacom.py; maybe it's in downloader.py? 🤔

zerty commented 1 year ago

I've got kind of an ugly solution, I don't know if it can help: https://github.com/zerty/Wuxiaworld-to-epub

Basically, I use undetected-chromedriver and run JS in it to get the data with a POST request. I am using a modified version of the sonora client.py.

budikesuma commented 1 year ago

@dipu-bd @idMysteries

This article might be useful for this issue?

https://www.scrapingbee.com/blog/pyppeteer/

ShakeBake commented 1 year ago

Same issue for me on Windows and Termux.

budikesuma commented 1 year ago

I finally managed to get version 2.29.4 working on pyDroid3, so I can scrape wuxiaworld.com again. I just couldn't scrape the Translator's Thoughts section; my Python knowledge is zero, so I was going on logic alone. 😥

I suggest the developers take another look at the code in that version and re-implement it so that wuxiaworld.com can be crawled again.