benibela / xidel

Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
http://www.videlibri.de/xidel.html
GNU General Public License v3.0

Persistent "Error: 503" on mangalib.me #77

Closed: Baltazar500 closed this issue 2 years ago

Baltazar500 commented 3 years ago

What is wrong with mangalib.me that xidel cannot get data from it? Is it some aggressive Cloudflare protection?

xidel -se "//title" 'https://mangalib.me'
Error: Internet/HTTP Error: 503 when talking to: https://mangalib.me/

The same thing happens with both the Windows and Linux versions.

Reino17 commented 3 years ago

That website doesn't exist as far as I can see. My browser returns "404 Not Found".

Baltazar500 commented 3 years ago

> That website doesn't exist as far as I can see. My browser returns "404 Not Found".

I open https://mangalib.me through the browser without any problems.

The problem occurs with xidel (error 503) and with some (probably old) versions of curl, which get the Cloudflare HTML back instead.

Maybe this site also has additional regional blocking? :(


Reino17 commented 3 years ago

A geoblock could very well be possible. I don't know if xidel can bypass the Cloudflare protection. Might be quite hard to accomplish if it is possible.

Baltazar500 commented 3 years ago

In my case the problem is not a geoblock (the site opens in a browser). Comparing what I get from different versions of curl (some return the mangalib.me HTML, others return the Cloudflare HTML plus a 403 error), the problem is probably either HTTP/2 or SSL, combined with Cloudflare protection reacting to what it takes for a possible DDoS >_< It reacts to xidel the same way.

benibela commented 3 years ago

Cloudflare often uses JavaScript, which would not work in Xidel.

HTTP/2 is also not supported.

If you are lucky, it only checks the user agent
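If the check really is limited to the user agent, sending a browser-like `User-Agent` header would be enough. A minimal Python sketch of that idea, using only the standard library; the user-agent string and the helper names `build_request` and `fetch` are illustrative assumptions, and none of this helps when the challenge depends on JavaScript:

```python
import urllib.request

# Assumed browser-like string; any current browser's value could be used instead.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Fetch url and return (status, body); urlopen raises HTTPError on 4xx/5xx."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")
```

Calling `fetch("https://mangalib.me")` would show whether the header alone changes the response; a 503 still surfaces as `urllib.error.HTTPError`.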

Baltazar500 commented 3 years ago

> If you are lucky, it only checks the user agent

Custom user-agent is set in curlrc, however, the output is different

> http/2 is also not supported

I'm not sure if this is the case. However, see:

Old curl 7.26.0: no data received (title: Just a moment...). out-01.txt

New curl 7.78.0, same OS: no data received (title: Please Wait... | Cloudflare). out-02.txt

Not very old curl 7.67.0, different OS: data received (title: Манга. Читать мангу онлайн на русском. Манга онлайн!, i.e. "Manga. Read manga online in Russian. Manga online!"). out-03.txt
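The three results differ only in which page title comes back, so a script can tell a challenge page from the real site by looking at the title. A rough Python sketch under that assumption; the helper names are hypothetical, and the list of challenge titles contains only the two reported in this thread, so other interstitial pages would not be recognized:

```python
import re

# Titles of the interstitial pages reported in this thread (not an exhaustive list).
CHALLENGE_TITLES = ("Just a moment...", "Please Wait... | Cloudflare")


def extract_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def is_challenge_page(html):
    """True if the page title matches one of the known challenge titles."""
    return extract_title(html) in CHALLENGE_TITLES
```

Running `is_challenge_page` over the contents of each saved output file would separate the first two results from the third.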

benibela commented 3 years ago

> Custom user-agent is set in curlrc,

Xidel ignores that file

> I'm not sure if this is the case. However, see:

Xidel with OpenSSL does not support HTTP/2.

I do not know about Xidel with wininet. Microsoft has a big message on the wininet documentation page, "For app containers since Windows 10, version 1709, HTTP/2 (see RFC7540) is on by default.", but then they only list HTTP 1.0 and 1.1 as supported versions.

> Old curl 7.26.0, no data received. (Just a moment...) out-01.txt
>
> New curl 7.78.0, same OS no data received. (Please Wait... | Cloudflare) out-02.txt
>
> Not very old curl 7.67.0, different OS, data received (Манга. Читать мангу онлайн на русском. Манга онлайн!) out-03.txt

Cloudflare is always weird

Baltazar500 commented 3 years ago

> Xidel ignores that file

I know. xidel fails, so I use curl to get the data.

> I do not know about Xidel with wininet. Microsoft has a big message on the wininet documentation page "For app containers since Windows 10, version 1709, HTTP/2 (see RFC7540) is on by default.", but then they only list HTTP 1.0 and 1.1 as supported versions

I am using *nix systems, not Windows, and I am not sure the problem is HTTP/2: curl 7.78.0 supports HTTP/2 and still receives no data.
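Whether a given curl build actually has HTTP/2 can be read from the `Features:` line of `curl --version`. A small Python sketch of that check; the function names are made up for illustration, and it assumes the features are listed space-separated on a single line starting with `Features:`:

```python
import subprocess


def features_from_version_output(text):
    """Parse the 'Features:' line of `curl --version` output into a set of names."""
    for line in text.splitlines():
        if line.startswith("Features:"):
            return set(line.split(":", 1)[1].split())
    return set()


def curl_supports_http2(curl="curl"):
    """Run `curl --version` and report whether HTTP2 is among its features."""
    result = subprocess.run(
        [curl, "--version"], capture_output=True, text=True, check=True
    )
    return "HTTP2" in features_from_version_output(result.stdout)
```

To see whether the protocol version matters at all, the same request can be repeated with curl's `--http1.1` and `--http2` options and the two responses compared.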

> Cloudflare is always weird

The problem is caused by Cloudflare, but for some reason curl 7.67 receives the data successfully (same external IP address, different system).

Baltazar500 commented 3 years ago

Probably because of work on the mangalib.me servers on September 13, or because providers changed something in their DPI systems, the trick with curl 7.67.0 has stopped working. Instead of the page title I now get "Please Wait... | Cloudflare" :(

I found a discussion of a similar case at https://curl.se/mail/archive-2021-05/0003.html, but no solution was found there either :(

benibela commented 2 years ago

Then there is nothing I can do about it

Baltazar500 commented 2 years ago

> Then there is nothing I can do about it

But a browser with NoScript enabled (so the Cloudflare captcha and other scripts are disabled) does receive the mangalib.me page :/