apify / got-scraping

HTTP client made for scraping based on got.

fix: retry without decompress #63

Closed · szmarczak closed this 2 years ago

szmarczak commented 2 years ago

This works. However, if an entire website performs compression incorrectly, then all requests to it would end up duplicated. I think we could do some caching in this regard (to know when to disable decompress)?

TODO: tests
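For context, the fallback idea sketched outside of got-scraping itself could look roughly like this (a minimal sketch, not the PR's actual implementation; the exact error codes checked are an assumption):

```js
import got from 'got';

// Rough sketch: retry once without compression when decompression fails.
async function fetchWithCompressionFallback(url) {
    try {
        return await got(url);
    } catch (error) {
        // Broken server-side compression tends to surface as zlib read
        // errors; the specific codes here are an assumption.
        if (error.code === 'Z_DATA_ERROR' || error.code === 'Z_BUF_ERROR') {
            // decompress: false also stops got from sending the
            // accept-encoding header, so the server sends raw data.
            return got(url, { decompress: false });
        }
        throw error;
    }
}
```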

mnmkng commented 2 years ago

Yeah, I kinda like the idea. Have you omitted the decompression from the PR just for now or do you expect the user to do it?

I think we should try a few decompressions and then perhaps cache the working one for the given domain, disable decompress for all requests and use the found decompression method instead.

szmarczak commented 2 years ago

> Have you omitted the decompression from the PR just for now or do you expect the user to do it?

It completely omits compression, so there's nothing to decompress.

This is a very rare case, so I don't think it's worth implementing a decompression mechanism, since we can't transform chunks in Got directly. In Got, we would need to design an API for this first. Then, tests. It would be released as an experimental option anyway, and it would take weeks to stabilize.

> use the found decompression method instead

This would need to be done in the SDK. Something like this: https://github.com/node-fetch/node-fetch/blob/41f53b9065a00bc73d24215d42aacdcd284b199c/src/index.js#L262-L301
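For reference, the gist of the linked node-fetch code is roughly this (a simplified sketch; node-fetch additionally peeks at the first body chunk to tell raw deflate from zlib-wrapped deflate):

```js
import zlib from 'node:zlib';

// Pick a decompression transform based on the content-encoding header.
function decompressStream(rawStream, headers) {
    const coding = headers['content-encoding'];
    // Be liberal with flushing so truncated bodies still emit data.
    const zlibOptions = {
        flush: zlib.constants.Z_SYNC_FLUSH,
        finishFlush: zlib.constants.Z_SYNC_FLUSH,
    };
    if (coding === 'gzip' || coding === 'x-gzip') {
        return rawStream.pipe(zlib.createGunzip(zlibOptions));
    }
    if (coding === 'deflate' || coding === 'x-deflate') {
        return rawStream.pipe(zlib.createInflate());
    }
    if (coding === 'br') {
        return rawStream.pipe(zlib.createBrotliDecompress());
    }
    return rawStream;
}
```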

However, requestAsBrowser is exposed, so we would need more code to make the end stream look like a Got stream, and we would somehow need to handle promises. tl;dr: not worth it, let's completely omit compression.

szmarczak commented 2 years ago

I guess we could use an LRU cache of hosts with 1000 max entries - it would disable compression automatically for those?
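Something along these lines (a sketch using a plain Map as the LRU, since Maps preserve insertion order; all names are made up):

```js
const MAX_HOSTS = 1000;
const hostsWithBrokenCompression = new Map();

function markBrokenCompression(host) {
    // Re-inserting moves the host to the most-recently-used position.
    hostsWithBrokenCompression.delete(host);
    hostsWithBrokenCompression.set(host, true);
    if (hostsWithBrokenCompression.size > MAX_HOSTS) {
        // Evict the least recently used host (the oldest insertion).
        const oldest = hostsWithBrokenCompression.keys().next().value;
        hostsWithBrokenCompression.delete(oldest);
    }
}

function shouldDisableDecompress(host) {
    if (!hostsWithBrokenCompression.has(host)) return false;
    markBrokenCompression(host); // refresh recency on access
    return true;
}
```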

mnmkng commented 2 years ago

I get your points, but how do you imagine this solving the problem in the SDK? If we omit decompression completely, the parsing of the response will just throw. If we decompress it, it does not matter whether we do it in the SDK or here. Or does it?

If we can't reliably "automate the problem away", then I guess we could just leave it as is and guide the user to turn off decompression and then decompress manually. What do you think?

szmarczak commented 2 years ago

> how do you imagine this solving the problem in the SDK?

One way would be to make requestAsBrowser return a piped stream, something like node-fetch already does internally. For promises, we would probably need to get the entire response and then decompress it synchronously.
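For the promise path, that could look roughly like this (a sketch with simplified encoding handling):

```js
import zlib from 'node:zlib';

// Decompress a fully buffered body in one synchronous call.
function decompressBuffer(buffer, contentEncoding) {
    if (contentEncoding === 'gzip') return zlib.gunzipSync(buffer);
    if (contentEncoding === 'deflate') return zlib.inflateSync(buffer);
    if (contentEncoding === 'br') return zlib.brotliDecompressSync(buffer);
    return buffer;
}
```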

> If we omit decompression completely, the parsing of the response will just throw.

Why would it throw?

> If we decompress it, it does not matter whether we do it in the SDK or here. Or does it?

It just occurred to me that we could use Got handlers to fix streams and an afterResponse hook to fix promises. However, the end stream would still need patching to implement the Got-specific stream properties.
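The promise side could look roughly like this (a sketch; `looksCompressed` is a made-up placeholder for the actual detection, and here it only checks the gzip magic bytes):

```js
import got from 'got';
import zlib from 'node:zlib';

// Hypothetical detection: gzip bodies start with the bytes 0x1f 0x8b.
const looksCompressed = (body) =>
    body.length > 2 && body[0] === 0x1f && body[1] === 0x8b;

const instance = got.extend({
    hooks: {
        afterResponse: [
            (response) => {
                // If the body is still compressed, fix it up here.
                if (looksCompressed(response.rawBody)) {
                    response.rawBody = zlib.gunzipSync(response.rawBody);
                }
                return response;
            },
        ],
    },
});
```

The stream side is the harder part: a handler would have to swap in a decompressed stream while preserving the Got-specific properties, which is the patching mentioned above.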

> If we can't reliably "automate the problem away", then I guess we could just leave it as is and guide the user to turn off decompression and then decompress manually. What do you think?

We can automate the problem away :) The fix is to retry the request without accepting compressed data, so the server is bound to send raw data. Though an entire server can behave incorrectly, so it's necessary to cache which hosts need compression disabled in order to prevent duplicated requests.

mnmkng commented 2 years ago

Ok, I guess I just misunderstood.

> If we omit decompression completely, the parsing of the response will just throw.

> Why would it throw?

E.g. in CheerioCrawler, because you can't parse a compressed response with Cheerio.

I was just referring to the fact that we need to decompress it somewhere, so it would be better to do it here rather than in the SDK. If it's impossible to do it here, then we can do it in the SDK, but I would prefer to do it here, because semantically it belongs here. This client should behave as closely to a browser as possible, so being able to handle invalid compression fits here best.

> It just occurred to me that we could use Got handlers to fix streams and an afterResponse hook to fix promises. However, the end stream would still need patching to implement the Got-specific stream properties.

Let's try that, if it's not too much of a hack. Not sure how crazy the patching will be :)

szmarczak commented 2 years ago

> E.g. in CheerioCrawler, because you can't parse a compressed response with Cheerio.

The response can't be compressed: when retrying, the `decompress` option is false, so we won't send the `accept-encoding` header, and the server will simply send raw data ready to consume.
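In other words (a quick check, assuming got's documented behavior; the URL is a placeholder):

```js
import got from 'got';

// With decompress: false, got does not add the accept-encoding header,
// so a well-behaved server replies with an uncompressed body.
const response = await got('https://example.com', { decompress: false });
console.log(response.request.options.headers['accept-encoding']); // undefined
```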

> Let's try that, if it's not too much of a hack. Not sure how crazy the patching will be :)

I'll prepare that in a second PR, to have a nice comparison.

mnmkng commented 2 years ago

Ok, with the header explanation I finally get it. I didn't know that `decompress: false` also turns off compression.

We have to be careful about this, because residential proxies are paid per GB of traffic, so the bill for them would explode.