Closed · szmarczak closed this 2 years ago
Yeah, I kinda like the idea. Have you omitted the decompression from the PR just for now or do you expect the user to do it?
I think we should try a few decompressions and then perhaps cache the working one for the given domain, disable `decompress` for all requests to that domain, and use the found decompression method instead.
> Have you omitted the decompression from the PR just for now or do you expect the user to do it?
It completely omits compression, so there's nothing to decompress.
This is a very rare case, so I don't think it's worth implementing a decompression mechanism, since we can't transform chunks in Got directly. In Got, we would need to design an API for this first, then tests. It would be released as an experimental option anyway, and it would take weeks to stabilize.
> use the found decompression method instead
This would need to be done in the SDK. Something like this: https://github.com/node-fetch/node-fetch/blob/41f53b9065a00bc73d24215d42aacdcd284b199c/src/index.js#L262-L301
However `requestAsBrowser` is exposed, so we would need more code to make the end stream look like a Got stream and somehow handle promises. tl;dr not worth it, let's completely omit compression.
I guess we could use an LRU cache of hosts with 1000 max entries - it would disable compression automatically for those?
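The proposed per-host LRU cache could be sketched like this (the class name and cap are placeholders; a `Map` preserves insertion order, which is enough for a simple LRU):

```javascript
// Remembers hosts whose responses failed to decompress, so follow-up
// requests to them can skip `accept-encoding` right away.
class BrokenCompressionCache {
    constructor(maxEntries = 1000) {
        this.maxEntries = maxEntries;
        this.hosts = new Map();
    }

    markBroken(host) {
        // Re-inserting moves the host to the "most recently used" end.
        this.hosts.delete(host);
        this.hosts.set(host, true);
        if (this.hosts.size > this.maxEntries) {
            // Evict the least recently used host (the first inserted key).
            const oldest = this.hosts.keys().next().value;
            this.hosts.delete(oldest);
        }
    }

    isBroken(host) {
        if (!this.hosts.has(host)) return false;
        this.markBroken(host); // refresh recency on access
        return true;
    }
}
```

The cap keeps memory bounded for long-running crawlers that hit many hosts.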
I get your points, but how do you imagine this solving the problem in the SDK? If we omit decompression completely, the parsing of the response will just throw. If we decompress it, it does not matter whether we do it in the SDK or here. Or does it?
If we can't reliably "automate the problem away" then I guess we could just leave it as is and guide the user to turning off decompression and then decompressing manually. What do you think?
> how do you imagine this solving the problem in the SDK?
One way would be to make `requestAsBrowser` return a piped stream, something like `node-fetch` already does internally. For promises we would probably need to get the entire response and then decompress it synchronously.
> If we omit decompression completely, the parsing of the response will just throw.
Why would it throw?
> If we decompress it, it does not matter whether we do it in the SDK or here. Or does it?
It just occurred to me now that we could use Got handlers in order to fix streams and an `afterResponse` hook to fix promises. However, the end stream would still need patching to implement Got-specific stream properties.
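For the promise side, the hook idea could take roughly this shape - an untested configuration sketch against Got's `extend()` API, where `bodyLooksCompressed` is a hypothetical helper (e.g. checking magic bytes against the `content-encoding` header):

```javascript
const got = require('got');

const client = got.extend({
    hooks: {
        afterResponse: [
            (response, retryWithMergedOptions) => {
                // If the body still looks compressed despite decompress: true,
                // retry the same request with compression disabled so the
                // server is forced to send raw data.
                if (response.request.options.decompress && bodyLooksCompressed(response)) {
                    return retryWithMergedOptions({ decompress: false });
                }
                return response;
            },
        ],
    },
});
```

The stream API would still need a handler (or manual patching) on top of this, which is the part that may turn out to be too much of a hack.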
> If we can't reliably "automate the problem away" then I guess we could just leave it as is and guide the user to turning off decompression and then decompressing manually. What do you think?
We can automate the problem away :) The trick is to retry the request without accepting compressed data, so the server is bound to send raw data. However, an entire server can behave incorrectly, so we need to cache which hosts need compression disabled in order to prevent duplicated requests.
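The retry-plus-cache flow described here can be sketched as a small wrapper. `doRequest` is a hypothetical injected request function (with Got it would map to `got(url, { decompress })`), and the `Z_DATA_ERROR` check stands in for "decompression failed":

```javascript
// Try the request normally; if decompression fails, remember the host and
// retry once with compression disabled so the server sends raw data.
async function requestWithCompressionFallback(doRequest, url, brokenHosts) {
    const host = new URL(url).host;

    // Skip the compressed attempt entirely for hosts we already know about,
    // to avoid duplicated requests.
    if (brokenHosts.has(host)) {
        return doRequest(url, { decompress: false });
    }

    try {
        return await doRequest(url, { decompress: true });
    } catch (error) {
        // Only fall back on decompression failures; rethrow anything else.
        if (error.code !== 'Z_DATA_ERROR') throw error;
        brokenHosts.add(host);
        return doRequest(url, { decompress: false });
    }
}

module.exports = { requestWithCompressionFallback };
```

A plain `Set` is used here for brevity; the real thing would be the capped LRU cache discussed above, so that long-running crawlers don't grow it without bound.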
Ok, I guess I just misunderstood.
> > If we omit decompression completely, the parsing of the response will just throw.
>
> Why would it throw?
E.g. in `CheerioCrawler`, because you can't parse a compressed response with Cheerio.
I was just referring to the fact that we need to decompress it somewhere, so it would be better to do it here rather than in the SDK. If it's impossible to do it here, then we can do it in the SDK, but I would prefer to do it here because semantically it belongs here: this client should behave as close to a browser as possible, so being able to handle invalid compression fits here best.
> It just occurred to me now that we could use Got handlers in order to fix streams and an `afterResponse` hook to fix promises. However, the end stream would still need patching to implement Got-specific stream properties.
Let's try that, if it's not too much of a hack. Not sure how crazy the patching will be :)
> E.g. in `CheerioCrawler`, because you can't parse a compressed response with Cheerio.
The response can't be compressed since we won't send `accept-encoding`, because the `decompress` option will be `false` when retrying. The server will simply send raw data, ready to consume.
> Let's try that, if it's not too much of a hack. Not sure how crazy the patching will be :)
I'll prepare that in a second PR, to have a nice comparison.
Ok, finally I get it with the header. I didn't know that `decompress: false` also turns off compression.
We have to be careful about this, because residential proxies are paid per GB of traffic so the bill for them would explode.
This works, but if an entire website performs compression incorrectly, then all requests to it would end up duplicated. I think we could do some caching in this regard (to know when to disable `decompress`)?
TODO: tests