corford closed this issue 3 years ago
Regardless of the specific example above, it would be awesome in general to have the option (at the `Request` level rather than the Crawler level) to signal that you want a Buffer back from `handlePageFunction()`, even if the response is valid `text/html` and thus correctly parseable by htmlparser/`cheerio.load()`.
Sadly, the `Request` schema is closely tied to the Apify API, so any change to `Request` itself would take a long time, and a lot of people would want to have their say on whether or not it's a good idea to add a property like `responseBodyAsABuffer`. We might get to adding a `crawlerOptions` property or something like that in the end, but it will take time.
Now, that does not solve your problem, but I wonder: why not use `BasicCrawler`? The point of `CheerioCrawler` is to handle all that parsing/encoding logic and provide you with a ready-to-use `$` object in the `handlePageFunction`.
Since you must know which URLs are problematic beforehand (to be able to add that `responseBodyAsABuffer` to the `Request`), `BasicCrawler` should do just fine. You could save any metadata you need into `userData` and do your parsing accordingly.

Or am I missing something?
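For what it's worth, the `userData`-driven dispatch described above can be sketched without any Apify machinery. The `parseAsJSON` flag and the sample body below are assumptions for illustration only, not part of the Apify API:

```javascript
// Hypothetical helper: decide how to parse a raw response body based on
// request.userData. The `parseAsJSON` flag is an assumed convention here.
function parseBody(rawBody, userData = {}) {
  if (userData.parseAsJSON) {
    // Parse the untouched bytes ourselves; no cheerio/htmlparser involved.
    return JSON.parse(rawBody.toString('utf-8'));
  }
  // Otherwise return the string, which would normally go to cheerio.load().
  return rawBody.toString('utf-8');
}

// Example: a JSON body that a broken site serves as text/html.
const data = parseBody(Buffer.from('{"id": 1, "html": "<b>hi</b>"}'), {
  parseAsJSON: true,
});
console.log(data.id); // 1
```

Inside a `BasicCrawler`'s `handleRequestFunction`, something like this would run against the body fetched with `requestAsBrowser`.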
I worried that introducing something like `responseBodyAsBuffer` would not be a trivial addition :)
You're right, we could use `BasicCrawler`. We're currently using `CheerioCrawler` for this site since most of the resources we crawl from it are `text/html` and we make use of `$`. There is one particular endpoint we also scrape from the site (by adding it manually to the `RequestQueue`) that replies with broken JSON (as outlined above), and it was just nicer/smoother for us to stick with Cheerio and flag that particular request.

Rather than maintain a fork just for this, I guess in the short term we will rewrite to use `BasicCrawler`.
You can use `requestAsBrowser` to fetch the JSON easily in the `BasicCrawler`. In `CheerioCrawler` it uses streams, but you don't need to worry about that unless you need the extra memory optimization.
```js
const cheerio = require('cheerio');
// ...
const { body } = await Apify.utils.requestAsBrowser({
    url: 'endpoint/url',
    json: true,
});
const $ = cheerio.load(body.data);
```
I'll close this, since we won't be implementing the short-circuit behavior.
Yep, we saw the `json: true` option that gets handed down to Got, but because `CheerioCrawler` uses streams we couldn't go that way. We'll take your advice and re-code using `BasicCrawler` rather than `CheerioCrawler`.
Note that there is no rule that you can run only one crawler per script or anything like that, so feel free to run the "ok" requests with Cheerio and only the problematic ones with Basic.
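A minimal sketch of that split. The path list and the queue/crawler names in the comments are made-up placeholders, not Apify API:

```javascript
// Hypothetical predicate deciding which crawler a URL should go to.
const PROBLEM_PATHS = ['/api/listing']; // endpoints known to send broken JSON

function isProblematic(url) {
  return PROBLEM_PATHS.includes(new URL(url).pathname);
}

// In the real script, roughly:
//   if (isProblematic(request.url)) await basicQueue.addRequest(request);
//   else await cheerioQueue.addRequest(request);
// ...and then run both crawlers, e.g. one after the other:
//   await cheerioCrawler.run();
//   await basicCrawler.run();

console.log(isProblematic('https://example.com/api/listing')); // true
```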
@mnmkng I've just had a look again at BasicCrawler and I remember now why we didn't want to go for it.
With `CheerioCrawler`, you get a lot of useful plumbing for free (via the `_handleRequestFunction`, `_requestFunction` and `_getRequestOptions` functions).
For simple problems like a misconfigured site sending the wrong `Content-Type` header on some resources, it's a bit of a pain to introduce `BasicCrawler` as a workaround (doubly so in our case, because we have some light abstraction code over the top of `CheerioCrawler` and `PuppeteerCrawler`, which makes adding `BasicCrawler` into the mix quite a pain).
Would you entertain making `parseContentTypeFromResponse()` overridable (via the options to `CheerioCrawler`)? Basically, to give an escape hatch for small but irritating issues that would otherwise break `CheerioCrawler`'s default content-type parsing approach (a little like the `forceResponseEncoding` option that already exists).
To illustrate what I mean, I've made a PR here: https://github.com/apifytech/apify-js/pull/698
Using this approach solves our issue while also avoiding the need to alter the `Request` schema or bring `BasicCrawler` into the mix. Example:
```js
const { utils: { parseContentTypeFromResponse } } = Apify;

const crawler = new Apify.CheerioCrawler({
    // ...
    parseContentTypeFunction: (request, responseStream) => {
        if (request.userData.parseAsJSON) {
            return {
                type: 'application/json',
                charset: 'utf-8',
            };
        }
        return parseContentTypeFromResponse(responseStream);
    },
});
```
I'm sorry, but we're a bit too busy with other projects to quickly consider the pros and cons of adding a new function to the `CheerioCrawler` API. Personally, I find the configuration already quite complicated for the "average user", but I'd need insights from other members of the team as well. I'm not saying we won't do it in some form, just that we don't have the capacity to have that discussion right now.
How critical is this for you? Do you have a viable workaround?
I understand. It's critical enough that, for the short term, it's easier for us to use a fork of Apify with my PR than to introduce `BasicCrawler` to fix it.
I'll keep the issue open, and I think we might be able to get to this next month. At the beginning, `CheerioCrawler` was really a simple wrapper around `BasicCrawler`, but people keep requesting more and more functionality, so it might be time to rethink the architecture a bit.

Maybe by rewriting the whole flow to a middleware pattern, where one would be able to plug in or replace pieces of the logic. It's related to #635 as well.
From our point of view, `CheerioCrawler` works very well in situations where you need to scrape a relatively simple site that returns HTML and/or JSON. We always reach for it first instead of the bare-bones `BasicCrawler` (which we view more as a tool of last resort for non-standard/specialist crawls). Viewed like this, the "over-configurability" of `CheerioCrawler` is actually an advantage for us, as it lets us default/standardise on `CheerioCrawler` for a wide range of crawl tasks.
Also, the fact that `CheerioCrawler` has a somewhat similar API to `PuppeteerCrawler` is very helpful (i.e. they both have `handlePageFunction` and `handleFailedRequestFunction`; `CheerioCrawler`'s `prepareRequestFunction` gives you a hook point similar to `PuppeteerCrawler`'s `gotoFunction`, etc.). And in both cases most of the "low-level" request/response plumbing is already taken care of for us (which is less true of `BasicCrawler`). This has made it easy to write some of our own generic middleware-esque code that sits on top of the two crawlers.
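That shared surface is what makes such a thin abstraction workable; a rough sketch of the idea follows. The `buildCrawlerConfig` helper and the `needsBrowser` flag are our own inventions for illustration, not part of the Apify API:

```javascript
// Both crawler classes accept handlePageFunction and
// handleFailedRequestFunction, so common options can be built once and
// the crawler class chosen per crawl.
function buildCrawlerConfig({ needsBrowser, handlePageFunction }) {
  const common = {
    handlePageFunction,
    handleFailedRequestFunction: async ({ request }) => {
      console.error(`Request ${request.url} failed too many times.`);
    },
  };
  // Real code would instantiate Apify.PuppeteerCrawler or
  // Apify.CheerioCrawler with `common`; here we just name the choice.
  return {
    crawlerClass: needsBrowser ? 'PuppeteerCrawler' : 'CheerioCrawler',
    options: common,
  };
}

console.log(buildCrawlerConfig({ needsBrowser: false }).crawlerClass);
// → CheerioCrawler
```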
Literally our only major annoyance so far with `CheerioCrawler` is the limitation we're hitting now, where there's no quick way to help it parse correctly in the face of a slightly broken site.
@mnmkng Just bumping this issue again as we ran into another instance of this today with a badly configured site.
One of the endpoints we scrape from it returns JSON but wrongly signals this as `text/html` in the `Content-Type` response header. Sadly for us, the JSON also happens to contain a '<' in one of the field values, and this blows up the body parsing in Apify (it causes anything after the '<' to be chopped, thus invalidating the JSON, i.e. making it impossible to use `JSON.parse()` on the body).
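To make the failure mode concrete: the payload itself is perfectly valid JSON, and parsing the raw bytes directly works fine; it's only the HTML-parsing step that chops it. The sample payload below is made up:

```javascript
// A JSON payload whose field value contains '<'. If this string is run
// through an HTML parser, everything after '<' can be treated as a tag
// and dropped, invalidating the JSON. Parsing the raw bytes avoids that.
const rawBody = Buffer.from('{"title": "Deals < 50% off", "count": 3}');

const data = JSON.parse(rawBody.toString('utf-8'));
console.log(data.count); // 3
```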
Using our fork of Apify (that incorporates https://github.com/apify/apify-js/pull/698 ) fixes the issue.
One way or another, it would be great if there was a way to explicitly override how CheerioCrawler parses a particular response.
We have a big release coming for the end of October, I'll see if we can squeeze this into it.
Thanks @mnmkng, that would be amazing. Looking forward to seeing what's in the big new release!
Sadly, it's mostly a lot of internal changes 🙂
@corford We're a bit late on schedule, but would you be happy with a solution like the following?
```js
const cheerioCrawlerOptions = {
    // ...
    postResponseFunction: (ctx) => {
        if (ctx.request.userData.parseAsJSON) {
            ctx.response.headers['content-type'] = 'application/json; charset=utf-8';
        }
    },
};
```
If we're adding a function to the crawler interface, I'd like it to be more generic and useful than just a `parseContentTypeFunction`. Would that work for you?
@mnmkng A more generic solution sounds sensible, and I think it could work, depending on where you are thinking of placing this hook. Would it execute from within the crawler's existing `_requestFunction()`?
Yes, it would be the first thing to run after receiving the HTTP response, before any additional work such as parsing the content type, response encoding, etc.
Ok, that sounds perfect 👍 👍
**Describe the bug**
`CheerioCrawler`'s `_requestFunction` makes it very hard to work around broken sites that reply with `Content-Type: text/html` and a JSON payload that includes escaped HTML.

**To Reproduce**
Request a URL whose response headers include `Content-Type: text/html` and whose response body is a JSON payload that includes some escaped HTML. Example response payload:

**Desired behavior**
`handlePageFunction()` returns the `body` as a Buffer so we can apply our own content parsing logic (e.g. in the case above: `JSON.parse(ctx.body.toString())`). The desire to short-circuit `CheerioCrawler`'s automatic type parsing rules for a given URL could be signaled with a new `Request` option (e.g. `responseBodyAsBuffer: true`).

**Current behaviour**
`handlePageFunction()` returns the `body` as mangled HTML because the JSON has been wrongly parsed by htmlparser/`cheerio.load()`. It becomes hard to safely unmangle this later.

**System information:**

**Additional context**
Although I don't expect you to merge it "as is", I've created a PR showing how we currently work around this issue for a particular REST endpoint we scrape: https://github.com/apifytech/apify-js/pull/696