apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

CheerioCrawler option to override utils.parseContentTypeFromResponse() #695

Closed: corford closed this issue 3 years ago

corford commented 4 years ago

Describe the bug: CheerioCrawler's _requestFunction makes it very hard to work around broken sites that reply with Content-Type: text/html and a JSON payload that includes escaped HTML.

To Reproduce: Request a URL whose response headers include Content-Type: text/html and whose response body is a JSON payload that includes some escaped HTML. Example response payload:

{
    "status":200,
    "errorMessage":null,
    "callID":1,
    "service":"getProducts",
    "data":{
        "paginationHTML":"<!-- indexer::stop -->\n<div class=\"col-md-12 pagination-container\">\n<div class=\"color_wrapper\">\n<ul class=\"pagination\">\n<li><span class=\"current\">1<\/span><\/li><\/ul>\n<\/div>\n<\/div>"
    }
}

Desired behavior: handlePageFunction() returns the body as a Buffer so we can apply our own content parsing logic (e.g. in the case above: JSON.parse(ctx.body.toString())). The desire to short-circuit CheerioCrawler's automatic type parsing rules for a given URL could be signaled with a new Request option (e.g. responseBodyAsBuffer: true).
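
For illustration, a minimal sketch of how such a hypothetical responseBodyAsBuffer flag might be used against the example payload above; the flag is not part of the current Request schema and the endpoint URL is invented:

const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();

    // Hypothetical request-level flag; not an existing option.
    await requestQueue.addRequest({
        url: 'https://broken-site.example/getProducts', // invented URL
        responseBodyAsBuffer: true,
    });

    const crawler = new Apify.CheerioCrawler({
        requestQueue,
        handlePageFunction: async ({ request, body }) => {
            if (request.responseBodyAsBuffer) {
                // The body would arrive as an untouched Buffer, so the JSON
                // served as text/html could be parsed manually:
                const json = JSON.parse(body.toString());
                const paginationHtml = json.data.paginationHTML;
                // ... handle the escaped HTML fragment ourselves
            }
            // Otherwise: normal CheerioCrawler behaviour, $ ready to use.
        },
    });

    await crawler.run();
});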

Current behavior: handlePageFunction() returns the body as mangled HTML, because the JSON has been wrongly parsed by htmlparser/cheerio.load(). It becomes hard to safely unmangle this later.


Additional context: Although I don't expect you to merge it "as is", I've created a PR showing how we currently work around this issue for a particular REST endpoint we scrape: https://github.com/apifytech/apify-js/pull/696

corford commented 4 years ago

Regardless of the specific example above, it would be awesome in general to have the option (at the Request level rather than the Crawler level) to signal that you want a Buffer back from handlePageFunction(), even if the response is valid text/html and thus correctly parseable by htmlparser/cheerio.load().

mnmkng commented 4 years ago

Sadly, the Request schema is closely tied with the Apify API and therefore any change to the Request itself would take a long time and a lot of people would want to have their say in whether it's a good idea or not to add a property like responseBodyAsABuffer. We might get to adding a crawlerOptions property or something like that in the end, but it will take time.

Now, that does not solve your problem, but I wonder: why not use BasicCrawler? The point of CheerioCrawler is to handle all that parsing/encoding logic and provide you with a ready-to-use $ object in the handlePageFunction.

Since you must know which URLs are problematic beforehand (to be able to add that responseBodyAsABuffer to Request), BasicCrawler should do just fine. You could save any metadata you need into userData and do your parsing accordingly.

Or am I missing something?
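
For reference, a rough sketch (endpoint URL and userData shape invented) of what that BasicCrawler workaround could look like, using Apify.utils.requestAsBrowser to fetch and parse the broken endpoint:

const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({
        url: 'https://broken-site.example/getProducts', // invented URL
        userData: { parseAsJSON: true },                // invented metadata
    });

    const crawler = new Apify.BasicCrawler({
        requestQueue,
        handleRequestFunction: async ({ request }) => {
            const { body } = await Apify.utils.requestAsBrowser({
                url: request.url,
                json: true, // parse the response body as JSON
            });
            if (request.userData.parseAsJSON) {
                // With the example payload above, the escaped HTML lives here:
                const paginationHtml = body.data.paginationHTML;
                // ... do our own parsing
            }
        },
    });

    await crawler.run();
});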

corford commented 4 years ago

I suspected that introducing something like responseBodyAsBuffer would not be a trivial addition :)

You're right that we could use BasicCrawler. We're currently using CheerioCrawler for this site, since most of the resources we crawl from it are text/html and we make use of $. There is one particular endpoint we also scrape from the site (by adding it manually to the RequestQueue) that replies with broken JSON (as outlined above), and it was just nicer/smoother for us to stick with Cheerio and flag that particular request.

Rather than maintain a fork just for this, I guess in the short term we will re-write to use BasicCrawler.

mnmkng commented 4 years ago

You can use requestAsBrowser to fetch the JSON easily in BasicCrawler. CheerioCrawler uses it in stream mode, but you don't need to worry about that unless you need the extra memory optimization.

const Apify = require('apify');
const cheerio = require('cheerio');

// ...

const { body } = await Apify.utils.requestAsBrowser({
    url: 'endpoint/url',
    json: true, // have the response body parsed as JSON
});

// With the example payload above, the HTML fragment lives in body.data.paginationHTML
const $ = cheerio.load(body.data.paginationHTML);

I'll close this, since we won't be implementing the short-circuit behavior.

corford commented 4 years ago

Yep, we saw the json: true option that gets handed down to Got, but because CheerioCrawler uses streams we couldn't go that way. We'll take your advice and re-code using BasicCrawler rather than CheerioCrawler.

mnmkng commented 4 years ago

Note that there is no rule that you can run only one crawler per script or anything like that, so feel free to run the "ok" requests with Cheerio and only the problematic ones with Basic.
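
A condensed sketch of that setup (queue names invented): the same script runs CheerioCrawler over the normal pages and BasicCrawler over the problematic endpoint.

const Apify = require('apify');

Apify.main(async () => {
    const htmlQueue = await Apify.openRequestQueue('html-pages');      // invented name
    const jsonQueue = await Apify.openRequestQueue('json-endpoints');  // invented name

    const cheerioCrawler = new Apify.CheerioCrawler({
        requestQueue: htmlQueue,
        handlePageFunction: async ({ $ }) => { /* normal text/html pages */ },
    });

    const basicCrawler = new Apify.BasicCrawler({
        requestQueue: jsonQueue,
        handleRequestFunction: async ({ request }) => { /* fetch and JSON.parse the broken endpoint */ },
    });

    // Nothing stops both crawlers from running in the same script:
    await cheerioCrawler.run();
    await basicCrawler.run();
});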

corford commented 4 years ago

@mnmkng I've just had a look again at BasicCrawler and I remember now why we didn't want to go for it.

With CheerioCrawler, you get a lot of useful plumbing for free (via the _handleRequestFunction, _requestFunction and _getRequestOptions functions).

For simple problems like a misconfigured site sending the wrong content-type header on some resources, it's a bit of a pain to introduce BasicCrawler as a workaround (doubly so in our case, because we have some light abstraction code on top of CheerioCrawler and PuppeteerCrawler, which makes adding BasicCrawler into the mix quite awkward).

Would you entertain making parseContentTypeFromResponse() overridable (via the opts to CheerioCrawler)? Basically to give an escape hatch for small but irritating issues that ordinarily break CheerioCrawler's default content type parsing approach (a little like the forceResponseEncoding option that already exists).

To illustrate what I mean, I've made a PR here: https://github.com/apifytech/apify-js/pull/698

This approach solves our issue while avoiding the need to alter the Request schema or bring BasicCrawler into the mix. Example:

const { utils: { parseContentTypeFromResponse } } = Apify;
const crawler = new Apify.CheerioCrawler({
    // ...
    parseContentTypeFunction: (request, responseStream) => {
        if (request.userData.parseAsJSON) {
            return {
                type: 'application/json',
                charset: 'utf-8',
            };
        }

        return parseContentTypeFromResponse(responseStream);
    },
});

mnmkng commented 4 years ago

I'm sorry but we're a bit too busy with other projects to quickly consider the pros and cons of adding a new function to CheerioCrawler API. Personally, I find the configuration already quite complicated for the "average user", but I'd need insights from other members of the team as well. I'm not saying we won't do it in some format, just that we don't have the capacity to have that discussion right now.

How critical is this for you? Do you have a viable workaround?

corford commented 4 years ago

I understand. It's critical enough that, for the short term, it's easier for us to use a fork of Apify with my PR than to introduce BasicCrawler to fix it.

mnmkng commented 4 years ago

I'll keep the issue open and I think we might be able to get to this next month. At the beginning, CheerioCrawler was really a simple wrapper of BasicCrawler, but people keep requesting more and more functionality, so it might be time to rethink the architecture a bit.

Maybe by rewriting the whole flow into a middleware pattern, where one would be able to plug in or replace pieces of the logic. It's related to #635 as well.
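
Purely as an illustration (not the actual apify/crawlee internals), a middleware-style flow might look roughly like this, with each stage a small replaceable function:

// Illustrative stubs only; each stage of the request lifecycle becomes a
// pluggable async function operating on a shared context object.
const makeHttpRequest  = async (ctx) => { /* fetch, set ctx.response */ };
const parseContentType = async (ctx) => { /* set ctx.contentType */ };
const parseBody        = async (ctx) => { /* decode + cheerio.load() for HTML */ };
const handlePage       = async (ctx) => { /* user handlePageFunction */ };

async function runPipeline(ctx, middlewares) {
    for (const middleware of middlewares) {
        await middleware(ctx); // each step reads and augments the shared context
    }
}

// Overriding one piece (e.g. content-type detection) would just mean swapping
// the corresponding entry in the array:
runPipeline({}, [makeHttpRequest, parseContentType, parseBody, handlePage]);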

corford commented 4 years ago

From our point of view, CheerioCrawler works very well in situations where you need to scrape a relatively simple site that returns HTML and/or JSON. We always reach for it first instead of the bare-bones BasicCrawler (which we view more as a tool of last resort for non-standard/specialist crawls). Viewed like this, the "over-configurability" of CheerioCrawler is actually an advantage for us, as it lets us default to/standardise on CheerioCrawler for a wide range of crawl tasks.

Also, the fact that CheerioCrawler has a somewhat similar API to PuppeteerCrawler is very helpful (i.e. they both have handlePageFunction and handleFailedRequestFunction; CheerioCrawler's prepareRequestFunction gives you a similar hook point to PuppeteerCrawler's gotoFunction, etc.). And in both cases most of the "low level" request/response plumbing is already taken care of for us (which is less true for BasicCrawler). This has made it easy to write some of our own generic middleware-esque code that sits on top of the two crawlers.

Literally our only major annoyance so far with CheerioCrawler is the limitation we're hitting now, where there's no quick way to help it parse correctly in the face of a slightly broken site.

corford commented 4 years ago

@mnmkng Just bumping this issue again as we ran into another instance of this today with a badly configured site.

One of the endpoints we scrape from it returns JSON but wrongly signals this as text/html in the Content-Type response header. Sadly for us, the JSON also happens to contain a '<' in one of the field values, and this blows up the body parsing in Apify (anything after the '<' gets chopped, invalidating the JSON, i.e. making it impossible to use JSON.parse() on the body).
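
A contrived illustration (payload invented) of that failure mode: once the body is run through an HTML parser, a stray '<' inside a JSON string can be taken as the start of a tag, so the round-tripped text is no longer valid JSON.

const cheerio = require('cheerio');

// Invented payload: valid JSON, but served by the broken site as text/html.
const rawBody = '{"status":200,"data":{"note":"use bracket <small ones only"}}';

JSON.parse(rawBody); // fine on the raw body

// Treated as HTML, '<small ...' looks like an unfinished tag, so (depending on
// the parser) everything from the '<' onward can be dropped:
const $ = cheerio.load(rawBody);
const roundTripped = $('body').text();
// JSON.parse(roundTripped); // would throw: the text is no longer valid JSON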

Using our fork of Apify (that incorporates https://github.com/apify/apify-js/pull/698 ) fixes the issue.

One way or another, it would be great if there was a way to explicitly override how CheerioCrawler parses a particular response.

mnmkng commented 4 years ago

We have a big release coming at the end of October; I'll see if we can squeeze this into it.

corford commented 4 years ago

Thanks @mnmkng, that would be amazing. Looking forward to seeing what's in the big new release!

mnmkng commented 4 years ago

Sadly, it's mostly a lot of internal changes 🙂

mnmkng commented 4 years ago

@corford We're a bit behind schedule, but would you be happy with a solution like the following?

const cheerioCrawlerOptions = {
    // ...
    postResponseFunction: (ctx) => {
        if (ctx.request.userData.parseAsJSON) {
            ctx.response.headers['content-type'] = 'application/json; charset=utf-8';
        }
    },
};

If we're adding a function to the crawler interface, I'd like it to be more generic and useful than just parseContentTypeFunction. Would that work for you?

corford commented 4 years ago

@mnmkng A more generic solution sounds sensible, and I think it could work, depending on where you're thinking of placing this hook.

Would it execute from within the crawler's existing _requestFunction()?

mnmkng commented 4 years ago

Yes, it would be the first thing to run after receiving the HTTP response, before any additional work such as parsing the content type, handling the response encoding, etc.
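
To make the ordering concrete, a rough sketch of where the proposed hook would sit (not the actual CheerioCrawler source; downloadResponse is a placeholder):

const Apify = require('apify');
const { utils: { parseContentTypeFromResponse } } = Apify;

// Placeholder for whatever performs the HTTP request inside the crawler.
const downloadResponse = async (request) => { /* ... returns the response stream */ };

async function requestFlow(crawlerOptions, request) {
    const response = await downloadResponse(request);

    // The proposed postResponseFunction would run here, before the body is
    // interpreted at all, so it can e.g. rewrite a wrong content-type header.
    if (crawlerOptions.postResponseFunction) {
        await crawlerOptions.postResponseFunction({ request, response });
    }

    // Only afterwards: content-type parsing, encoding handling, cheerio.load(), ...
    const { type, charset } = parseContentTypeFromResponse(response);
    // ...
}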

corford commented 4 years ago

Ok, that sounds perfect 👍 👍

mnmkng commented 3 years ago

#828