apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

RequestError: unknown compression method #1248

Closed: metalwarrior665 closed this issue 2 years ago

metalwarrior665 commented 2 years ago

Describe the bug

This website throws RequestError: unknown compression method every time. Not sure whether this should be fixed in the libraries or whether it is just an edge case. It works with Puppeteer.

To Reproduce

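const Apify = require('apify');

const { log } = Apify.utils;

// Apify.main() provides the async context needed for the awaits below.
Apify.main(async () => {
    const startUrls = [{ url: 'https://www.tripplite.com/products/display-mount-accessories-dual-bar-mounts~56-203' }];
    const requestList = await Apify.openRequestList('start-urls', startUrls);

    const crawler = new Apify.CheerioCrawler({
        requestList,
        // The RequestError is thrown before this handler is ever reached.
        handlePageFunction: async (context) => {
            const { url, userData: { label } } = context.request;
            log.info('Page opened.', { label, url });
        },
    });

    log.info('Starting the crawl.');
    await crawler.run();
});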

Expected behavior

A clear and concise description of what you expected to happen.

System information: {"apifyVersion":"2.1.0","apifyClientVersion":"2.0.2","osType":"Linux","nodeVersion":"v16.13.0"}

Additional context

Sample run: https://console.apify.com/view/runs/le9zcQFmeteY00dnR

B4nan commented 2 years ago

cc @szmarczak

szmarczak commented 2 years ago

require('axios')('https://www.tripplite.com/products/display-mount-accessories-dual-bar-mounts~56-203', { decompress: true, headers: { 'accept-encoding': 'deflate' } }).then(x => x.headers).then(console.log)

I tried axios and the same happens. Not sure how browsers decode this.

szmarczak commented 2 years ago

OK, got it: the response is missing the zlib headers. inflateRaw works like a charm, but this almost certainly (99%) won't be supported by Got. I guess we could retry with decompress: false?
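
To make the missing-header point concrete, here is a minimal sketch using only Node's built-in zlib module. It simulates a header-less deflate body with deflateRawSync rather than fetching the site, and inflateWithFallback is a hypothetical helper name, not something Got or got-scraping exposes. The exact error message raised by the first attempt depends on the leading bytes of the stream, so the sketch does not rely on it.

```javascript
const zlib = require('zlib');

// Hypothetical helper: try the zlib-wrapped format first (what
// "Content-Encoding: deflate" is supposed to mean), then fall back to raw
// deflate for servers that omit the zlib header.
function inflateWithFallback(buffer) {
    try {
        return zlib.inflateSync(buffer);
    } catch (error) {
        // The header check failed, so assume a raw deflate stream.
        return zlib.inflateRawSync(buffer);
    }
}

// Simulate a response body that was compressed without the zlib header.
const rawBody = zlib.deflateRawSync(Buffer.from('<html>hello</html>'));
console.log(inflateWithFallback(rawBody).toString());
```

A properly wrapped body (for example one produced by zlib.deflateSync) goes through the first branch unchanged, so the fallback only matters for the broken case.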

mnmkng commented 2 years ago

Can we somehow integrate this retry into got-scraping? These decompression errors are not frequent, but they do keep recurring. E.g. https://github.com/apify/apify-js/issues/373 https://github.com/apify/apify-js/issues/266
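
A rough sketch of what such a retry could look like, not the actual got-scraping change. It assumes a hypothetical request function that takes an options object with a decompress flag and resolves to { headers, body } with body as a Buffer. That shape is loosely modelled on Got but is an assumption here, as is the name requestWithDeflateFallback.

```javascript
const zlib = require('zlib');

// `request` is a hypothetical stand-in for the HTTP client call:
// (options) => Promise<{ headers, body: Buffer }>.
async function requestWithDeflateFallback(request, options) {
    try {
        return await request({ ...options, decompress: true });
    } catch (firstError) {
        // A real implementation would first check that firstError is a
        // decompression failure; this sketch retries on any error.
        const response = await request({ ...options, decompress: false });
        const encoding = String(response.headers['content-encoding'] || '').toLowerCase();
        // Only header-less deflate is handled; anything else rethrows the
        // original error.
        if (encoding !== 'deflate') throw firstError;
        return { ...response, body: zlib.inflateRawSync(response.body) };
    }
}
```

The trade-off is one extra request per failing URL, which seems acceptable if these errors are as rare as described.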

szmarczak commented 2 years ago

https://github.com/apify/got-scraping/pull/63

B4nan commented 2 years ago

Should be addressed in the latest got-scraping via https://github.com/apify/got-scraping/pull/64