apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0
15.35k stars, 658 forks

Enqueue strategy check after redirects is not working with adaptive crawler #2525

Open B4nan opened 4 months ago

B4nan commented 4 months ago

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/playwright (PlaywrightCrawler)

Issue description

Use enqueueLinks() without any parameters in the request handler on https://crawlee.dev/. At some point the crawler escapes the domain and starts scraping everything.

https://console.apify.com/actors/PFaajt3k6oOp1YRAU/runs/0SfY5Ocr1dgQjhSIS#log

Code sample

import { PlaywrightCrawler } from 'crawlee';
import { Actor } from 'apify';

await Actor.init();

const crawler = new PlaywrightCrawler({
    proxyConfiguration: await Actor.createProxyConfiguration(),
});
crawler.router.addDefaultHandler(async (ctx) => {
    // Parse the rendered page and pull out the title and the first-level heading.
    const $ = await ctx.parseWithCheerio();
    const title = $('html title').text();
    const h1 = $('body h1').text();
    const proxy = ctx.proxyInfo?.username;
    ctx.log.info(`processing ${ctx.request.url}`, { title, h1, proxy });
    await ctx.pushData({ url: ctx.request.url, title, h1 });
    // No options passed: the crawl is expected to stay on crawlee.dev, but it does not.
    await ctx.enqueueLinks();
});
await crawler.run(['https://crawlee.dev/']);
await Actor.exit();

Package version

3.10.3 beta

Node.js version

20

Operating system

No response

Apify platform

I have tested this on the next release

No response

Other context

No response

janbuchar commented 4 months ago

Thanks for the report! Are you aware if there is a page that redirects elsewhere somewhere in the crawlee docs, or is the actual enqueueStrategy check failing (and not the post-redirect check)?
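To make the distinction between the two checks concrete, here is a minimal sketch in plain JavaScript. It is not Crawlee's implementation. The helper names `filterSameHostname` and `staysOnHostname` are hypothetical, and it assumes the strategy in play is a same-hostname comparison, which is what the docs describe as the default for enqueueLinks().

```javascript
// Hypothetical helpers for illustration only, not Crawlee internals.

// Enqueue-time check: keep only the links whose hostname matches
// the hostname of the page they were found on.
function filterSameHostname(pageUrl, links) {
    const { hostname } = new URL(pageUrl);
    return links.filter((link) => {
        try {
            // Relative hrefs are resolved against the page URL first.
            return new URL(link, pageUrl).hostname === hostname;
        } catch {
            return false; // skip hrefs that cannot be parsed as URLs
        }
    });
}

// Post-redirect check: compare the URL that was requested
// with the URL that actually loaded after navigation.
function staysOnHostname(requestedUrl, loadedUrl) {
    return new URL(requestedUrl).hostname === new URL(loadedUrl).hostname;
}
```

If the first check is skipped, off-domain links land in the queue directly, with no redirect involved. If only the second check is skipped, off-domain pages appear only where an on-domain URL redirects elsewhere.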

B4nan commented 4 months ago

looking at the storage, it feels like it's not about redirects, we have the "Edit this page" links in there too

image

a few more links here, I don't think they come from a redirect either

image

B4nan commented 4 months ago

it almost feels like the adaptive enqueueLinks is not checking the strategies at all, maybe it's not about the post-redirect check at all
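One way to check this against the storage is to group the stored URLs by hostname and see which hosts show up besides crawlee.dev. The sketch below is only an illustration: `countByHostname` is a hypothetical helper, and the URLs in the usage note are made-up placeholders, not entries from the actual run.

```javascript
// Hypothetical helper: count how many stored URLs belong to each hostname.
function countByHostname(urls) {
    const counts = new Map();
    for (const url of urls) {
        const { hostname } = new URL(url);
        counts.set(hostname, (counts.get(hostname) ?? 0) + 1);
    }
    return counts;
}
```

For example, calling it with `['https://crawlee.dev/', 'https://crawlee.dev/docs', 'https://github.com/apify/crawlee']` returns a map with two entries for `crawlee.dev` and one for `github.com`. Any hostname other than the start URL's points at a link that slipped past the strategy check, or at a redirect.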