apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

Node internal uncaughtException #1172

Closed laster04 closed 3 years ago

laster04 commented 3 years ago

Describe the bug: When I try to read the body of a response, I get this error:

node:internal/process/promises:246
          triggerUncaughtException(err, true /* fromPromise */);
          ^

immediate._onImmediate: Protocol error (Network.getResponseBody): No resource with given identifier found
    at processImmediate (node:internal/timers:464:21)
    at process.callbackTrampoline (node:internal/async_hooks:130:17) {
  name: 'Error'
}
error Command failed with exit code 1.

To Reproduce

page.on('response', async (r) => {
    if (r.url().includes('hoteldetail/rooms/')) {
        const text = await r.text();
        log.info(r.url());
    }
});

Expected behavior: Get the response body as text.

System information:

szmarczak commented 3 years ago

Looks like a puppeteer or playwright bug. Can you please post full code to reproduce? The one you posted is too short.

pocesar commented 3 years ago

@laster04 Never throw inside event listeners: the emitter calls them synchronously and does not await the listener callbacks, so an error thrown in an async listener becomes an unhandled rejection. If you must propagate the error, use a deferred and reject its promise from inside the listener; either way, you need a try/catch there: https://github.com/pocesar/actor-facebook-scraper/blob/ac6229408cb96bd46b19daa2164b4b6fb6286c4d/src/page.ts#L339-L353
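
The deferred approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked file: `createDeferred`, `readRoomsBody`, `fakePage`, and `fakeResponse` are hypothetical names, and a plain Node.js `EventEmitter` stands in for the Playwright page so the snippet runs on its own.

```javascript
const { EventEmitter } = require('node:events');

// Minimal deferred: a promise plus its resolve/reject handles.
function createDeferred() {
    let resolve;
    let reject;
    const promise = new Promise((res, rej) => {
        resolve = res;
        reject = rej;
    });
    return { promise, resolve, reject };
}

// Hypothetical stand-in for a page; real code would pass the crawler's `page`.
const fakePage = new EventEmitter();

// Hypothetical stand-in for a response whose body can no longer be read.
const fakeResponse = {
    url: () => 'https://example.com/hoteldetail/rooms/1',
    text: async () => { throw new Error('No resource with given identifier found'); },
};

// Registers the listener and returns a promise the caller can await.
function readRoomsBody(page) {
    const deferred = createDeferred();
    page.on('response', async (response) => {
        if (!response.url().includes('hoteldetail/rooms/')) return;
        try {
            deferred.resolve(await response.text());
        } catch (err) {
            // Reject the deferred instead of throwing out of the listener.
            deferred.reject(err);
        }
    });
    return deferred.promise;
}

async function demo() {
    const bodyPromise = readRoomsBody(fakePage);
    fakePage.emit('response', fakeResponse);
    try {
        return await bodyPromise;
    } catch (err) {
        // The failure now arrives where the caller can handle it.
        return `handled: ${err.message}`;
    }
}
```

Because the rejection travels through the deferred, the caller's own try/catch decides what to do with it, and nothing escapes as an unhandled rejection.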

laster04 commented 3 years ago

Looks like a puppeteer or playwright bug. Can you please post full code to reproduce? The one you posted is too short.

Here is more code:

    const requestQueue = await Apify.openRequestQueue();
    const proxyConfiguration = await Apify.createProxyConfiguration({
        groups: ['RESIDENTIAL'],
        countryCode: 'CN',
    });

    /** @type {Apify.Session} */
    let stickySession;

    await requestQueue.addRequest({
        url: 'https://m.ctrip.com/webapp/hotel/hoteldetail/65822792.html?days=3&atime=202109-19&contrl=0&num=undefined&biz=undefined',
    });

    // Create route
    const router = createRouter({ storeId });
    const crawler = new Apify.PlaywrightCrawler({
        requestQueue,
        proxyConfiguration,
        useSessionPool: true,
        maxRequestRetries: 15,
        sessionPoolOptions: {
            maxPoolSize: 1,
            createSessionFunction: async (sessionPool) => {
                stickySession = stickySession || new Apify.Session({ sessionPool });

                return stickySession;
            },
        },
        launchContext: {
            useChrome: true,
            launchOptions: {
                headless: true,
            },
        },
        handlePageTimeoutSecs: 360,
        browserPoolOptions: {
            maxOpenPagesPerBrowser: 0, // required to use one IP per tab
            preLaunchHooks: [async (pageId, launchContext) => {
                launchContext.launchOptions = {
                    ...launchContext.launchOptions,
                    viewport: {
                        height: 480,
                        width: 320,
                    },
                    // eslint-disable-next-line max-len
                    userAgent: 'Mozilla/5.0 (Linux; U; Android 3.2; nl-nl; GT-P6800 Build/HTJ85B) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13',
                    bypassCSP: true,
                    ignoreHTTPSErrors: true,
                    hasTouch: true,
                    isMobile: true,
                    deviceScaleFactor: 1,
                };
            }],
        },
        autoscaledPoolOptions: {
            desiredConcurrency: 1,
            maxConcurrency: 1,
        },
        preNavigationHooks: [async (context, gotoOptions) => {
            gotoOptions.waitUntil = 'domcontentloaded';
        }],
        persistCookiesPerSession: true,
        handlePageFunction: async (context) => {
            const { page, request } = context;
            if (page.url().includes('account')) {
                throw new Error('Redirected to login');
            }
            let responseRoomsBody = '';
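            // page.on() registers the listener synchronously, so it needs no await
            page.on('response', async (response) => {
                if (response.url().includes('hoteldetail/rooms/')) {
                    try {
                        responseRoomsBody = await response.text();
                    } catch (e) {
                        // Swallowed on purpose: rethrowing here would surface as an uncaught exception
                        // throw e;
                    }
                }
            });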

            await page.waitForFunction(() => {
                return !!(window?.__HOTEL_PAGE_DATA__?.roomlistinfo);
            }, {}, { polling: 'raf', timeout: 30000 }); // with Chinese residential proxy it is very slow
            log.info(`URL Opened: ${request.url}`);
            await router(request.userData.label, context);
        },
    });

    await crawler.run();

szmarczak commented 3 years ago

Does it still crash with the try/catch you have added?
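
One way to check the effect of such a try/catch in isolation is to wrap the async listener so that every rejection is routed to an error handler. The sketch below is an assumption-laden illustration: `safeListener` is a hypothetical helper, not part of the SDK, Puppeteer, or Playwright, and a plain Node.js `EventEmitter` with stub response objects replaces the real page.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical helper: wraps an async listener so a rejection is routed to
// `onError` instead of becoming an unhandled promise rejection.
function safeListener(listener, onError) {
    return (...args) => {
        Promise.resolve()
            .then(() => listener(...args))
            .catch(onError);
    };
}

const emitter = new EventEmitter();
const errors = [];
const bodies = [];

emitter.on('response', safeListener(
    async (response) => {
        bodies.push(await response.text());
    },
    (err) => errors.push(err.message),
));

// One stub response that can be read and one that fails like the reported error.
emitter.emit('response', { text: async () => 'ok' });
emitter.emit('response', {
    text: async () => { throw new Error('No resource with given identifier found'); },
});

// Resolves after the promise work queued by both listeners has settled.
const settled = new Promise((resolve) => setImmediate(resolve));
```

With the wrapper in place, the failing response ends up in `errors` and the process keeps running, which is the behavior the try/catch in the listener is meant to give.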

mnmkng commented 3 years ago

Closing since this was most likely unrelated to the SDK.