apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0
15.42k stars 662 forks

[BUG] session.setCookie (tough-cookie): URL incorrectly interpreted as Regex #2724

Open · firecrauter opened this issue 3 days ago

firecrauter commented 3 days ago

Which package is this bug report for? If unsure which one to select, leave blank

None

Issue description

  1. Create a project from the TypeScript CheerioCrawler template.
  2. Add URLs containing regex special characters (e.g. `+`) to startUrls.
  3. Add a preNavigationHooks hook that sets a cookie for that URL.
  4. npm install
  5. npm start

output:

...
WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. SyntaxError: Invalid regular expression: /^/Antonov++Andrii/: Nothing to repeat
    at new RegExp (<anonymous>)
    at pathMatch (D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\pathMatch.js:35:13)
    at matchRFC (D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\memstore.js:68:51)
    at D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\memstore.js:87:13
    at Array.forEach (<anonymous>)
    at MemoryCookieStore.findCookies (D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\memstore.js:82:17)
    at CookieJar.getCookies (D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\cookie\cookieJar.js:536:15)
    at CookieJar.getCookieString (D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\cookie\cookieJar.js:597:14)
    at CookieJar.callSync (D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\cookie\cookieJar.js:168:16)
    at CookieJar.getCookieStringSync (D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\cookie\cookieJar.js:610:22) {"id":"ZTnkJJu5aEw0Obe","url":"https://www.google.com/Antonov++Andrii/","retryCount":3}
INFO  CheerioCrawler: Error analysis: {"totalErrors":3,"uniqueErrors":1,"mostCommonErrors":["3x: Invalid regular expression: _ Nothing to repeat (<anonymous>)"]}
INFO  CheerioCrawler: Finished! Total 4 requests: 1 succeeded, 3 failed. {"terminal":true}

Code sample

// For more information, see https://crawlee.dev/
import { CheerioCrawler } from 'crawlee';

// Example of URLs containing regex special characters (even though they return a 404):
const startUrls = [
    "https://www.appbrain.com/dev/Cibus+%7C+Pluxee/",
    "https://www.appbrain.com/dev/Y+C++S+T+U+D+I+O/",
    "https://www.appbrain.com/dev/Antonov++Andrii/",
    'https://www.appbrain.com/dev/Mobile+Dialer+%28+HelloBDTel+-Ten+Card+Company+%29'
];

const crawler = new CheerioCrawler({
    // proxyConfiguration: new ProxyConfiguration({ proxyUrls: ['...'] }),
    requestHandler: async ({ request, $, log }) => {
        const title = $('title').text();
        log.info(`${title}`, { url: request.loadedUrl });
    },
    errorHandler: async ({ request }, _: Error) => {
        // console.log(request.url);
    },
    // Comment this option to scrape the full website.
    maxRequestsPerCrawl: 4,
    persistCookiesPerSession: false,
    preNavigationHooks: [
        (crawlingContext, _) => {
            // ...
            try {
                const { session, request } = crawlingContext;
                if (session) {
                    const cookieString = 'adlt=1;';

                    const urlWithoutPath = new URL(request.url);
                    urlWithoutPath.pathname = '/'; // Reset the path to just "/"
                    const targetUrl = urlWithoutPath.toString();

                    session.setCookie(cookieString, targetUrl);
                }
            } catch (error) {
                // errors ignored for this reproduction
            }
        },
    ],
});

await crawler.run(startUrls);

Package version

3.11.4, 3.11.5

Node.js version

22.6.0

Operating system

Windows 11 and Ubuntu 24.04

Apify platform

I have tested this on the next release

No response

Other context

No response

firecrauter commented 1 day ago

The error occurs even though I pass session.setCookie a URL without a path (just the domain, as the tough-cookie documentation specifies):

 if (session) {
 const urlWithoutPath = new URL(request.url);
 urlWithoutPath.pathname = '/';
 const targetUrl = urlWithoutPath.toString();

 const cookieString = 'adlt=1;';
 session.setCookie(cookieString, targetUrl);
}

However, at https://github.com/apify/crawlee/blob/99aa278f45141ad99f88abd7845584e0e7b60a87/packages/core/src/cookie_utils.ts#L134 the full request URL (including the path) is passed to getCookieStringSync.
If I modify these lines with this, it fixes the issue:

    const urlWithoutPath = new URL(url);
    urlWithoutPath.pathname = '/'; // Reset the path to just "/"
    return jar.getCookieStringSync(urlWithoutPath.toString());

As far as I understand so far, when querying the CookieJar, the URL's domain without the path should be passed.
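A minimal sketch of that normalization, using only Node's built-in URL class (the jar call itself stays as in the patched line above):

```typescript
// Proposed normalization: reduce the URL to its origin so tough-cookie
// never compiles the regex-unsafe path. One of the failing URLs from
// startUrls is used here as the example input.
const requestUrl = 'https://www.appbrain.com/dev/Antonov++Andrii/';
const urlWithoutPath = new URL(requestUrl);
urlWithoutPath.pathname = '/'; // reset the path to just "/"
console.log(urlWithoutPath.toString()); // "https://www.appbrain.com/"
```

One trade-off worth noting: querying the jar with the origin only means path-scoped cookies are no longer filtered by path, so escaping the path instead might be a more complete fix upstream in tough-cookie.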