apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

Wrong uniqueKey creation for Google Maps URL with "@" #1831

Closed metalwarrior665 closed 1 year ago

metalwarrior665 commented 1 year ago

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/core

Issue description

The Request constructor messes up the uniqueKey creation. It converts 'https://www.google.com/maps/search/restaurant/@39.102972537998426,39.54835927707177,4z' into 'https://39.102972537998426,39.54835927707177,4z'

The URL might not be strictly up to HTTP spec, but we should work around it. The same logic might exist on the platform as well, so we might need to fix it there too.

Code sample

const req = new Request({ url: 'https://www.google.com/maps/search/restaurant/@39.102972537998426,39.54835927707177,4z' })
console.dir(req)

/*
Request {
  id: undefined,
  url: 'https://www.google.com/maps/search/restaurant/@39.102972537998426,39.54835927707177,4z',
  loadedUrl: undefined,
  uniqueKey: 'https://39.102972537998426,39.54835927707177,4z',
  method: 'GET',
  payload: undefined,
  noRetry: false,
  retryCount: 0,
  errorMessages: [],
  headers: {},
  userData: [Getter/Setter],
  handledAt: undefined
}
*/
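
As a side note (not part of the original report), a possible workaround until the normalization is fixed is to pass an explicit uniqueKey, which the Request constructor should then use as-is instead of deriving one from the URL:

const url = 'https://www.google.com/maps/search/restaurant/@39.102972537998426,39.54835927707177,4z';
// Passing uniqueKey explicitly skips the URL normalization (workaround sketch).
const req = new Request({ url, uniqueKey: url });
console.log(req.uniqueKey); // the full Google Maps URL, '@' and all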

Package version

latest

Node.js version

16

Operating system

No response

Apify platform

I have tested this on the next release

No response

Other context

No response

B4nan commented 1 year ago

This comes from here:

https://github.com/apify/apify-shared-js/blob/master/packages/utilities/src/utilities.client.ts#L85

Not sure why we even have a custom URL parser instead of using new URL; I guess it's just old code we haven't refactored yet. Not sure if it makes more sense to try to fix the regexps or to reimplement it with new URL? cc @mtrunkat @mnmkng

edit: looks like using new URL (at least for the normalization) is trivial, working on a PR
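
For illustration only, a rough sketch of what a new URL based normalization might look like (hypothetical, not the actual apify-shared-js code, which also handles query parameters and other edge cases):

// Hypothetical sketch, not the real normalizeUrl from apify-shared-js.
function normalizeUrlSketch(url, keepFragment = false) {
    try {
        const parsed = new URL(url.trim());
        // WHATWG URL lowercases the protocol and host but leaves the path
        // untouched, so the '@' in a Google Maps path survives.
        if (!keepFragment) parsed.hash = '';
        return parsed.href;
    } catch {
        // Assumption: return null for unparsable input, like the old helper.
        return null;
    }
}

normalizeUrlSketch('https://www.google.com/maps/search/restaurant/@39.102972537998426,39.54835927707177,4z');
// => 'https://www.google.com/maps/search/restaurant/@39.102972537998426,39.54835927707177,4z'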

B4nan commented 1 year ago

There is one thing in the tests that won't work exactly the same as before if we use new URL.

it('should trim all parts of URL', () => {
    expect(normalizeUrl('    http    ://     test    # fragment   ')).toEqual('http://test');
    expect(normalizeUrl('    http   ://     test    # fragment   ', true)).toEqual('http://test#fragment');
});

Not sure why we had this in the tests; http :// test # fragment is not a valid URL, as the colon character is part of the protocol:

new URL('http    ://     test    # fragment'); // fails
new URL('http:    //     test    # fragment'); // works

The code is 5 years old, and I guess it is rather testing the old normalizeUrl implementation, so it should be safe to change that.

edit: hmm new URL('http: // test # fragment') works in the browser, but not in nodejs, at least not on node 18 :/

I would probably be in favor of removing this test as a whole. We validate URLs internally in crawlee in many places via new URL, so why should we care here about anything that does not pass that validation...

B4nan commented 1 year ago

PR here: https://github.com/apify/apify-shared-js/pull/365

I stripped all the spaces from the URL before parsing to get around this, I guess that should do it.
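
Roughly, the idea looks like this (a sketch of the approach described above, not the exact code from the PR):

// Strip all whitespace before handing the string to new URL (sketch only).
const input = '    http    ://     test    # fragment   ';
const parsed = new URL(input.replace(/\s+/g, ''));
parsed.hash = '';
console.log(parsed.href); // 'http://test/' (note the trailing slash new URL adds)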