apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

utils.downloadListOfUrls picks up garbage URLs for default spreadsheet format #639

Closed. metalwarrior665 closed this issue 2 years ago.

metalwarrior665 commented 4 years ago

Reproduce with:

const Apify = require('apify');

Apify.main(async () => {
    // URL copied from the Share button (an HTML page, not a plain list of URLs)
    const wrongUrl = 'https://docs.google.com/spreadsheets/d/11UGSBOSXy5Ov2WEP9nr4kSIxQJmH18zh-5onKtBsovU/edit?usp=sharing';
    // CSV export endpoint of the same spreadsheet
    const correctUrl = 'https://docs.google.com/spreadsheets/d/11UGSBOSXy5Ov2WEP9nr4kSIxQJmH18zh-5onKtBsovU/gviz/tq?tqx=out:csv';
    const wrongData = await Apify.utils.downloadListOfUrls({ url: wrongUrl });
    const correctData = await Apify.utils.downloadListOfUrls({ url: correctUrl });
    console.dir(wrongData);
    console.dir(correctData);
});

In the wrong case, it prints a lot of internal Google URLs. Note that the wrong URL is exactly what you get when you click the Share button in your spreadsheet.

I think we could probably just convert the URL without touching the parsing.

metalwarrior665 commented 4 years ago

Something like this would fix it:

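const originalUrl = 'https://docs.google.com/spreadsheets/d/11UGSBOSXy5Ov2WEP9nr4kSIxQJmH18zh-5onKtBsovU/edit?usp=sharing';
let url = originalUrl;
// Capture everything up to and including the document ID; the /edit suffix is dropped.
const match = originalUrl.match(/^(https:\/\/docs\.google\.com\/spreadsheets\/d\/(?:\w|-)+)\/edit/);
if (match) {
    // Point to the CSV export endpoint instead of the HTML editor page.
    url = `${match[1]}/gviz/tq?tqx=out:csv`;
}

mnmkng commented 4 years ago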

You can pass a custom RegExp into the function. I don't feel like hardcoding Google Docs specific overrides into the function. Am I missing something?
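
To illustrate what a custom RegExp changes, here is a minimal, self-contained sketch. It does not call the library: `extractUrls`, `GENERIC_URL_REGEX`, and `EXAMPLE_ONLY_REGEX` are hypothetical names, and the generic pattern is a simplified stand-in, not the library's actual default. It only shows that a broad URL pattern applied to an HTML page picks up every link, while a narrower pattern keeps only the intended ones.

```javascript
// Hypothetical stand-in for the extraction step: apply a global regex
// to the downloaded text and return all matches (or an empty array).
function extractUrls(text, urlRegExp) {
    return text.match(urlRegExp) || [];
}

// Simplified generic pattern (an assumption, not the library's default).
const GENERIC_URL_REGEX = /https?:\/\/[^\s"'<>,]+/gi;

// Custom pattern that only accepts URLs hosted on example.com.
const EXAMPLE_ONLY_REGEX = /https?:\/\/(?:www\.)?example\.com\/[^\s"'<>,]*/gi;

// A tiny HTML page mixing an internal asset URL with the URLs we want.
const html = [
    '<script src="https://ssl.gstatic.com/docs/common/app.js"></script>',
    '<a href="https://example.com/products/1">one</a>',
    '<a href="https://example.com/products/2">two</a>',
].join('\n');

const genericUrls = extractUrls(html, GENERIC_URL_REGEX);
const filteredUrls = extractUrls(html, EXAMPLE_ONLY_REGEX);

console.log(genericUrls.length, filteredUrls.length);
```

With the SDK itself, the corresponding step would presumably be passing the pattern through the `urlRegExp` option of `Apify.utils.downloadListOfUrls`, with the global and case-insensitive flags set.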

metalwarrior665 commented 4 years ago

Well, in Scrapers and other generic actors that use this internally, the user can pass anything there (usually into the Start URLs input schema component), so creating a custom regex doesn't make sense.

I will keep this issue open and see whether more people run into the same problem. If they do, we should at least improve the description/warning for the Start URLs file upload.

mnmkng commented 4 years ago

Oh, so the trouble is actually with the automatic parsing in RequestList. Yeah, well, that would deserve some update.
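
One possible shape for such an update, as a rough sketch only: rewrite Google Sheets share links in the sources before they reach the URL parsing step. `toCsvExportUrl` and `normalizeSources` are hypothetical helpers, not part of the library; the only thing taken from RequestList is the `requestsFromUrl` property on a source object, and the regex is a lightly simplified form of the one proposed earlier in this thread.

```javascript
// Matches the "Share" link of a Google Sheet and captures the URL up to the document ID.
const SHEET_EDIT_REGEX = /^(https:\/\/docs\.google\.com\/spreadsheets\/d\/[\w-]+)\/edit/;

// Hypothetical helper: rewrite a share link to the CSV export endpoint
// and return any other URL unchanged.
function toCsvExportUrl(url) {
    const match = url.match(SHEET_EDIT_REGEX);
    return match ? `${match[1]}/gviz/tq?tqx=out:csv` : url;
}

// Hypothetical helper: normalize sources before URL extraction.
// Plain strings and sources without requestsFromUrl pass through untouched.
function normalizeSources(sources) {
    return sources.map((source) => {
        if (typeof source === 'string' || !source.requestsFromUrl) return source;
        return { ...source, requestsFromUrl: toCsvExportUrl(source.requestsFromUrl) };
    });
}

const sources = normalizeSources([
    { url: 'https://example.com' },
    { requestsFromUrl: 'https://docs.google.com/spreadsheets/d/11UGSBOSXy5Ov2WEP9nr4kSIxQJmH18zh-5onKtBsovU/edit?usp=sharing' },
]);

console.log(sources[1].requestsFromUrl);
```

Doing the rewrite at the source level would keep the extraction regex generic, which addresses the concern about hardcoding Google-specific logic into the download function itself, though it still places a Google-specific rule somewhere in the pipeline.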

zpelechova commented 2 years ago

Hey, it does take the input from the URL at the moment, but it also picks up a lot of unrelated Google URLs, see here: https://console.apify.com/admin/users/xRGg9iAfJSymqartk/tasks/eaUCBXOfaYgzwAcDB#/runs/Lo4IEhIFzEpNfBOtS . @B4nan