apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0
15.39k stars · 662 forks

Error writing file '****' in directory '****/apify_storage/key_value_stores/default' referred by APIFY_LOCAL_STORAGE_DIR environment variable: ENOENT: no such file or directory, scandir #320

Closed · maximepvrt closed 5 years ago

maximepvrt commented 5 years ago

Hi, I get an error if I reassign the city attribute using an awaited API call; if I remove that, my code works fine... Any idea? Best regards

async function getCityId(city) {
    const response = await axios.get('https://geo.api.gouv.fr/communes', {
        params: {
            nom: city,
        },
    });
    return response.data[0].code;
}

const b = await getCityId(a.city);
a.city = b;
console.log(a);
await Apify.setValue(request.url.split('/').pop(), a);
mnmkng commented 5 years ago

Hi @maximepvrt, does the directory

.../apify_storage/key_value_stores/default

exist or not in your project folder?

Did you create your project using apify create and did you run it using apify run?

mnmkng commented 5 years ago

Also, could you provide more context? E.g. the a variable is not defined in the example. Are there other places in the code that mutate the a variable?

Perhaps the issue lies in parallel crawling and the subsequent parallel mutation of the a object.
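To illustrate the hazard being suggested here, a hypothetical sketch (not the OP's actual code): two handlers running in parallel mutate one shared object, so the value that ends up saved can belong to a different request than intended.

```javascript
// Hypothetical sketch: parallel handlers mutating one shared object.
const shared = { city: 'initial' };

async function handle(requestId, newCity) {
    shared.city = newCity; // mutation happens before the async work
    await new Promise((resolve) => setTimeout(resolve, 10)); // simulated API call
    return { requestId, savedCity: shared.city }; // value that would be persisted
}

async function main() {
    // A crawler runs handlers concurrently, much like this Promise.all:
    const results = await Promise.all([
        handle('r1', 'Paris'),
        handle('r2', 'Lyon'),
    ]);
    console.log(results); // both handlers see 'Lyon', the last write
}

main();
```

Giving each request its own copy of the object (or keeping per-request data in request.userData) avoids this interleaving.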

maximepvrt commented 5 years ago

Thanks for the quick response :-) Yes, the directory exists, and my script worked correctly before. I was editing my script to update the city in my scraped object via an API. The console.log call works and shows the modified city, but it cannot be saved in key_value_stores. If I remove the city update, it works again... I don't understand :(

maximepvrt commented 5 years ago

Give me your email if you want the complete file to test :p Thanks

mnmkng commented 5 years ago

You can reach me at ondra@apify.com

maximepvrt commented 5 years ago

Email sent 🛫

mnmkng commented 5 years ago

I ran your code and it works fine for me, both with and without reassignment of the city attribute.

What version of Apify are you using? Is it possible that the path you're getting is wrong somehow? Perhaps an extra / somewhere? Did you try to reinstall all packages after deleting package-lock.json?

Just as a note. The return value of handlePageFunction is discarded. If you wish to save the crawled data, you should use await Apify.pushData() which saves data to the default dataset.
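A minimal, self-contained sketch of that point, using a toy in-memory dataset as a stand-in for the real SDK (names here are illustrative, not the Apify API):

```javascript
// Toy in-memory stand-in for Apify.pushData(); not the real SDK.
const dataset = [];
const pushData = async (item) => { dataset.push(item); };

// The crawler awaits handlePageFunction but ignores its resolved value,
// so only the explicit pushData() call persists anything.
async function handlePageFunction({ request }) {
    const item = { url: request.url, title: 'Example' };
    await pushData(item); // persisted
    return item;          // discarded by the crawler
}

async function crawl(requests) {
    for (const request of requests) {
        await handlePageFunction({ request }); // return value ignored here
    }
}

crawl([{ url: 'https://example.com' }]).then(() => console.log('items saved:', dataset.length));
```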

Also, you can use the request.userData object to store request-scoped data. So instead of using startsWith(...), you can for example do:

await requestQueue.addRequest({ url: 'some-url', userData: { label: 'my-page' } });

// ... in handlePageFunction
const { label } = request.userData;
if (label === 'my-page') { ... }

// ... or even better, using a router object with a function for each label
router[label](page, ...);

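The label/router pattern above can be sketched as a plain object mapping labels to handler functions (a hypothetical illustration; the names are not part of the Apify API):

```javascript
// Hypothetical router: a plain object mapping each label to a handler.
const router = {
    'my-page': async (page) => `handled my-page: ${page.url}`,
    'detail': async (page) => `handled detail: ${page.url}`,
};

// Dispatch inside a handler using the label stored in request.userData.
async function handlePage(request, page) {
    const { label } = request.userData;
    return router[label](page);
}

handlePage({ userData: { label: 'detail' } }, { url: 'https://example.com' })
    .then(console.log); // → handled detail: https://example.com
```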
maximepvrt commented 5 years ago

Thanks for your help! After removing node_modules and yarn.lock, it works for me too :-) I was already using the latest version of apify. It's crazy!

Thanks for the optimization recommendations!