apify / apify-storage-local-js

Local emulation of the apify-client NPM package, which enables local use of the Apify SDK.

Question: huge SQLite files? #23

Open · persona0591 opened this issue 3 years ago

persona0591 commented 3 years ago

Hi guys,

Thanks for your awesome framework! 🥇

I have a question: I'm using the PuppeteerCrawler with local storage. However, I noticed that the SQLite files backing the storage grow rapidly in size, notably the db.sqlite-wal file. Take the snippet below for example (not production code, just for illustration):

'use strict';

const Apify = require('apify');
const fs = require('fs');
const os = require('os');

Apify.main(async () => {
    const localStorageDir = os.tmpdir() + '/apify';
    process.env.APIFY_LOCAL_STORAGE_DIR = localStorageDir;

    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://example.org' });
    await requestQueue.addRequest({ url: 'https://example.com' });
    await requestQueue.addRequest({ url: 'https://example.de' });
    await requestQueue.addRequest({ url: 'https://example.ru' });
    await requestQueue.addRequest({ url: 'https://example.gov' });

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        maxRequestRetries: 3,
        launchContext: {
            launchOptions: {
                headless: true,
                args: ['--no-sandbox'],
            },
        },
        handlePageFunction: async ({ request, response, page }) => {
            console.log('Handled!');
        },
        handleFailedRequestFunction: async ({ request, error }) => {
            console.log(error);
        },
    });

    await crawler.run();

    const pathToRequestQueueFiles = localStorageDir + '/request_queues/default';
    const filenames = await fs.promises.readdir(pathToRequestQueueFiles);
    for (const filename of filenames) {
        const fullFilename = pathToRequestQueueFiles + '/' + filename;
        // Report each file's size in MB (bytes / 1024 / 1024).
        const { size } = await fs.promises.stat(fullFilename);
        console.log(`File: ${fullFilename}; size: ${size / (1024 ** 2)} MB`);
    }
});

The code lists the SQLite request queue files and their sizes. When run, it crawls a number of pages, including pages on non-existing or erroneous domains, to force the crawler to retry them. The output (on my machine):

File: /tmp/apify/request_queues/default/db.sqlite; size: 4 MB
File: /tmp/apify/request_queues/default/db.sqlite-shm; size: 32 MB
File: /tmp/apify/request_queues/default/db.sqlite-wal; size: 599.5234375 MB

The db.sqlite-wal file is huge (for just a couple of crawls). Unfortunately, I'm running my crawler in a low-storage environment and am running out of disk space.

Is this something that can be solved? For example, would it be possible to use an in-memory database? Or would it be possible to not create this db.sqlite-wal file (or to have an option to not create it)?

Many thanks!

mnmkng commented 3 years ago

Whoah, that's a big file. The wal file is a write-ahead log that SQLite uses as a speed optimization. We'll add an option to turn it off in this library shortly, and we'll also make it configurable from the Apify SDK. Should be done in two weeks, I guess.
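For context, disabling WAL mode in SQLite comes down to a single pragma. Below is a minimal sketch using better-sqlite3, the SQLite driver this library builds on; the database path is illustrative:

'use strict';

const Database = require('better-sqlite3');

// Open the request queue database (path is illustrative).
const db = new Database('/tmp/apify/request_queues/default/db.sqlite');

// Switch back to the default rollback journal; SQLite then stops
// using the -wal/-shm side files for this database.
db.pragma('journal_mode = DELETE');

// Alternatively, keep WAL mode but compact the -wal file on demand:
// db.pragma('wal_checkpoint(TRUNCATE)');

console.log(db.pragma('journal_mode', { simple: true })); // prints 'delete'
db.close();

As for the in-memory idea suggested above, SQLite supports it natively (new Database(':memory:') in better-sqlite3), though the queue would then not survive a restart.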

Nice issue btw, thanks. Reads well, has all the important info.

persona0591 commented 3 years ago

Hi Ondra, thank you! I'm looking forward to this option.

persona0591 commented 3 years ago

Hi Ondra, out of curiosity: I've noticed that the ApifyStorageLocal class now has a new option, enableWalMode, which seems to resolve my issue! 🥳

However, do you know how I can use this option from the Apify SDK? For example, can I set an environment variable for this (similar to APIFY_LOCAL_STORAGE_DIR)?
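For readers finding this later: until the SDK wiring landed, the option could be set on the storage class directly. A minimal sketch, assuming the @apify/storage-local package name and a storageDir option alongside the enableWalMode option described above:

'use strict';

const { ApifyStorageLocal } = require('@apify/storage-local');

// Create the local storage client with WAL mode disabled, so the
// request queue database uses a plain rollback journal and no
// db.sqlite-wal file is created. (Option names assumed per this thread.)
const storageLocal = new ApifyStorageLocal({
    storageDir: '/tmp/apify',
    enableWalMode: false,
});

How to hand this instance over to the Apify SDK is exactly what the PR mentioned below addresses.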

mnmkng commented 3 years ago

Hey @persona0591, we'll have a PR that will allow you to configure this option in the SDK soon.

persona0591 commented 3 years ago

Thanks for the update, Ondra!