apify / apify-storage-local-js

Local emulation of the apify-client NPM package, which enables local use of the Apify SDK.

SDK_CRAWLER_STATISTICS_0 values are cumulative - not being overwritten #33

Closed MSIH closed 3 years ago

MSIH commented 3 years ago

When running locally on Linux, if I run the same job (npm run start) twice, the file SDK_CRAWLER_STATISTICS_0 does not get overwritten; instead, the data values are added cumulatively. Specifically, itemCount: run the first time and crawl 10 items, itemCount=10; run a second time and crawl 10 items, itemCount=20.

    const getStats = await Apify.getValue('SDK_CRAWLER_STATISTICS_0');
    console.dir(getStats);

See https://github.com/apify/apify-js/blob/master/src/crawlers/statistics.js

mnmkng commented 3 years ago

Yes, that is expected. If you want separate stats for each run, you should clear the storages: either by running apify run -p, by deleting the file, or programmatically by deleting this specific value from the key-value store.

MSIH commented 3 years ago

I understand that this is how the SDK works, but I would not say most people would consider that "expected" behavior when running locally. On the Apify platform, each run has its own storage folder, but that is not the case when running locally.

I would have thought that the last number would increment each time the crawl was run.

I did a workaround:

    const getStats = await Apify.getValue('SDK_CRAWLER_STATISTICS_0');

    if (getStats !== null) {
        // derive seconds/minutes/hours from the millisecond values
        getStats.requestMinDurationPerSeconds = getStats.requestMinDurationMillis / 1000;
        getStats.requestMaxDurationPerSeconds = getStats.requestMaxDurationMillis / 1000;
        getStats.requestAvgFinishedDurationPerSeconds = getStats.requestAvgFinishedDurationMillis / 1000;
        getStats.requestTotalDurationMinutes = getStats.requestTotalDurationMillis / 1000 / 60;
        getStats.requestTotalDurationHours = getStats.requestTotalDurationMillis / 1000 / 60 / 60;
        // requests per minute (assumes the persisted stats include requestsFinished)
        getStats.requestsPerMinute = getStats.requestsFinished / getStats.requestTotalDurationMinutes;

        // open the perfDataStorage key-value store
        const perfDataStorage = await Apify.openKeyValueStore('perfDataStorage');

        // save the perf data under a key named after the dataset
        await perfDataStorage.setValue(datasetTitle, { getStats });

        // reset SDK_CRAWLER_STATISTICS_0 so the next run starts fresh
        await Apify.setValue('SDK_CRAWLER_STATISTICS_0', null);
        console.dir(getStats);
    }
mnmkng commented 3 years ago

Yeah, it's not very intuitive, but if we overwrote the stats automatically, then other users who do incremental crawls would not be able to track total crawling stats. We chose to support both repeated and incremental crawls, at the cost of a less intuitive interface for the repeated crawls.
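The trade-off described here can be illustrated with a tiny sketch: whether persisted stats should be reused is a per-project decision, so a crawl can make it explicit at startup. The function name and the incremental flag are hypothetical, purely for illustration:

```javascript
// Decide what to do with previously persisted crawler statistics.
// - incremental crawl: keep the cumulative totals across runs
// - repeated crawl: start from a clean slate (itemCount back to 0)
function resolveInitialStats(persistedStats, { incremental }) {
    return incremental ? persistedStats : null;
}
```

A repeated crawl would then write the resolved value (null) back to the key-value store before crawling, which is exactly the programmatic reset suggested earlier in the thread.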