autogram-is / spidergram

Structural analysis tools for complex web sites
GNU General Public License v3.0

General-purpose Dataset and KeyValueStore classes #58

Closed by eaton 1 year ago

eaton commented 1 year ago

This patch adds two new utility classes, Dataset and KeyValueStore, that can be used for general-purpose data squirreling during crawls.

// Store a page's HTML under its resource key
const kvs = await KeyValueStore.open('html_storage');
await kvs.setValue(resource.key, html);

// Append records to a dataset
const ds = await Dataset.open(); // If no name is passed in, 'default' is used
await ds.pushItem([
    { col1: 'data', col2: 'more data' },
    { col1: 'another record', col2: 'its data' },
    // ...
]);

// Read the stored data back
const datasetValues = await ds.getItems();
const resourceHtml = await kvs.getItem(resource.key) as string;

Both classes are backed by ArangoDB collections. Their static factory functions open an existing named Dataset or KeyValueStore, or create one with the requested name if it doesn't already exist. Old data can be cleared out using the empty() or drop() methods on either class. Because the data is stored in vanilla ArangoDB collections, it can also be used in custom queries.
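
For anyone who wants to see the open-or-create behavior in isolation, here is a rough in-memory sketch. It is not the code in this patch: SketchKeyValueStore and its exists() helper are hypothetical names, a Map stands in for the ArangoDB database, and the methods are synchronous for brevity where the real ones are async. It also assumes that empty() clears the records but keeps the collection, while drop() removes the collection itself.

```typescript
// Hypothetical in-memory stand-in, NOT the implementation in this patch.
// A Map plays the role of the database; each entry is one named "collection".
class SketchKeyValueStore {
  // Registry of named stores, standing in for the database's collections.
  private static collections = new Map<string, Map<string, unknown>>();

  private constructor(readonly name: string) {}

  // Open the named store, creating its backing collection on first use.
  static open(name = 'default'): SketchKeyValueStore {
    if (!SketchKeyValueStore.collections.has(name)) {
      SketchKeyValueStore.collections.set(name, new Map());
    }
    return new SketchKeyValueStore(name);
  }

  // Hypothetical helper: report whether a backing collection exists.
  static exists(name: string): boolean {
    return SketchKeyValueStore.collections.has(name);
  }

  // Look up the backing collection, failing loudly if it was dropped.
  private get data(): Map<string, unknown> {
    const data = SketchKeyValueStore.collections.get(this.name);
    if (!data) throw new Error(`Store '${this.name}' has been dropped`);
    return data;
  }

  setValue(key: string, value: unknown): void {
    this.data.set(key, value);
  }

  getItem(key: string): unknown {
    return this.data.get(key);
  }

  // Assumed semantics: remove every record, keep the collection.
  empty(): void {
    this.data.clear();
  }

  // Assumed semantics: remove the collection itself.
  drop(): void {
    SketchKeyValueStore.collections.delete(this.name);
  }
}

// Two handles opened with the same name share one backing collection.
const writer = SketchKeyValueStore.open('html_storage');
writer.setValue('page-1', '<html></html>');
const reader = SketchKeyValueStore.open('html_storage');
console.log(reader.getItem('page-1'));
```

The point of the sketch is that the name is the identity of the store: a second open() with the same name sees data written through the first handle, which is what makes the stored data reachable later from other code or from custom queries.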

These two classes behave similarly to the classes of the same name in Crawlee. That's no accident: one of our future to-dos is adding a Crawlee StorageProvider for ArangoDB, so that its crawl status data is queryable in the DB rather than stored on disk. In the meantime, though, these classes are useful general-purpose utilities.