This patch adds two new utility classes, Dataset and KeyValueStore, that can be used for general-purpose data squirreling during crawls.
Datasets are meant to hold many simple records of the same structure (for example, columnar data accumulated during each page visit, or analytics numbers retrieved from an API).
KeyValueStores are simple dictionaries that can be used to store specific information that's outside the domain of existing Spidergram entities, or to offload bulky properties from existing entities. (For example, moving the body property out of Resource into a separate KeyValueStore sped up some of our analytical queries by 5-10x on a large dataset.)
// Open (or create) a named KeyValueStore and store a page's HTML under its resource key
const kvs = await KeyValueStore.open('html_storage');
await kvs.setValue(resource.key, html);

// Open (or create) a Dataset; if no name is passed in, 'default' is used
const ds = await Dataset.open();
await ds.pushItem([
  { col1: 'data', col2: 'more data' },
  { col1: 'another record', col2: 'its data' },
  // ...
]);

// Read the stored data back
const datasetValues = await ds.getItems();
const resourceHtml = await kvs.getItem(resource.key) as string;
Both classes are backed by Arango collections. Their static factory functions open an existing named Dataset or KeyValueStore, or create one with the requested name if it doesn't already exist. Old data can be cleared out using the empty() or drop() methods on the Dataset and KeyValueStore classes. Because the data is stored in vanilla ArangoDB collections, it can also be used in custom queries.
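As a rough illustration of the open-or-create behavior described above, here's a minimal in-memory sketch. It is not Spidergram's implementation: SketchDataset and the collections Map are hypothetical stand-ins for an Arango-backed collection, and the split between empty() (clear the records, keep the collection) and drop() (remove the collection entirely) is an assumption about how the two methods differ.

```typescript
// Hypothetical in-memory stand-in for the database: collection name -> records.
// The real classes are backed by ArangoDB collections; this Map only mimics that.
const collections = new Map<string, unknown[]>();

class SketchDataset<T> {
  readonly name: string;

  private constructor(name: string) {
    this.name = name;
  }

  // Open-or-create: reuse the named collection if present, otherwise create it.
  static async open<T>(name = 'default'): Promise<SketchDataset<T>> {
    if (!collections.has(name)) {
      collections.set(name, []);
    }
    return new SketchDataset<T>(name);
  }

  // Look up the backing array, failing loudly if the collection was dropped.
  private records(): T[] {
    const found = collections.get(this.name);
    if (found === undefined) {
      throw new Error(`Collection '${this.name}' has been dropped`);
    }
    return found as T[];
  }

  async pushItem(items: T[]): Promise<void> {
    this.records().push(...items);
  }

  // Returns a copy so callers cannot mutate the stored records by accident.
  async getItems(): Promise<T[]> {
    return [...this.records()];
  }

  // Assumed semantics: remove every record but keep the collection itself.
  async empty(): Promise<void> {
    this.records().length = 0;
  }

  // Assumed semantics: remove the collection; a later open() recreates it empty.
  async drop(): Promise<void> {
    collections.delete(this.name);
  }
}
```

Under these assumptions, opening the same name twice yields two handles onto the same records, so data pushed through one handle is visible through the other.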
These two classes behave similarly to the classes of the same name in Crawlee. That's no accident: one of our future to-dos is adding a Crawlee StorageProvider for ArangoDB, so its crawl status data is queryable in the DB rather than stored on disk. In the meantime, though, these classes are useful general-purpose utilities.