alephdata / memorious

Lightweight web scraping toolkit for documents and structured data.
https://docs.alephdata.org/developers/memorious
MIT License

Store tags to a persistent SQL store #130

Closed. pudo closed this 4 years ago

pudo commented 4 years ago

This is a draft; I would appreciate your feedback, @sunu. It stores the tags to SQL but does not explicitly store which crawler a tag belongs to, meaning there can be no tag flushing for a whole dataset.

I'm trying to decide how big a fault/downside that is. What we gain is simplicity.

Update: so I've added support for tags.delete(prefix=crawler.name) to servicelayer.
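To illustrate the idea of flushing a crawler's tags by key prefix, here is a minimal sketch using SQLite from the Python standard library. It is not the servicelayer implementation: the `TagStore` class, the `tags` table layout, and the `crawler:tag` key convention are all assumptions made for this example. Only the `delete(prefix=...)` call shape comes from the comment above.

```python
import sqlite3


class TagStore:
    """Hypothetical key/value tag store backed by SQL (illustration only)."""

    def __init__(self, conn):
        self.conn = conn
        # Assumed schema: one row per tag, keyed by a string.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS tags (key TEXT PRIMARY KEY, value TEXT)"
        )

    def set(self, key, value):
        self.conn.execute(
            "INSERT OR REPLACE INTO tags (key, value) VALUES (?, ?)",
            (key, value),
        )

    def get(self, key):
        row = self.conn.execute(
            "SELECT value FROM tags WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

    def delete(self, prefix):
        # Compare the leading characters of each key to the prefix.
        # This avoids LIKE, whose wildcards (% and _) would need escaping
        # and which is case-insensitive for ASCII in SQLite by default.
        cur = self.conn.execute(
            "DELETE FROM tags WHERE substr(key, 1, ?) = ?",
            (len(prefix), prefix),
        )
        return cur.rowcount  # number of tags removed


store = TagStore(sqlite3.connect(":memory:"))
# Assumed convention: keys start with the crawler name plus a separator.
store.set("news:page-1", "seen")
store.set("news:page-2", "seen")
store.set("news_archive:page-1", "seen")
removed = store.delete(prefix="news:")
```

Because the crawler is only implied by the key, a bare prefix such as `news` would also match `news_archive`. Including a separator in the prefix, as assumed here, is one way to keep the flush scoped to a single crawler.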

sunu commented 4 years ago

Looks great! I'm keen to move Events to the SQL store too. Should that also be done through servicelayer?

pudo commented 4 years ago

I'm keen to understand how that would interact with the servicelayer.reporting module written by @simonwoerpel - it produces the same sort of data but has no persistence (yet).

sunu commented 4 years ago

> I'm keen to understand how that would interact with the servicelayer.reporting module written by @simonwoerpel - it produces the same sort of data but has no persistence (yet).

I'll take a look at that to get an idea of how well it fits with the data from memorious events.

Besides that, is this OK to merge, or do we need to make any servicelayer-related config changes in the prod configs before this is released?

pudo commented 4 years ago

This should be ready for prod. I've tested the implementation mainly with Aleph, but it seems stable.

What I'd propose we do in terms of upgrade is to a) wait until there are few crawlers running, b) remove all the crawler deployments, c) delete crawlers-redis from disk and remove the backup function for it (it is no longer needed), d) bring this up fresh.