add background workers that Do The Thing

cldellow / datasette-scraper

Add website scraping abilities to Datasette

Apache License 2.0


add background workers that Do The Thing #7

Closed: cldellow closed this issue 1 year ago

cldellow commented 1 year ago

datasette-scale-to-zero uses an asyncio loop that we could take inspiration from.

Ultimately, to make full use of a multicore machine I think we're going to need to use multiprocessing.

The actual writes will have to go through the datasette writer (https://github.com/simonw/datasette/blob/867e0abd3429f837d5f15e6843a38f848ee562f0/docs/internals.rst#await-dbexecute_writesql-paramsnone-blocktrue), but maybe the workers can open read-only connections to the database (see https://docs.datasette.io/en/stable/internals.html#internals-database for how to get the database path).
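A minimal sketch of what a worker-side read-only connection might look like, using only the standard library's sqlite3 module. The function name `open_read_only` and the `db_path` parameter are hypothetical; how the path is obtained from Datasette is left out.

```python
import sqlite3
from pathlib import Path


def open_read_only(db_path: str) -> sqlite3.Connection:
    """Open an SQLite file read-only (hypothetical helper, not part of the plugin)."""
    # as_uri() yields a percent-encoded file: URI; mode=ro asks SQLite
    # to refuse writes on this connection.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

Reads on such a connection behave normally, while any attempted write should fail with `sqlite3.OperationalError`, which makes accidental writes from a worker easy to spot.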

I think to have the writes go through the datasette writer we'll need some form of IPC: https://docs.python.org/3/library/multiprocessing.html#exchanging-objects-between-processes

Maybe: one queue that all workers use to send requests to the parent, plus N queues (one per worker) that the workers use to receive responses. The parent spawns an asyncio worker that exists just to do the IPC.
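Here is a rough sketch of that queue layout: one shared request queue, one response queue per worker, and a parent-side asyncio task that only shuttles messages. Everything named here (`worker`, `handle_write`, `ipc_loop`, `demo`) is hypothetical, and `handle_write` merely stands in for whatever the parent would do with Datasette's write API.

```python
import asyncio
import multiprocessing as mp


def worker(worker_id, requests, responses):
    """Runs in a child process: ask the parent for a write, then wait for the reply."""
    requests.put((worker_id, f"row from worker {worker_id}"))
    return responses.get(timeout=10)


async def handle_write(payload):
    """Hypothetical stand-in for `await db.execute_write(...)` in the parent."""
    return f"wrote {payload!r}"


async def ipc_loop(requests, response_queues, expected):
    """Parent-side asyncio task: drain the shared queue and route each reply."""
    loop = asyncio.get_running_loop()
    handled = []
    while len(handled) < expected:
        # Queue.get blocks, so run it in the default thread pool executor.
        worker_id, payload = await loop.run_in_executor(
            None, lambda: requests.get(timeout=10)
        )
        result = await handle_write(payload)
        response_queues[worker_id].put(result)
        handled.append(worker_id)
    return handled


def demo(n_workers=2):
    requests = mp.Queue()
    response_queues = [mp.Queue() for _ in range(n_workers)]
    procs = [
        mp.Process(target=worker, args=(i, requests, response_queues[i]))
        for i in range(n_workers)
    ]
    for p in procs:
        p.start()
    handled = asyncio.run(ipc_loop(requests, response_queues, expected=n_workers))
    for p in procs:
        p.join()
    return sorted(handled)


if __name__ == "__main__":
    print(demo())
```

Tagging each request with the worker's id is what lets the parent pick the right response queue; a real version would also need a shutdown message and error handling for workers that die mid-request.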

cldellow commented 1 year ago

...maybe we spawn a single background worker that does all the coordination, and it spawns multiprocessing processes to do each loop of the plugin system?

Hm, how does a crawl get started? Who is responsible for actually calling get_seed_urls?

- the datasette plugin when it reacts to the user's event
- the coordinator (how does it know?)
- a worker (we may not have any "live" workers?)

I think the coordinator should do it. The app can signal the coordinator to do the thing by putting a message in its queue.
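As a sketch of the "put a message in its queue" idea, assuming the coordinator runs as an asyncio task in the same process as the app: the message shape (`type`, `job_id`) and the `stop` message are invented for illustration, and the place where get_seed_urls would be called is only marked with a comment.

```python
import asyncio


async def coordinator(inbox: asyncio.Queue, started: list):
    """Hypothetical coordinator loop: react to messages placed in its inbox."""
    while True:
        message = await inbox.get()
        if message["type"] == "stop":
            break
        if message["type"] == "start_crawl":
            # A real coordinator would call the get_seed_urls hooks (or spawn
            # a worker to do so) here; this sketch only records the job id.
            started.append(message["job_id"])


async def main():
    inbox = asyncio.Queue()
    started = []
    task = asyncio.create_task(coordinator(inbox, started))
    # The app signals the coordinator by enqueueing a message.
    await inbox.put({"type": "start_crawl", "job_id": 1})
    await inbox.put({"type": "stop"})
    await task
    return started


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The app never calls the coordinator directly; it only enqueues a message, so the coordinator stays the single place that decides when seeding happens.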

cldellow commented 1 year ago

Actually, it's not even clear that writes have to go through the datasette writer?

I guess at some level of concurrency, contention for the file lock might drastically reduce throughput.

cldellow commented 1 year ago

Let's start with the workers opening write connections to the sqlite database, and see if that causes grief. In particular, I wonder if stale reads will be an issue?

If the act of finding an item to work on is expensive, perhaps we'll do that in a read-only connection, then do something like UPDATE work_items SET claimed = true WHERE id = XXX AND NOT claimed, and retry if someone stole it from underneath us.
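The claim-and-retry idea could look roughly like this with the standard sqlite3 module. The `work_items` table and `claimed` column come from the comment above; the function names and the two-connection split are assumptions, and `claimed` is stored as 0/1 rather than true/false for portability across SQLite versions.

```python
import sqlite3


def find_candidate(read_conn):
    """Look for an unclaimed item; this is the (possibly expensive) read step."""
    rows = read_conn.execute(
        "SELECT id FROM work_items WHERE NOT claimed ORDER BY id LIMIT 1"
    ).fetchall()
    return rows[0][0] if rows else None


def try_claim(write_conn, item_id):
    """Claim the item only if nobody else has; True means we got it."""
    cur = write_conn.execute(
        "UPDATE work_items SET claimed = 1 WHERE id = ? AND NOT claimed",
        (item_id,),
    )
    write_conn.commit()
    # rowcount is 0 when another worker claimed the row first.
    return cur.rowcount == 1


def claim_next(read_conn, write_conn):
    """Retry until an item is claimed or no unclaimed items remain."""
    while True:
        item_id = find_candidate(read_conn)
        if item_id is None:
            return None
        if try_claim(write_conn, item_id):
            return item_id
```

The `AND NOT claimed` guard is what makes the update safe: if the read was stale, the update touches zero rows and the caller simply looks for another item.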

cldellow commented 1 year ago

How should the overall architecture work?

Imagine you click "Start job" in the UI. This will:

1. create an entry in _dss_job, with status initial
2. enqueue a message to the coordinator to check for jobs that need to be initialized
3. the coordinator will spawn a seeding-specific background worker whose only job is to initialize the crawl
4. the worker will call the get_seed_urls hooks and insert the results into _dss_crawl_queue
5. the worker will tell the coordinator that it has seeded
6. the worker will shut down (?)
7. the coordinator will spawn, if needed, up to N crawl/extract workers
8. those workers will run while there is data in _dss_crawl_queue; when that runs out, they'll terminate
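To make the flow above concrete, here is a single-process sketch against an in-memory SQLite database. The table names come from the steps above, but the column layout, the `seeded` status value, and the function names are assumptions, and the coordinator and process spawning are omitted entirely.

```python
import sqlite3

# Hypothetical schema; the real plugin's columns may differ.
SCHEMA = """
CREATE TABLE _dss_job (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE _dss_crawl_queue (
    id INTEGER PRIMARY KEY, job_id INTEGER NOT NULL, url TEXT NOT NULL
);
"""


def start_job(conn):
    """What clicking "Start job" would do: create a job with status initial."""
    cur = conn.execute("INSERT INTO _dss_job (status) VALUES ('initial')")
    conn.commit()
    return cur.lastrowid


def seed_job(conn, job_id, get_seed_urls):
    """Seeding worker: call the hook and enqueue the URLs it returns."""
    urls = get_seed_urls(job_id)
    conn.executemany(
        "INSERT INTO _dss_crawl_queue (job_id, url) VALUES (?, ?)",
        [(job_id, url) for url in urls],
    )
    # 'seeded' is an invented status name for this sketch.
    conn.execute("UPDATE _dss_job SET status = 'seeded' WHERE id = ?", (job_id,))
    conn.commit()


def crawl_worker(conn, fetch):
    """Crawl worker: process queue entries until the queue is empty, then return."""
    processed = []
    while True:
        rows = conn.execute(
            "SELECT id, url FROM _dss_crawl_queue ORDER BY id LIMIT 1"
        ).fetchall()
        if not rows:
            return processed
        item_id, url = rows[0]
        fetch(url)
        conn.execute("DELETE FROM _dss_crawl_queue WHERE id = ?", (item_id,))
        conn.commit()
        processed.append(url)
```

In the design described above each of these functions would run in a different process, with the coordinator deciding when to start them; running them in sequence on one connection just shows how the two tables hand work from one stage to the next.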

Ok, probably missing a bunch of stuff, but let's give this a shot.

cldellow commented 1 year ago

The loop is implemented, so closing.