Closed cldellow closed 1 year ago
...maybe we spawn a single background worker that does all the coordination, and it spawns multiprocessing processes to do each loop of the plugin system?
Hm, how does a crawl get started? Who is responsible for actually calling `get_seed_urls`?
I think the coordinator should do it. The app can signal the coordinator to do the thing by putting a message in its queue.
Actually, it's not even clear that writes have to go through the datasette writer?
I guess at some level of concurrency, contention for the file lock might drastically reduce throughput.
Let's start with the workers opening write connections to the sqlite database, and see if that causes grief. In particular, I wonder if stale reads will be an issue?
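One way to reduce the grief from concurrent write connections is WAL mode plus a busy timeout: readers can proceed while a write is in flight, and writers wait for the lock instead of failing immediately with "database is locked". A minimal sketch (the helper name is mine, not part of the plugin):

```python
import sqlite3

def open_worker_conn(db_path):
    """Sketch: each worker opens its own write connection.

    WAL mode lets readers keep going while another connection writes,
    and busy_timeout makes a blocked writer wait for the lock rather
    than raising OperationalError right away.
    """
    conn = sqlite3.connect(db_path, timeout=10)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA busy_timeout=10000")  # wait up to 10s for the lock
    return conn
```

Note that WAL readers see a stable snapshot for the duration of a transaction, so "stale reads" here mostly means a worker acting on a snapshot that another worker has since changed -- which the claim-and-retry pattern below is meant to handle.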
If the act of finding an item to work on is expensive, perhaps we'll do that in a read-only connection, then do something like `UPDATE work_items SET claimed = true WHERE id = XXX AND NOT claimed`, and retry if someone stole it from underneath us.
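The claim-and-retry idea could look something like this -- the `work_items` table and `claimed` column are the hypothetical names from the comment above, not an actual schema:

```python
import sqlite3

def claim_item(conn):
    """Find a candidate cheaply, then claim it optimistically.

    The SELECT can run on a read-only snapshot; the UPDATE's
    `AND NOT claimed` guard means we only win if nobody beat us to it.
    """
    while True:
        row = conn.execute(
            "SELECT id FROM work_items WHERE NOT claimed LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # nothing left to claim
        cur = conn.execute(
            "UPDATE work_items SET claimed = 1 WHERE id = ? AND NOT claimed",
            (row[0],),
        )
        conn.commit()
        if cur.rowcount == 1:
            return row[0]  # we won the race
        # someone stole it from underneath us; loop and retry
```

`cursor.rowcount` after the UPDATE tells us whether our guard matched, so no explicit locking is needed beyond sqlite's own.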
How should the overall architecture work?
Imagine you click "Start job" in the UI. This will:
- run the initial `get_seed_urls` hooks and insert the results into `_dss_crawl_queue`
- spawn workers that consume items from `_dss_crawl_queue` -- when that runs out, they'll terminate

Ok, probably missing a bunch of stuff, but let's give this a shot.
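The worker side of that loop might be sketched like this -- the `(id, url)` schema for `_dss_crawl_queue` is an assumption, and `process` stands in for whatever actually fetches a page and enqueues discovered URLs:

```python
import sqlite3

def worker_loop(conn, process):
    """Hypothetical worker: drain _dss_crawl_queue, then terminate.

    `process(url)` is a placeholder for the real crawl step (fetch the
    page, run plugin hooks, insert any newly discovered URLs).
    """
    while True:
        row = conn.execute(
            "SELECT id, url FROM _dss_crawl_queue ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            break  # queue drained -- worker terminates
        process(row[1])
        conn.execute("DELETE FROM _dss_crawl_queue WHERE id = ?", (row[0],))
        conn.commit()
```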
The loop is implemented, so closing.
datasette-scale-to-zero uses an asyncio loop that we could be inspired by
Ultimately, to make full use of a multicore machine I think we're going to need to use multiprocessing.
The actual writes will have to go through the datasette writer (https://github.com/simonw/datasette/blob/867e0abd3429f837d5f15e6843a38f848ee562f0/docs/internals.rst#await-dbexecute_writesql-paramsnone-blocktrue), but maybe the workers can open read-only connections to the database (see https://docs.datasette.io/en/stable/internals.html#internals-database for how to get the path).
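Opening a read-only connection is straightforward with sqlite's URI syntax; `mode=ro` makes the connection refuse writes outright rather than contending for the write lock:

```python
import sqlite3

def open_readonly(db_path):
    """Open a read-only connection to the database datasette is serving.

    With mode=ro, any attempted write raises sqlite3.OperationalError
    instead of taking the lock.
    """
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
```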
I think to have the writes go through the datasette writer we'll need some form of IPC: https://docs.python.org/3/library/multiprocessing.html#exchanging-objects-between-processes
Maybe: 1 queue that all workers use to send requests to the parent. N queues that the workers use to receive responses. The parent spawns an asyncio worker that exists just to do the IPC.
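A sketch of that parent-side asyncio worker, under the message shape I'm assuming here (`(worker_id, sql, params)` requests, a `None` sentinel for shutdown). In production `requests` and the per-worker response queues would be `multiprocessing.Queue` instances, which share the same blocking put/get API used below:

```python
import asyncio

async def ipc_pump(db, requests, response_queues):
    """Asyncio task in the parent that exists just to do the IPC.

    Drains the single shared request queue, routes each write through
    datasette's writer (await db.execute_write(...)), and replies on
    the sending worker's own response queue.
    """
    loop = asyncio.get_running_loop()
    while True:
        # requests.get() blocks, so run it in a thread to keep the
        # event loop responsive
        msg = await loop.run_in_executor(None, requests.get)
        if msg is None:
            break  # shutdown sentinel
        worker_id, sql, params = msg
        result = await db.execute_write(sql, params)
        response_queues[worker_id].put(result)
```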