Open ryanday36 opened 4 months ago
Thanks for pointing this out. Definitely something to look into further. SQLite is pretty lightweight, and under a large number of concurrent reads and writes, especially during a fair-share update (which issues a lot of writes), concurrency can suffer.

I wonder if there is potential to optimize the command that updates the job usage values for all of the associations in a flux-accounting DB. Right now, the command iterates through every row in the `association_table`, and if it needs to make an update, it acquires a lock on the database to make that update. Repeat this process for a large number of users, and that's a lot of locks... I'll look into seeing if I can wrap all of the updates that would happen here into a single transaction that gets written to the database at once, instead of one per row.
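As a rough sketch of that idea, the per-row updates could be batched under one explicit transaction so the write lock is taken once rather than once per row. The table and column names below (`association_table`, `username`, `job_usage`) are illustrative, not necessarily the actual flux-accounting schema:

```python
import sqlite3

# Illustrative in-memory database standing in for the flux-accounting DB.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE association_table (username TEXT PRIMARY KEY, job_usage REAL)"
)
cur.executemany(
    "INSERT INTO association_table VALUES (?, ?)",
    [("alice", 0.0), ("bob", 0.0), ("carol", 0.0)],
)
conn.commit()

# Pre-computed usage updates for every association.
updates = [(10.5, "alice"), (3.2, "bob"), (7.7, "carol")]

# One explicit transaction: the database write lock is acquired once
# for the whole batch instead of once per row. The context manager
# commits on success and rolls back on exception.
with conn:
    cur.executemany(
        "UPDATE association_table SET job_usage = ? WHERE username = ?",
        updates,
    )
```

Beyond reducing lock churn, a single transaction also makes the usage update atomic: a query never observes a half-updated `association_table`.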
I've noticed that, on corona, `flux account` commands hang when the fairshare tasks in the `accounting` cron tab are running. E.g. `flux account view-user ...` usually runs in less than a second, but takes ~2.5 minutes when run at the same time as the accounting cron tab. That's not terrible, so this probably isn't the highest priority, but it would be good if those updates could be done in a way that doesn't block queries.

I also don't know how much scaling testing has been done with the `update-usage` and related scripts, but I do worry that this could be a larger issue as we move Flux to larger systems with, potentially, many more jobs.
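One possible way to keep reads from blocking behind the cron-driven writes is SQLite's write-ahead-log (WAL) journal mode, which lets readers see the last committed snapshot while a writer is mid-transaction. A minimal sketch; the file path and table schema are hypothetical, not the actual flux-accounting layout:

```python
import os
import sqlite3
import tempfile

# Hypothetical on-disk database (WAL mode requires a file, not :memory:).
path = os.path.join(tempfile.mkdtemp(), "accounting.db")

# isolation_level=None puts the connection in autocommit mode so the
# transaction can be managed explicitly with BEGIN/COMMIT.
writer = sqlite3.connect(path, isolation_level=None)
writer.execute("PRAGMA journal_mode=WAL")  # persists in the db file
writer.execute(
    "CREATE TABLE association_table (username TEXT, job_usage REAL)"
)

# Writer holds an open transaction...
writer.execute("BEGIN")
writer.execute("INSERT INTO association_table VALUES ('alice', 1.0)")

# ...while a second connection can still read the last committed
# snapshot rather than waiting on a lock.
reader = sqlite3.connect(path)
count_before = reader.execute(
    "SELECT COUNT(*) FROM association_table"
).fetchone()[0]

writer.execute("COMMIT")
count_after = reader.execute(
    "SELECT COUNT(*) FROM association_table"
).fetchone()[0]

print(count_before, count_after)
```

WAL still serializes writers against each other, so it wouldn't replace batching the updates, but it could stop a long `update-usage` run from stalling `flux account view-user` queries.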