lithorus opened 1 year ago
Hello, Jimmy! There are some other issues around speeding up server start and reducing memory usage so that a larger number of jobs can be kept.
Even when clearing out done jobs, the job database still holds about 10k jobs as a minimum, which takes about 10 minutes to load even when the files are cached. I've started an experiment replacing the Jsonwrite function with a Redis backend and will run some benchmarks to compare performance. It simply uses the file path as the key, which makes it a drop-in replacement.
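To make the idea concrete, here is a rough sketch of what such a drop-in could look like, assuming hiredis and a hypothetical `redis_json_write()` wrapper (names, paths and connection settings are illustrative, not the actual afserver code):

```cpp
// Sketch only: write the same JSON bytes that would normally go to a file
// into Redis, using the file path as the key (drop-in for the file write).
#include <hiredis/hiredis.h>
#include <string>
#include <cstdio>

bool redis_json_write(redisContext *ctx, const std::string &path, const std::string &json)
{
    // SET <path> <json>  -- the file path acts as a unique key
    redisReply *reply = static_cast<redisReply*>(
        redisCommand(ctx, "SET %s %b", path.c_str(), json.data(), json.size()));
    if (reply == nullptr)
    {
        fprintf(stderr, "Redis error: %s\n", ctx->errstr);
        return false;
    }
    bool ok = (reply->type == REDIS_REPLY_STATUS);
    freeReplyObject(reply);
    return ok;
}

int main()
{
    // Connection parameters are assumptions (local Redis on the default port).
    redisContext *ctx = redisConnect("127.0.0.1", 6379);
    if (ctx == nullptr || ctx->err) return 1;

    // The key below is an illustrative path, not a real afanasy store layout.
    redis_json_write(ctx, "jobs/000001/job.json", "{\"name\":\"example_job\"}");

    redisFree(ctx);
    return 0;
}
```

Reads would just be the mirror image (`GET <path>`), so the current on-disk layout maps 1:1 onto Redis keys.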
Hey @lithorus, did you get anywhere with this? We are hitting the same issue. It would be great to be able to offload the done jobs so that:
1) they are not all loaded at startup -> faster start -> faster solve cycles -> better responsiveness of afserver
2) they are stored in a database where we can keep them for an extended amount of time, restore them if needed, and see all the data they produced.
I had a look at replacing the file-based JSON storage with a NoSQL database (see the sketch below). This way it would keep the same structure, but I wanted to test whether it was actually any faster. The next step would be to create a proper database structure, but it will be hard to keep compatibility with the current implementation.
I might just do a hard fork with the database work to test it out. I'm also looking into why loading is so slow at times; startup of one of our afanasy servers with 17k jobs took at least an hour.
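As a rough illustration of the "same structure, different store" idea: the unchanged job JSON could be kept as one document per job in a document database. The sketch below uses mongocxx purely as an example backend; database, collection and field names are made up:

```cpp
// Sketch: store the job JSON unchanged as one document per job, so the data
// keeps its current structure while living in a database instead of files.
#include <mongocxx/instance.hpp>
#include <mongocxx/client.hpp>
#include <mongocxx/collection.hpp>
#include <mongocxx/uri.hpp>
#include <mongocxx/options/replace.hpp>
#include <bsoncxx/json.hpp>
#include <bsoncxx/builder/basic/document.hpp>
#include <bsoncxx/builder/basic/kvp.hpp>
#include <cstdint>
#include <string>

void store_job(mongocxx::collection &jobs, std::int64_t job_id, const std::string &job_json)
{
    using bsoncxx::builder::basic::kvp;
    using bsoncxx::builder::basic::make_document;

    // Upsert by job id; the document body is the existing JSON, parsed as-is.
    jobs.replace_one(make_document(kvp("_id", job_id)),
                     bsoncxx::from_json(job_json),
                     mongocxx::options::replace{}.upsert(true));
}

int main()
{
    mongocxx::instance inst{};  // the driver needs exactly one instance per process
    mongocxx::client client{mongocxx::uri{"mongodb://localhost:27017"}};
    auto jobs = client["afanasy"]["jobs"];  // database/collection names are made up

    store_job(jobs, 1, "{\"name\":\"example_job\"}");
    return 0;
}
```

Since the documents keep the existing JSON structure, moving to a proper schema could happen later without touching the job data itself.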
It's also worth noting that with this many jobs the server will crash if you change too many jobs at once, especially when changing priorities on several hundred jobs at a time.
The current database lives only in memory, and restarting the afanasy server service can take quite a long time, since it needs to re-load all jobs from the JSON files.
With a large job queue that can take up to 30 minutes.
The other issue with this kind of setup is that what's on disk does not match what's in memory (the recent problem with custom JSON data on blocks).
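For context on why restarts are so slow: the startup cost is essentially a full scan of the job store, something like the loop below over every job file, done before the server can start solving (a simplified sketch with an assumed store path, not the real implementation):

```cpp
// Simplified illustration of the startup cost: every stored job is a JSON
// file that has to be found, read and parsed before the server can solve.
#include <filesystem>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>
#include <cstdio>

int main()
{
    namespace fs = std::filesystem;
    const fs::path store = "/var/tmp/afanasy/jobs";   // store path is an assumption

    std::vector<std::string> raw_jobs;
    for (const auto &entry : fs::recursive_directory_iterator(store))
    {
        if (entry.path().extension() != ".json")
            continue;
        std::ifstream file(entry.path());
        std::ostringstream buf;
        buf << file.rdbuf();                 // one read per job file
        raw_jobs.push_back(buf.str());       // parsing/registration would follow here
    }
    printf("Loaded %zu job files.\n", raw_jobs.size());
    return 0;
}
```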
My suggestions would be:
Another solution could be something like a Redis database, which might make it easier to mimic the current implementation.