gearman / gearmand

http://gearman.org/

sqlite db is constantly accessed even though --store-queue-on-shutdown is set #369

Closed sni closed 1 year ago

sni commented 1 year ago

While stracing gearmand for another issue, I noticed gearmand constantly accessing its sqlite files:

[pid 119607] 15:12:35.778284 newfstatat(AT_FDCWD, ".../var/gearmand.db-wal", 0x7fff8c249b30, 0) = -1 ENOENT (No such file or directory)
[pid 119607] 15:12:35.780124 newfstatat(AT_FDCWD, ".../var/gearmand.db-journal", 0x7fff8c249b30, 0) = -1 ENOENT (No such file or directory)

I tracked it down to the Instance::done() function, which removes entries from the database. In my opinion this seems unnecessary, since I only want to dump the queue during shutdown. I disabled the done() function and everything works as before: the queue is dumped to disk on shutdown and read back during startup. Is there any reason for this? If not, I would prepare a PR to skip done() when store-queue-on-shutdown is enabled.
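The proposed guard could be sketched like this. This is a toy Python/sqlite3 model of the queue plugin's behavior, not gearmand's actual C++ code; the schema and function names are made up for illustration:

```python
import sqlite3

# Assumed retention-db schema; illustrative only, not gearmand's actual layout.
SCHEMA = "CREATE TABLE IF NOT EXISTS queue (unique_key TEXT PRIMARY KEY, data BLOB)"

def add(conn, key, data, store_on_shutdown):
    if not store_on_shutdown:
        # Classic persistent mode: every incoming job is written immediately.
        conn.execute("INSERT OR REPLACE INTO queue VALUES (?, ?)", (key, data))
        conn.commit()

def done(conn, key, store_on_shutdown):
    # With --store-queue-on-shutdown, add() never wrote the job, so the
    # DELETE (and the -wal/-journal stat() calls sqlite makes for it)
    # can be skipped entirely.
    if store_on_shutdown:
        return
    conn.execute("DELETE FROM queue WHERE unique_key = ?", (key,))
    conn.commit()
```

In this model, the shutdown-only mode turns both per-job database operations into no-ops, which is exactly what the strace output suggests is missing.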

esabol commented 1 year ago

Is there any reason for this?

I imagine it's for fault tolerance (power goes out and you don't have a UPS, for example), but I don't use any of the persistence backends, nor was I involved with their design.

Any comments on this, @SpamapS ?

SpamapS commented 1 year ago

That's exactly why it flushes on done, and I wouldn't want to change the behavior of an existing command line switch.

But I think you could introduce a new switch, something like --skip-flush-on-done, if you're comfortable with data loss on failure, and are just doing this so you can restart without losing jobs.

But, a better plan is to just not use the queue plugins and switch to foreground jobs that handle their own persistence.


sni commented 1 year ago

Let me explain why I think it is not necessary. My use case is persistence across restarts/reboots; other than that, there is no need to store anything anywhere. My command line switches are therefore -q libsqlite3 --libsqlite3-db=.../var/gearmand.db --store-queue-on-shutdown.

So, as far as I understand, it works like this right now:

  1. gearmand starts and reads all jobs from the retention db.
  2. then loops:
     a. job comes in, stored in memory
     b. job finishes
     c. job gets removed from the db (but it was never stored there, because of store-queue-on-shutdown)
  3. gearmand shuts down and writes all remaining jobs to the db.

So there is no persistence at all, and I am fine with that. And step 2c is useless because the sqlite database is empty.

But I am happy to add a separate command line switch if that makes more sense, though in my opinion it adds no value and just increases confusion.

esabol commented 1 year ago

If there's really no job in the database to remove, then I find your point persuasive, @sni.

However, let's instead consider the scenario in which you have jobs queued and gearmand is shut down. In this scenario with --store-queue-on-shutdown, the jobs are written to the database and gearmand terminates. When gearmand restarts, it reads the jobs from the database. Does it remove them from the database when gearmand starts? I suspect it does not. I suspect they are removed as they are completed, and that seems sensible to me. (Again, please forgive my ignorance of how the persistence backends work, since I don't use them.)

So, assuming that's how it works, it seems to me that the ideal behavior with --store-queue-on-shutdown would be for gearmand to keep track of which jobs have been restored from the database when starting up and only remove those from the database upon completion. For any new jobs which haven't been stored in the database, it should not try to remove them from the database since they've never been stored in the database in the first place. Is that feasible? Consider the case when you have many thousands of jobs queued. That is probably what led to the design decision to just always try to remove the job from the database regardless of origin. @SpamapS's suggestion to add another command line switch to control this behavior seems like a reasonable compromise.
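The "track which jobs were restored" idea could look roughly like this. Again a toy Python/sqlite3 sketch under assumed names and schema, not gearmand code; the `restored` set is a hypothetical addition:

```python
import sqlite3

# Assumed retention-db schema; illustrative only.
SCHEMA = "CREATE TABLE IF NOT EXISTS queue (unique_key TEXT PRIMARY KEY, data BLOB)"

def startup(conn):
    conn.execute(SCHEMA)
    # Remember exactly which keys were restored from the retention db;
    # only those ever need a DELETE on completion.
    return {row[0] for row in conn.execute("SELECT unique_key FROM queue")}

def done(conn, key, restored):
    # Jobs queued after startup were never persisted, so skip the db for them.
    if key in restored:
        conn.execute("DELETE FROM queue WHERE unique_key = ?", (key,))
        conn.commit()
        restored.discard(key)
```

The memory cost is one set entry per restored job, which is the "many thousands of jobs queued" concern raised above.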

However, if gearmand reads the jobs from the database at startup and removes them from the database at the same time, then I think you are correct that it should not be trying to remove each job from the database upon completion.

sni commented 1 year ago

Ideally, IMHO, gearmand should truncate the database after startup, once all jobs have been imported back into memory. I don't think it makes sense to record whether a job came from the retention db or not; it should make no difference. Once started, gearmand should handle all jobs equally.

Testing this a bit, starting with a fresh (unpatched) gearmand:

  1. gearmand starts, creates an empty retention database.
  2. jobs come in, database still empty
  3. stopping gearmand, database now has the remaining jobs
  4. starting gearmand again, jobs are read into memory but not removed from the database
  5. jobs are being worked on (the initial jobs are removed from the database; new jobs are attempted to be removed, but they were never in the db at all)

So it seems like just disabling done() won't work; this would leave those jobs in the database. How about adding a truncate after startup, instead of trying to delete these jobs at runtime, which only succeeds for the first few jobs?

From my understanding, --store-queue-on-shutdown should make gearmand not touch the database at runtime at all, only at start/stop.
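The truncate-at-startup approach, sketched in the same toy Python/sqlite3 model (assumed schema and names, not the actual plugin code):

```python
import sqlite3

# Assumed retention-db schema; illustrative only.
SCHEMA = "CREATE TABLE IF NOT EXISTS queue (unique_key TEXT PRIMARY KEY, data BLOB)"

def startup_and_truncate(conn):
    conn.execute(SCHEMA)
    # Read every persisted job back into memory, then empty the table,
    # so no per-job DELETEs are ever needed at runtime.
    jobs = dict(conn.execute("SELECT unique_key, data FROM queue"))
    conn.execute("DELETE FROM queue")
    conn.commit()
    return jobs

def shutdown(conn, jobs):
    # The table was emptied at startup; just dump whatever is still queued.
    conn.executemany("INSERT INTO queue VALUES (?, ?)", jobs.items())
    conn.commit()
```

With this shape, the database is touched exactly twice per daemon lifetime (once on start, once on stop), matching the intent described above.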

SpamapS commented 1 year ago

So, TBH, I have always misunderstood store-on-shutdown because I dislike the queue plugins so much. I appreciate you highlighting how it actually works.

But the sqlite queue plugin is maybe the worst of all of them and falls into the list of things I think are a pretty bad idea. If you want durable queueing, gearmand is a bad choice. You are wasting gearman's scalability by using it in this way and almost anything else will do a better job of holding onto and distributing background messages that are important. Gearman's unique capabilities are only useful in foreground messages where two-way communication is important.

But yeah, if you add a truncate of the database after reading it on startup, and then remove the deletions in done() when in the same mode, that sounds like a better way to make this misguided feature work.


sni commented 1 year ago

I updated the pull request to reflect these ideas. Now, during normal runtime, no reads or writes are made to the database; only during start/stop is the queue dumped to or loaded from the database. So you get the high-performance throughput of running without a queue plugin while still being able to restart gearmand without data loss.

Also, I totally disagree: gearman does a great job on background jobs. I've never run a single foreground job since I started using gearman; all communication is asynchronous. I always thought this is the great value gearman adds. If I wanted direct foreground communication, I wouldn't need gearman at all and my services would simply talk to each other...

SpamapS commented 1 year ago

I'm glad you've found it useful. Enough people do that I have relented in my desire to remove them and just try to be supportive. But if you weren't already built on gearman, you'd likely find that there are many great async messaging options with wider support and intentional design around durability. They also most likely have more multiplexable protocols. These queue plugins were bolted onto Gearman as a way to help people who were stuck, low scale, and didn't want to invest in durability at the worker level. If that's you, hooray, I will not fight you on using them. I hope we can get your change merged.

However, I don't quite understand why folks are ok with losing all their messages to a crash, but not to a restart. The option exists, so you're maybe not the first to have this set of constraints and tolerances, but it just surprises me as it seems more complicated than just implementing durable messaging at the worker level, which scales better and is more resilient to single points of failure. You can see examples in Garivini and gearstore.

The main point of gearman is not direct comms between services either though. Of course if you have two services that just need to talk to each other you just go through a load balancer.

It is specifically to coalesce expensive requests with limited resources to prevent thundering herd. It works quite well for cache refresh when soft expiration isn't an option.

https://fewbar.com/2020/04/gearman-is-misunderstood/ explains it in detail.

Anyway, don't take my grumpiness about queue plugins as anything other than that. Thanks for bringing your energy and focus to the problem.