floraison / fugit

time tools (cron, parsing, durations, ...) for Ruby, rufus-scheduler, and flor
MIT License

endless loop dealing with cron "seconds" string #15

Closed jmettraux closed 5 years ago

jmettraux commented 5 years ago

@conet hijacked the closed issue gh-11, challenging the fix there, but provided nothing that supports his accusation.

In fact, there is a source issue report at: https://gitlab.com/gitlab-org/gitlab-ce/issues/59273

we had:

gitlab_rails['ldap_sync_worker_cron'] = "5 * * * * *"
gitlab_rails['ldap_group_sync_worker_cron'] = "10 * * * * *"

instead of:

gitlab_rails['ldap_sync_worker_cron'] = "5 * * * *"
gitlab_rails['ldap_group_sync_worker_cron'] = "10 * * * *"

Once the cron expression is fixed, CPU usage goes back to normal.

OP adds:

invalid cron expressions should not cause such behavior.

irb(main):001:0> require 'fugit'
=> true
irb(main):002:0> Fugit.parse('5 * * * * *')
=> #<Fugit::Cron:0x00007f91ba8d54c0
  @original="5 * * * * *", @cron_s=nil, @seconds=[5],
  @minutes=nil, @hours=nil, @monthdays=nil, @months=nil, @weekdays=nil,
  @zone=nil, @timezone=nil>
irb(main):003:0> Fugit.parse('10 * * * * *')
=> #<Fugit::Cron:0x00007f91ba844650 @original="10 * * * * *", @cron_s=nil, @seconds=[10],
  @minutes=nil, @hours=nil, @monthdays=nil, @months=nil, @weekdays=nil,
  @zone=nil, @timezone=nil>

In other words, "5 * * * * *" means "every time the second hand hits the 5".

Sidekiq-cron allows for second crons (see https://github.com/ondrejbartas/sidekiq-cron/pull/240). These are NOT invalid cron expressions.
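For illustration, here is a quick comparison of the five-field and six-field forms (a sketch: the plain attribute readers and the five-field seconds defaulting to [0] are assumptions, and the exact next_time values depend on when this is run):

require 'fugit'

five = Fugit.parse('5 * * * *')    # classic cron: minute 5 of every hour
six  = Fugit.parse('5 * * * * *')  # seconds cron: second 5 of every minute

five.minutes         # => [5]
five.seconds         # => [0]   (assumed default for five-field expressions)
six.seconds          # => [5]
six.minutes          # => nil   (nil meaning "every minute")

five.next_time.to_s  # some hh:05:00, at most an hour away
six.next_time.to_s   # some hh:mm:05, at most a minute away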

Questions to @conet:

Thanks in advance and best regards.

jmettraux commented 5 years ago

I have it running, I can see the logs exactly as you indicated.

I am all set up. Let's see.

jmettraux commented 5 years ago

@godfat @conet I can put my finger on it:

require 'fugit'  

cron = Fugit.parse('10 * * * * *')  

t = cron.previous_time.to_f # + 0.123 (some float x so that 0.0 <= x < 1.0)
  # generate a t that is on the target

cron.previous_time(Time.at(t)) 
  # feed it to `#previous_time`

It was right under my nose. #previous_time called within a "matching second".

I will fix that and put a better breaker system in place, just as discussed above.
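Something along these lines, not fugit's actual code, just a sketch of such a breaker with a made-up cap and helper name, leaning on Fugit::Cron#match? and a naive second-by-second walk:

require 'fugit'

MAX_STEPS = 366 * 24 * 3600  # hypothetical cap: one year of seconds

def previous_time_with_breaker(cron, from = Time.now)
  t = from
  MAX_STEPS.times do
    return t if cron.match?(t)  # Fugit::Cron#match? tests a single point in time
    t -= 1                      # walk back one second
  end
  raise "no previous time found within #{MAX_STEPS} steps before #{from}"
end

cron = Fugit.parse('10 * * * * *')
previous_time_with_breaker(cron)  # a Time, or a loud failure, never a silent loop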

Thanks to both of you for the patience and the guidance.

(Now I have to stop the GitLab instance I built. Funny: as I typed this comment, it just entered the "loop". I didn't need it; probably doing the install helped clear my head.)

conet commented 5 years ago

Now that you mention the match_second part, I do recall often seeing it at the top of the stack trace. I'm glad we got to the bottom of it, thanks for your support.

godfat commented 5 years ago

Wow, thank you both so much for getting to the bottom of this! Very impressive debugging process. Sorry I didn't try to jump in; apparently I didn't exactly follow the reasoning, and I still don't really understand what went wrong.

I can provide some general ideas though.

You're giving me another idea: a simple timeout, whatever the iteration count, whatever the distance in time, if it takes too long, let's fail.

@jmettraux In general I think breaking on an iteration count is better than a time-based approach though, because of what @conet said:

Yeah but that has the disadvantage that it might mean more or less iterations depending on various factors including current load so it might actually make this more unpredictable.

This can become very unpredictable. My favorite example is an old game, The 7th Guest: it has a mini-game whose AI was pretty weak when the game was released. Twenty years later the AI became unbeatable, because it only stopped searching for the best move based on elapsed time, and a computer twenty years more powerful could find the absolute best move (the game wasn't that complex anyway).

Besides, a timer requires another thread to watch it. If we have a global timer, that's fine resource-wise, but if we need a new timer for each calculation, that will double the number of threads we need. We could also check the elapsed time on each iteration, but again, heavy load, or suspending the execution and resuming it later, would interfere and might cause it to return too early, breaking the application under load. That would be very hard to debug unless we know it is time sensitive.
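Roughly, the time-based alternative under discussion would look like this (a hypothetical wrapper, not anything in fugit); Ruby's Timeout indeed spawns a watcher thread per call, which is part of the cost described above:

require 'timeout'
require 'fugit'

# Hypothetical wall-clock breaker: the same input can pass on an idle
# machine and fail under load, which is what makes it unpredictable.
def previous_time_with_deadline(cron, from = Time.now, budget_seconds = 5)
  Timeout.timeout(budget_seconds) do  # spawns a watcher thread for this call
    cron.previous_time(from)
  end
end

previous_time_with_deadline(Fugit.parse('10 * * * * *'))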

Something like that 41 years and a few days... there won't be a next or a previous time, but expressed as the number of iterations that would be enough to reach that state under normal circumstances.

Yes, if something like this can be done, that'll be ideal. It's like a sanity check based on complexity. I don't know cron and time well enough to know if this is feasible, though.
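One way to read that suggestion, with arbitrary placeholder numbers: derive the cap from the size of the window being scanned, so the bound reflects the complexity of the search rather than the speed of the machine:

# Hypothetical complexity-based cap: a correct search over a ~42-year window
# should never need more than a handful of steps per day scanned.
WINDOW_YEARS  = 42
STEPS_PER_DAY = 24 * 60  # generous: one candidate per minute

MAX_ITERATIONS = WINDOW_YEARS * 366 * STEPS_PER_DAY
# => 22_135_680, a fixed, load-independent bound; exceeding it signals a bug,
#    not a slow machine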

It was right under my nose. #previous_time called within a "matching second".

Is this because of floating point precision?

Thanks to both of you for the patience and the guidance.

@jmettraux Thank you so much for your dedication in getting to the bottom of this, and for being willing to install GitLab to try it out. I know that can be cumbersome. Please don't put too much stress on yourself though; it's already much appreciated if you can take a look.

@conet Thank you for providing all the logs and traces and for going through it patiently, especially when you're not familiar with Ruby. I know it takes a lot of effort to spin up an instance, prepare the patch, run it, wait for it, and finally collect the logs, especially when the issue might not reproduce at all.

jmettraux commented 5 years ago

Hello @conet,

if you have the time to confirm that the issue is fixed, that'd be great.

The fixed fugit has been running for 8 hours on my GitLab enterprise instance and I see no issues, no endless loops. (I just dropped the updated https://github.com/floraison/fugit/blob/b50bbfdb243f4ce4e8ce0d966eab25c6833ab594/lib/fugit/cron.rb file over the defective one).

If the fix is confirmed, I will release fugit 1.1.9.

Thanks in advance.

conet commented 5 years ago

Since I scrapped my test instance after you managed to reproduce it, I was looking today for an easier way to do it. If you have Docker installed, here's a one-liner:

docker run --detach --env GITLAB_OMNIBUS_CONFIG="gitlab_rails['stuck_ci_jobs_worker_cron'] = \"10 * * * * *\"" --name gitlab --volume ${PWD}/gitlab/config:/etc/gitlab --volume ${PWD}/gitlab/logs:/var/log/gitlab --volume ${PWD}/gitlab/data:/var/opt/gitlab gitlab/gitlab-ce:11.9.0-ce.0

Let it run a while and the endless loop will kick in (check CPU usage in the container). Now that I've managed to reproduce it this way, I will give your fix a try.

conet commented 5 years ago

After several hours running the above container with your fix, the endless loop hasn't kicked in. I'd say that the issue can be closed.

jmettraux commented 5 years ago

Hello @conet and @godfat,

I have released fugit 1.1.9.

Thank you very much! If there is anything wrong, please tell me!

Special thanks to @conet for fighting for his case and pushing fugit in the right direction!

godfat commented 5 years ago

Thank you all! I created a merge request to update Fugit in GitLab: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/26579