Closed jmettraux closed 5 years ago
I have it running, I can see the logs exactly as you indicated.
I am all set up. Let's see.
@godfat @conet I can put my finger on it:
```ruby
require 'fugit'

cron = Fugit.parse('10 * * * * *')

# generate a t that is on the target
t = cron.previous_time.to_f # + 0.123 (some float x so that 0.0 <= x < 1.0)

# feed it to #previous_time
cron.previous_time(Time.at(t))
```
It was right under my nose: `#previous_time` called within a "matching second".
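The hazard can be modeled in plain Ruby, without fugit (a toy sketch, not fugit's actual code; `naive_previous_match` is a hypothetical helper):

```ruby
# Toy model of a "previous matching second" search that walks
# backwards one second at a time, starting *at* the given time.
def naive_previous_match(from, target_sec)
  t = from
  t -= 1 until t.sec == target_sec
  t
end

off_match = Time.utc(2019, 4, 1, 12, 0, 15)
on_match  = Time.utc(2019, 4, 1, 12, 0, 10) # already on a matching second

p naive_previous_match(off_match, 10)             # steps back to second 10
p naive_previous_match(on_match, 10) == on_match  # => true: no progress made
```

When the search is entered from a time that already sits on a matching second, it returns that same instant, so a caller that keeps feeding the result back in never makes progress.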
I will fix that and I will place a better breaker system just as discussed above.
Thanks to both of you for the patience and the guidance.
(Now I have to stop the GitLab instance I built. Funny, as I typed this comment, it just entered the "loop". I didn't need it; doing the install probably helped clear my head.)
Now that you mention the match_second part, I do recall seeing it at the top of the stack trace often. I'm glad we got to the bottom of it, thanks for your support.
Wow, thank you both so much for getting to the bottom of this! Very impressive debugging process. Sorry I didn't try to jump in; apparently I didn't exactly follow the reasoning, and I still don't really understand what went wrong.
I can provide some general ideas though.
You're giving me another idea: a simple timeout, whatever the iteration count, whatever the distance in time, if it takes too long, let's fail.
@jmettraux In general I think breaking on an iteration count is better than a time-based approach, because of what @conet said:
> Yeah, but that has the disadvantage that it might mean more or fewer iterations depending on various factors, including current load, so it might actually make this more unpredictable.
This can become very unpredictable. My favorite example is an old game, The 7th Guest: it has a mini-game whose AI was pretty weak when the game was released. Twenty years later that AI had become unbeatable, because it only stopped searching for the best move based on elapsed time, and on a computer from 20 years later, with so much more computing power, it could find the absolute best move within the time limit (the game wasn't that complex anyway).
Besides, a timer requires another thread to watch it. If we have a global timer, that's fine resource-wise, but if we need a new timer for each calculation, that doubles the threads we need. We could also check the elapsed time on each iteration, but again, heavy load, or suspending the execution and resuming it later, will all interfere with it and might cause it to return too early, breaking the application under load. That would be very hard to debug unless we already know this spot is time-sensitive.
Something like that:

> 41 years and a few days... there won't be a next or a previous time

but expressed as the number of iterations that would be enough to reach that state under normal circumstances.
Yes, if something like this can be done that'll be ideal. It's like a sanity check, based on complexity. I don't know cron and time enough to know if this is feasible though.
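For what it's worth, an iteration-count breaker can be sketched in a few lines of plain Ruby (the names and the cap value here are hypothetical, not fugit's actual implementation):

```ruby
MAX_ITERATIONS = 1024 # hypothetical cap: "enough under normal circumstances"

# Runs the given search step up to `max` times; each step either returns
# a result (search done) or nil (keep going). Trips instead of spinning.
def search_with_breaker(max = MAX_ITERATIONS)
  max.times do |i|
    result = yield(i)        # perform one step of the search
    return result if result  # stop as soon as a step finds a match
  end
  raise RangeError, "no match after #{max} iterations (breaker tripped)"
end

# A search that succeeds on its fifth step:
p search_with_breaker { |i| i == 4 ? :found : nil }  # => :found

# A search that never succeeds trips the breaker instead of looping forever:
begin
  search_with_breaker(10) { |_i| nil }
rescue RangeError => e
  puts e.message
end
```

Unlike a wall-clock timeout, such a cap behaves the same on a loaded machine and on much faster future hardware, which sidesteps the unpredictability concern.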
> It was right under my nose: `#previous_time` called within a "matching second".
Is this because of floating point precision?
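For reference, this is what such precision loss looks like in plain Ruby (a generic illustration of Float round-tripping, not necessarily the exact mechanism inside fugit):

```ruby
# A Float (IEEE 754 double) carries ~15-16 significant decimal digits,
# and a modern Unix timestamp already uses ~10 of them before the
# decimal point, so sub-second detail can be rounded away.
t  = Time.at(1_554_120_010, 1, :nanosecond).utc # 1 ns past the second
f  = t.to_f                                     # the 1 ns is rounded away
t2 = Time.at(f).utc

p t.nsec   # => 1
p t2.nsec  # => 0 : the round-trip lands exactly on the second boundary
```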
Thanks to both of you for the patience and the guidance.
@jmettraux Thank you so much for your dedication in getting to the bottom of this, and for being willing to install GitLab to try it out. I know that can be cumbersome. Please don't put yourself under too much stress though; it's already much appreciated that you can take a look.
@conet Thank you for providing all the logs and traces and for patiently going through it, especially when you're not familiar with Ruby. I know it takes a lot of effort to spin up an instance, prepare the patch, run it, wait for it, and finally collect the logs, especially when it might not reproduce at all.
Hello @conet,
if you had the time to confirm that the issue is fixed, that'd be great.
The fixed fugit has been running for 8 hours on my GitLab enterprise instance and I see no issues, no endless loops. (I just dropped the updated https://github.com/floraison/fugit/blob/b50bbfdb243f4ce4e8ce0d966eab25c6833ab594/lib/fugit/cron.rb file over the defective one).
If the fix is confirmed, I will release fugit 1.1.9.
Thanks in advance.
Since I scrapped my test instance after you managed to reproduce it, I was looking today for an easier way to do it. If you have Docker installed, here's a one-liner:

```shell
docker run --detach \
  --env GITLAB_OMNIBUS_CONFIG="gitlab_rails['stuck_ci_jobs_worker_cron'] = \"10 * * * * *\"" \
  --name gitlab \
  --volume ${PWD}/gitlab/config:/etc/gitlab \
  --volume ${PWD}/gitlab/logs:/var/log/gitlab \
  --volume ${PWD}/gitlab/data:/var/opt/gitlab \
  gitlab/gitlab-ce:11.9.0-ce.0
```
Let it run a while and the endless loop will kick in (check CPU usage in the container). Now that I've managed to reproduce it this way, I will give your fix a try.
After several hours running the above container with your fix, the endless loop hasn't kicked in; I'd say the issue can be closed.
Hello @conet and @godfat,
I have released fugit 1.1.9.
Thank you very much! If there is anything wrong, please tell me!
Special thanks to @conet for fighting for his case and pushing fugit in the right direction!
Thank you all! I created a merge request to update Fugit in GitLab: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/26579
@conet hijacked the closed issue gh-11, challenging the fix there, but provided nothing that supports his accusation.
In fact, there is a source issue report at: https://gitlab.com/gitlab-org/gitlab-ce/issues/59273
OP adds:
In other words: "5 *" means "every time the second hand hits the 5".
Sidekiq-cron allows for second crons (see https://github.com/ondrejbartas/sidekiq-cron/pull/240). These are NOT invalid cron expressions.
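The six-field semantics can be illustrated with a toy matcher in plain Ruby (a sketch for illustration only, not fugit's or sidekiq-cron's parser; `seconds_field_matches?` is a hypothetical helper):

```ruby
# In a six-field expression ("sec min hour dom mon dow"), the first
# field constrains the seconds hand. A toy check for that field alone:
def seconds_field_matches?(field, time)
  field == '*' || field.split(',').map(&:to_i).include?(time.sec)
end

# "5 *" fires every time the second hand hits 5:
hits = (0..59).select { |s| seconds_field_matches?('5', Time.utc(2019, 1, 1, 0, 0, s)) }
p hits  # => [5]

# "10 *" (as in the GitLab schedule above) fires at second 10 of every minute:
p seconds_field_matches?('10', Time.utc(2019, 1, 1, 0, 0, 10))  # => true
```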
Questions to @conet :
you mention
as an endless loop. Do you mean that this warn trace repeats endlessly in your log?
Does the LDAP sync that your settings seem to imply actually happen or not? Does it happen every 5 / 10 seconds? Does it happen but on a different schedule? Or does it NOT happen at all?
Thanks in advance and best regards.