floraison / fugit

time tools (cron, parsing, durations, ...) for Ruby, rufus-scheduler, and flor
MIT License

endless loop dealing with cron "seconds" string #15

Closed. jmettraux closed this issue 5 years ago.

jmettraux commented 5 years ago

@conet hijacked the closed issue gh-11, challenging the fix there, but provided nothing that supports his accusation.

In fact, there is a source issue report at: https://gitlab.com/gitlab-org/gitlab-ce/issues/59273

we had:

gitlab_rails['ldap_sync_worker_cron'] = "5 * * * * *"
gitlab_rails['ldap_group_sync_worker_cron'] = "10 * * * * *"

instead of:

gitlab_rails['ldap_sync_worker_cron'] = "5 * * * *"
gitlab_rails['ldap_group_sync_worker_cron'] = "10 * * * *"

Once the cron expression is fixed, CPU usage goes back to normal.

OP adds:

invalid cron expressions should not cause such behavior.

irb(main):001:0> require 'Fugit'
=> true
irb(main):002:0> Fugit.parse('5 * * * * *')
=> #<Fugit::Cron:0x00007f91ba8d54c0
  @original="5 * * * * *", @cron_s=nil, @seconds=[5],
  @minutes=nil, @hours=nil, @monthdays=nil, @months=nil, @weekdays=nil,
  @zone=nil, @timezone=nil>
irb(main):003:0> Fugit.parse('10 * * * * *')
=> #<Fugit::Cron:0x00007f91ba844650 @original="10 * * * * *", @cron_s=nil, @seconds=[10],
  @minutes=nil, @hours=nil, @monthdays=nil, @months=nil, @weekdays=nil,
  @zone=nil, @timezone=nil>

In other words: "5 * * * * *" means "every time the second hand hits the 5".

Sidekiq-cron allows for second crons (see https://github.com/ondrejbartas/sidekiq-cron/pull/240). These are NOT invalid cron expressions.
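To make the field-count distinction concrete, here is a minimal stdlib-only sketch; the `cron_granularity` helper is hypothetical, not part of fugit or sidekiq-cron:

```ruby
# Hypothetical helper, NOT part of fugit: classifies a cron string by
# field count, mirroring the rule that a 6-field expression is
# seconds-first while a classic 5-field expression starts at minutes.
def cron_granularity(cron_s)
  case cron_s.split.size
  when 5 then :minutes  # "5 * * * *"   -> minute 5 of every hour
  when 6 then :seconds  # "5 * * * * *" -> second 5 of every minute
  else :invalid
  end
end

p cron_granularity('5 * * * *')    # => :minutes
p cron_granularity('5 * * * * *')  # => :seconds
```

Under this reading, the GitLab settings above were firing every minute (at seconds 5 and 10) instead of once per hour.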

Questions to @conet :

Thanks in advance and best regards.

godfat commented 5 years ago

Thank you for looking into this! (I can't reproduce the issue by myself)

jmettraux commented 5 years ago

@godfat you're welcome! If it's my code, I have to fix it :-)

conet commented 5 years ago

I just got online. I've managed to add a log statement in GitLab, in:

/opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/job.rb

to get the time value going into:

parsed_cron.previous_time(current_time).utc

which is causing this. I'm waiting for the results.

To answer the endless loop related question: by endless loop I mean that, based on successive stacktraces, it seems like the code is stuck in the loop in previous_time.

jmettraux commented 5 years ago

@conet thank you very much.

Could you please answer my second question?

@jmettraux wrote:

Does the LDAP sync work your settings seem to imply actually happen or not? Does it happen every 5 / 10 seconds? Does it happen but on a different schedule? Or is it not happening at all?

jmettraux commented 5 years ago

@conet's traces (a bit truncated):

.79856 2019-03-20T18:36:54.798Z 2977 TID-ot8g7o9m9 WARN: Thread TID-ot89u8ea9 scheduler
.79857 2019-03-20T18:36:54.798Z 2977 TID-ot8g7o9m9 WARN: gl/emb/lib/ruby/gems/2.5.0/gems/tzinfo-1.2.5/lib/tzinfo/time_or_datetime.rb:321:in `wrap'
.79857 gl/emb/lib/ruby/gems/2.5.0/gems/tzinfo-1.2.5/lib/tzinfo/timezone.rb:420:in `utc_to_local'
.79857 gl/emb/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:358:in `to_time'
.79858 gl/emb/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:203:in `block (2 levels) in <class:EoTime>'
.79858 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:76:in `block (2 levels) in <class:TimeCursor>'
.79858 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:119:in `sec_match?'
.79858 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:252:in `block in previous_time'
.79858 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:232:in `loop'
.79859 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:232:in `previous_time'
.79859 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/job.rb:561:in `not_past_scheduled_time?'
.79859 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/job.rb:23:in `block in should_enque?'
.79859 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-5.2.5/lib/sidekiq.rb:97:in `block in redis'
.79860 gl/emb/lib/ruby/gems/2.5.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:65:in `block (2 levels) in with'
.79860 gl/emb/lib/ruby/gems/2.5.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:64:in `handle_interrupt'
.79860 gl/emb/lib/ruby/gems/2.5.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:64:in `block in with'
.79860 gl/emb/lib/ruby/gems/2.5.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:61:in `handle_interrupt'
.79861 gl/emb/lib/ruby/gems/2.5.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:61:in `with'
.79861 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-5.2.5/lib/sidekiq.rb:94:in `redis'
.79861 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/job.rb:21:in `should_enque?'
.79862 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/job.rb:42:in `test_and_enque_for_time!'
.79862 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/poller.rb:28:in `enqueue_job'
.79862 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/poller.rb:15:in `block in enqueue'
.79863 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/poller.rb:14:in `each'
.79863 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/poller.rb:14:in `enqueue'
.79863 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-5.2.5/lib/sidekiq/scheduled.rb:68:in `block in start'
.79863 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-5.2.5/lib/sidekiq/util.rb:16:in `watchdog'
.79863 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-5.2.5/lib/sidekiq/util.rb:25:in `block in safe_thread'
.46726 2019-03-20T21:47:32.466Z 5353 TID-oxdlhwzy1 WARN: gl/emb/lib/ruby/gems/2.5.0/gems/activesupport-5.0.7.1/lib/active_support/duration.rb:47:in `==='
.46727 gl/emb/lib/ruby/gems/2.5.0/gems/activesupport-5.0.7.1/lib/active_support/core_ext/time/calculations.rb:236:in `plus_with_duration'
.46727 gl/emb/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:129:in `utc'
.46727 gl/emb/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:358:in `to_time'
.46727 gl/emb/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:203:in `block (2 levels) in <class:EoTime>'
.46728 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:76:in `block (2 levels) in <class:TimeCursor>'
.46728 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:119:in `sec_match?'
.46728 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:252:in `block in previous_time'
.46728 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:232:in `loop'
.46729 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:232:in `previous_time'
.46729 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/job.rb:561:in `not_past_scheduled_time?'
2019-03-20_21:58:36.80890 2019-03-20T21:58:36.808Z 5353 TID-oxdlhwzy1 WARN: gl/emb/lib/ruby/gems/2.5.0/gems/activesupport-5.0.7.1/lib/active_support/core_ext/time/calculations.rb:239:in `+'
2019-03-20_21:58:36.80891 gl/emb/lib/ruby/gems/2.5.0/gems/activesupport-5.0.7.1/lib/active_support/core_ext/time/calculations.rb:239:in `plus_with_duration'
2019-03-20_21:58:36.80891 gl/emb/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:129:in `utc'
2019-03-20_21:58:36.80891 gl/emb/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:358:in `to_time'
2019-03-20_21:58:36.80892 gl/emb/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:203:in `block (2 levels) in <class:EoTime>'
2019-03-20_21:58:36.80892 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:76:in `block (2 levels) in <class:TimeCursor>'
2019-03-20_21:58:36.80892 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:119:in `sec_match?'
2019-03-20_21:58:36.80893 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:252:in `block in previous_time'
2019-03-20_21:58:36.80893 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:232:in `loop'
2019-03-20_21:58:36.80894 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:232:in `previous_time'
2019-03-20_21:58:36.80894 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/job.rb:561:in `not_past_scheduled_time?'
2019-03-20_21:58:36.80894 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/job.rb:23:in `block in should_enque?'

@conet, are you somehow killing the thread to get those WARN messages? The outcome is different each time (I think I start to understand your "it seems in a loop" statement).

jmettraux commented 5 years ago

Meanwhile, I built a small project at https://github.com/jmettraux/fugit_gh15 to try and reproduce the issue.

So far, I have tried to feed an ActiveSupport-enhanced Time to #previous_time, but no loop went out of control.

Looking forward to @conet's log results.

conet commented 5 years ago

Sorry for the delay, here are my results. @godfat, I will need your help to translate the GitLab-specific data into fugit-specific input so that @jmettraux can work with it. First, this is the logging I added to GitLab's should_enque?:

      #crucial part of whole enquing job
      def should_enque? time
        enqueue = false
        @time_str = time.strftime(LAST_ENQUEUE_TIME_FORMAT)
        logger.warn { "should_enque: #{@time_str}" }
        enqueue = Sidekiq.redis do |conn|
          status == "enabled" &&
            not_past_scheduled_time?(time) &&
            not_enqueued_after?(time) &&
            conn.zadd(job_enqueued_key, formated_enqueue_time(time), formated_last_time(time))
        end
        enqueue
      end

I restarted GitLab after the change. It ran without issues for a couple of minutes, then entered the endless loop, proven this time by the logs:

2019-03-21_08:33:05.75627 2019-03-21T08:33:05.756Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.75793 2019-03-21T08:33:05.757Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.76032 2019-03-21T08:33:05.759Z 29013 TID-ovgxn8kq5 LdapAllGroupsSyncWorker JID-4102ed0f02f8c832a3f2b3e4 INFO: start
2019-03-21_08:33:05.76533 2019-03-21T08:33:05.765Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.76710 2019-03-21T08:33:05.767Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.76857 2019-03-21T08:33:05.768Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.76933 2019-03-21T08:33:05.769Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.77176 2019-03-21T08:33:05.771Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.77327 2019-03-21T08:33:05.773Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.77441 2019-03-21T08:33:05.774Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.77558 2019-03-21T08:33:05.775Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.77670 2019-03-21T08:33:05.776Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.77811 2019-03-21T08:33:05.778Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.77979 2019-03-21T08:33:05.779Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.78125 2019-03-21T08:33:05.781Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.78223 2019-03-21T08:33:05.782Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.78332 2019-03-21T08:33:05.783Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.78403 2019-03-21T08:33:05.784Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.78503 2019-03-21T08:33:05.784Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.78605 2019-03-21T08:33:05.786Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.78708 2019-03-21T08:33:05.787Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.78948 2019-03-21T08:33:05.789Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.79007 2019-03-21T08:33:05.790Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.86100 2019-03-21T08:33:05.771Z 29013 TID-ovgzfnhv9 UpdateAllMirrorsWorker JID-97db5164488292f394f70d41 INFO: start
2019-03-21_08:33:05.86361 2019-03-21T08:33:05.789Z 29013 TID-ovgxlnqdh Geo::SidekiqCronConfigWorker JID-83b3af2c2e5ddf868978583e INFO: start
2019-03-21_08:33:05.86423 2019-03-21T08:33:05.765Z 29013 TID-ovgxn8kq5 LdapAllGroupsSyncWorker JID-4102ed0f02f8c832a3f2b3e4 INFO: Started LDAP group sync
2019-03-21_08:33:06.06309 2019-03-21T08:33:06.061Z 29013 TID-ovgzfnhv9 UpdateAllMirrorsWorker JID-97db5164488292f394f70d41 INFO: done: 0.29 sec
2019-03-21_08:33:06.06425 2019-03-21T08:33:06.063Z 29013 TID-ovgxlnqdh Geo::SidekiqCronConfigWorker JID-83b3af2c2e5ddf868978583e INFO: done: 0.274 sec
2019-03-21_08:33:10.87072 2019-03-21T08:33:10.869Z 29013 TID-ovgxn8kq5 LdapAllGroupsSyncWorker JID-4102ed0f02f8c832a3f2b3e4 INFO: done: 0.511 sec

By proven I mean that nothing is logged by sidekiq after that... now when I dump stack traces I get:

2019-03-21_10:39:44.36153 2019-03-21T10:39:44.360Z 29013 TID-ovh3gnxw5 WARN: /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:292:in `initialize_dup'
2019-03-21_10:39:44.36154 /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:292:in `dup'
2019-03-21_10:39:44.36154 /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:292:in `inc'
2019-03-21_10:39:44.36154 /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:223:in `+'
2019-03-21_10:39:44.36154 /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:79:in `inc'
2019-03-21_10:39:44.36154 /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:247:in `block in previous_time'
2019-03-21_10:39:44.36155 /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:232:in `loop'
2019-03-21_10:39:44.36155 /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:232:in `previous_time'
2019-03-21_10:39:44.36155 /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/job.rb:563:in `not_past_scheduled_time?'
2019-03-21_10:39:44.36155 /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/job.rb:25:in `block in should_enque?'

This means that the scheduler is stopped by this and the task does not run anymore (to answer the other question). @godfat, can you help me with the equivalent of this in GitLab's environment?

bundle exec ruby -r et-orbi -e "EtOrbi._make_info"

godfat commented 5 years ago

@conet I am in the middle of something else, but I can help with this quickly:

bundle exec ruby -r et-orbi -e "EtOrbi._make_info"

Run this to start a console: sudo gitlab-rails console and then run:

EtOrbi._make_info

which should print the results.

conet commented 5 years ago

-------------------------------------------------------------------------------------
 GitLab:       11.8.3-ee (822e412ad9a)
 GitLab Shell: 8.4.4
 postgresql:   9.6.11
-------------------------------------------------------------------------------------
Loading production environment (Rails 5.0.7.1)
irb(main):001:0> EtOrbi._make_info
(secs:1553165167.8255365,utc~:"2019-03-21 10:46:07.8255364894866943",ltz~:"UTC")
(etz:":/etc/localtime",tnz:"CET",tziv:"1.2.5",tzidv:nil,rv:"2.5.3",rp:"x86_64-linux",win:false,rorv:"5.0.7.1",astz:[ActiveSupport::TimeZone, "Etc/UTC"],eov:"1.1.7",eotnz:#<TZInfo::Data
Timezone: Etc/UTC>,eotnfz:"+0000",eotlzn:"Etc/UTC",eotnfZ:"UTC",debian:"Europe/Zurich",centos:nil,osx:"Europe/Zurich")
=> nil
irb(main):002:0>

conet commented 5 years ago

@conet, are you somehow killing the thread to get those WARN messages? The outcome is different each time (I think I start to understand your "it seems in a loop" statement).

Yes, that is sidekiq's way of getting a stacktrace dumped to the logs, by running:

kill -TTIN <sidekiq_pid>
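For context, a TTIN-style dump can be sketched in plain Ruby: walk every live thread and print its backtrace. This is a simplified illustration of the idea, not Sidekiq's actual handler:

```ruby
# Simplified sketch of a TTIN-style thread dump, NOT Sidekiq's actual
# handler: print each live thread's status and current backtrace,
# which lets you see where a stuck thread is spinning without killing
# the process.
def dump_threads(io = $stdout)
  Thread.list.each do |thread|
    io.puts "Thread #{thread.object_id} (#{thread.status})"
    (thread.backtrace || []).each { |line| io.puts "  #{line}" }
  end
end

dump_threads
```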

jmettraux commented 5 years ago

@conet Thanks for the great feedback! Working on digesting it.

jmettraux commented 5 years ago

@conet, could you please try again with this updated snippet:

#crucial part of whole enqueueing job
def should_enque? time

  logger.warn {

    s = time.strftime(LAST_ENQUEUE_TIME_FORMAT)
    c = time.class
    n = Time.now.utc

    " /// should_enque() #{time.inspect} (#{c}/#{c.ancestors.inspect})" +
    " / #{s.inspect} / now: #{n.inspect} (d #{n - time}s)" +
    " / let: #{@last_enqueue_time.inspect}" }

  Sidekiq.redis { |conn|

    status == 'enabled' &&
    not_past_scheduled_time?(time) &&
    not_enqueued_after?(time) &&
    conn.zadd(job_enqueued_key, formated_enqueue_time(time), formated_last_time(time)) }
end

(sorry for the reformatting, I need to wrap my head around sidekiq-cron's API)

conet commented 5 years ago

I applied the patch and am waiting for the loop to kick in; usually it takes a couple of minutes.

conet commented 5 years ago

It took a bit longer but here it is:

2019-03-21_11:57:05.48766 2019-03-21T11:57:05.487Z 19536 TID-or1c33vzs WARN:  /// should_enque() 2019-03-21 11:57:05 UTC (Time/[Time, DateAndTime::Compatibility, DateAndTime::Calculati
ons, DateAndTime::Zones, Comparable, NakayoshiFork::Behavior, ActiveSupport::ToJsonWithActiveSupportEncoder, Object, GettextI18nRails::HtmlSafeTranslations, FastGettext::Translation, E
RB::Util, PP::ObjectMixin, ActiveSupport::Dependencies::Loadable, ActiveSupport::Tryable, JSON::Ext::Generator::GeneratorMethods::Object, Kernel, BasicObject]) / "2019-03-21 11:57:05 +
0000" / now: 2019-03-21 11:57:05 UTC (d 0.304640802s) / let: 2019-03-21 11:56:32 UTC

I assume that only the last entry is important; if you want more, please let me know.

jmettraux commented 5 years ago

A few early observations:

I restarted GitLab after the change. It ran without issues for a couple of minutes, then entered the endless loop, proven this time by the logs:

2019-03-21_08:33:05.75627 2019-03-21T08:33:05.756Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.75793 2019-03-21T08:33:05.757Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.76032 2019-03-21T08:33:05.759Z 29013 TID-ovgxn8kq5 LdapAllGroupsSyncWorker JID-4102ed0f02f8c832a3f2b3e4 INFO: start
2019-03-21_08:33:05.76533 2019-03-21T08:33:05.765Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.76710 2019-03-21T08:33:05.767Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
...

The endless loop is not the one exposed in gh-11 / gh-13. That loop (https://github.com/floraison/fugit/blob/f1e2d002a6f59ca01c87744f2f3c78dc1931f86a/lib/fugit/cron.rb#L186-L215) is happening in #next_time while your case occurs around #previous_time.

As suggested by @godfat, there is now a breaker in the #next_time loop. At the same time, a breaker was added in the #previous_time loop (https://github.com/floraison/fugit/blob/f1e2d002a6f59ca01c87744f2f3c78dc1931f86a/lib/fugit/cron.rb#L232-L254). The breaker never triggered for you. No re-opening of gh-11 / gh-13.

The log proves that Sidekiq-Cron is calling #should_enque repeatedly, with the same time value :-(

I've looked at Sidekiq-Cron (https://github.com/ondrejbartas/sidekiq-cron/blob/a9dfab248ca7/lib/sidekiq/cron/job.rb and https://github.com/ondrejbartas/sidekiq-cron/blob/a9dfab248ca7/lib/sidekiq/cron/poller.rb) and I don't see yet how this loop comes to be.

I think the impression you got that the problem comes from #previous_time arises because this method is probably the costliest of the ones used in #should_enque, so your kills exposed it repeatedly.
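For readers following along, the general shape of such a backwards search can be sketched like this: a toy, seconds-only illustration with an assumed iteration cap, not fugit's actual #previous_time:

```ruby
# Toy sketch, NOT fugit's code: walk a time cursor backwards one
# second at a time until the seconds field matches, with a breaker so
# an unmatchable target cannot spin forever.
MAX_STEPS = 1_000_000  # assumed breaker threshold for this sketch

def previous_second_match(target_sec, from)
  t = from
  MAX_STEPS.times do
    return t if t.sec == target_sec  # cron field matches, done
    t -= 1                           # step the cursor back one second
  end
  raise "too many loops, breaking out"
end

p previous_second_match(5, Time.utc(2019, 3, 21, 8, 33, 12))
# => 2019-03-21 08:33:05 UTC
```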

jmettraux commented 5 years ago

@conet,

would you mind testing again with the following patch?


#crucial part of whole enqueueing job
def should_enque? time

  logger.warn {

    s = time.strftime(LAST_ENQUEUE_TIME_FORMAT)
    c = time.class
    n = Time.now.utc

    " /// should_enque() #{time.inspect} (#{c}/#{c.ancestors.inspect})" +
    " / #{s.inspect} / now: #{n.inspect} (d #{n - time}s)" +
    " / let: #{@last_enqueue_time.inspect}" }

  if time == @__previous_should_enque_time
    logger.warn {
      "  \\\\\\ should_enque() called once again with the same time :-(\n" +
      caller.collect { |l| "  \\ #{l}" }.join("\n") }
  else
    @__previous_should_enque_time = time
  end

  Sidekiq.redis { |conn|

    status == 'enabled' &&
    not_past_scheduled_time?(time) &&
    not_enqueued_after?(time) &&
    conn.zadd(job_enqueued_key, formated_enqueue_time(time), formated_last_time(time)) }
end

Thanks in advance.

jmettraux commented 5 years ago

(I'd need two complete caller stacks as output in the log; I want to know what calls this #should_enque repeatedly, or maybe my hypothesis does not stand)

conet commented 5 years ago

I applied the patch and restarted; waiting for the loop to kick in. In the meantime:

grep "called once again with the same time" /var/log/gitlab/sidekiq/current

yields 0 results.

jmettraux commented 5 years ago

@conet wrote:

in the mean time (...) yields 0 results.

Please correct me if I am wrong.

As I understand your feedback, your cron job works for a while, until the 'loop' kicks in and #should_enque starts repeating itself in the WARN section of the log. The other jobs remain unaffected.

(What I can't explain myself is why your logs above only show the #should_enque for TID-ovgxgte1x, but that might be because you're only showing the repeat section of the log)

conet commented 5 years ago

The loop is back, this gives no results:

grep "called once again with the same time" /var/log/gitlab/sidekiq/current

The last entry was:

2019-03-21_13:53:05.19952 2019-03-21T13:53:05.199Z 5133 TID-or644fx7x WARN:  /// should_enque() 2019-03-21 13:53:05 UTC (Time/[Time, DateAndTime::Compatibility, DateAndTime::Calculatio
ns, DateAndTime::Zones, Comparable, NakayoshiFork::Behavior, ActiveSupport::ToJsonWithActiveSupportEncoder, Object, GettextI18nRails::HtmlSafeTranslations, FastGettext::Translation, ER
B::Util, PP::ObjectMixin, ActiveSupport::Dependencies::Loadable, ActiveSupport::Tryable, JSON::Ext::Generator::GeneratorMethods::Object, Kernel, BasicObject]) / "2019-03-21 13:53:05 +0
000" / now: 2019-03-21 13:53:05 UTC (d 0.086686619s) / let: 2019-03-21 13:52:18 UTC

Several consecutive stack traces (extracted only the scheduler thread): sidekiq.log

conet commented 5 years ago

As I understand your feedback, your cron job works for a while, until the 'loop' kicks in and #should_enque starts repeating itself in the WARN section of the log. The other jobs remain unaffected.

(What I can't explain myself is why your logs above only show the #should_enque for TID-ovgxgte1x, but that might be because you're only showing the repeat section of the log)

No, from what I understand TID-ovgxgte1x is the scheduler thread which, when operating normally, calls should_enque for each job it has (thus producing multiple log entries). But once this loop kicks in, the scheduler thread never gets past it, so all job processing stops. That is why I thought that only the last log entry is important, since that is the one that triggers the endless loop.
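The single-thread dynamic described above can be sketched as follows (a hypothetical toy poller, not sidekiq-cron's real code): the scheduler walks all jobs sequentially, so one check that never returns starves every job after it.

```ruby
# Hypothetical toy poller, NOT sidekiq-cron's actual code: jobs are
# checked sequentially on one thread, so a check that never returns
# for one job stops the whole enqueue pass.
class TinyPoller
  def initialize(jobs)
    @jobs = jobs  # each job: { name: ..., check: ->(time) { ... } }
  end

  def enqueue(time)
    enqueued = []
    @jobs.each do |job|
      # If this call loops forever, no later job is ever considered.
      enqueued << job[:name] if job[:check].call(time)
    end
    enqueued
  end
end

jobs = [
  { name: 'update_all_mirrors', check: ->(_t) { true } },
  { name: 'ldap_group_sync',    check: ->(_t) { false } }
]
p TinyPoller.new(jobs).enqueue(Time.now)  # => ["update_all_mirrors"]
```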

jmettraux commented 5 years ago

@conet wrote:

but once this loop kicks in, the scheduler thread never gets past it, so all job processing stops. That is why I thought that only the last log entry is important, since that is the one that triggers the endless loop.

2019-03-21_08:33:05.78708 2019-03-21T08:33:05.787Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.78948 2019-03-21T08:33:05.789Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.79007 2019-03-21T08:33:05.790Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.86100 2019-03-21T08:33:05.771Z 29013 TID-ovgzfnhv9 UpdateAllMirrorsWorker JID-97db5164488292f394f70d41 INFO: start
2019-03-21_08:33:05.86361 2019-03-21T08:33:05.789Z 29013 TID-ovgxlnqdh Geo::SidekiqCronConfigWorker JID-83b3af2c2e5ddf868978583e INFO: start
2019-03-21_08:33:05.86423 2019-03-21T08:33:05.765Z 29013 TID-ovgxn8kq5 LdapAllGroupsSyncWorker JID-4102ed0f02f8c832a3f2b3e4 INFO: Started LDAP group sync
2019-03-21_08:33:06.06309 2019-03-21T08:33:06.061Z 29013 TID-ovgzfnhv9 UpdateAllMirrorsWorker JID-97db5164488292f394f70d41 INFO: done: 0.29 sec
2019-03-21_08:33:06.06425 2019-03-21T08:33:06.063Z 29013 TID-ovgxlnqdh Geo::SidekiqCronConfigWorker JID-83b3af2c2e5ddf868978583e INFO: done: 0.274 sec
2019-03-21_08:33:10.87072 2019-03-21T08:33:10.869Z 29013 TID-ovgxn8kq5 LdapAllGroupsSyncWorker JID-4102ed0f02f8c832a3f2b3e4 INFO: done: 0.511 sec

To me, it seems that this log of yours shows that job processing happens after the #should_enque loop kicks in.

conet commented 5 years ago

To me, it seems that this log of yours shows that job processing happens after the #should_enque loop kicks in.

Well, those are the last instances that the scheduler was able to launch; after those, no more jobs are launched.

jmettraux commented 5 years ago

OK, so I am completely wrong. Argh.

conet commented 5 years ago

Shouldn't a log statement that prints everything you need to set this up in a minimal demo be enough? I assume that if we have all the parameters involved in the previous_time call, it should be reproducible with a single statement, right? Could you help me with such a log statement? I assumed that the first patch you sent did that, but it seems it didn't.

jmettraux commented 5 years ago

Sorry, I was following a wrong track. I am preparing a new patch.

jmettraux commented 5 years ago

@conet is your Sidekiq-Cron setup logging errors happening in your cron jobs?

jmettraux commented 5 years ago

Could you please try with this:

#crucial part of whole enqueueing job
def should_enque? time

  logger.warn {

    s = time.strftime(LAST_ENQUEUE_TIME_FORMAT)
    c = time.class
    n = Time.now.utc

    " /// should_enque() #{time.inspect} (#{c}/#{c.ancestors.inspect})" +
    " / #{name.inspect} / #{cron.inspect}" +
    " / #{s.inspect} / now: #{n.inspect} (d #{n - time}s)" +
    " / let: #{@last_enqueue_time.inspect}" }

  status == 'enabled' &&
  not_past_scheduled_time?(time) &&
  not_enqueued_after?(time) &&
  Sidekiq.redis { |conn|
    conn.zadd(job_enqueued_key, formated_enqueue_time(time), formated_last_time(time)) }

rescue Exception => exc
  logger.error {
    " \\\\\\ should_enque() failed: #{exc.inspect}\n" +
    " \\ name: #{name.inspect} \\ cron: #{cron.inspect}\n" +
    " \\ parsed cron: #{parsed_cron.inspect}\n" +
    " \\ trace:\n" + caller.collect { |l| "  \\ #{l}" }.join("\n") }
  raise
end

conet commented 5 years ago

@conet is your Sidekiq-Cron setup logging errors happening in your cron jobs?

Yes it does, but the jobs themselves run on a different thread so they should not influence the scheduler.

I'm running your latest version, waiting to see if the issue occurs so that I can send the new logs.

conet commented 5 years ago

Endless loop, no failures:

grep 'should_enque() failed' /var/log/gitlab/sidekiq/current

No results; the last entry is:

2019-03-21_15:43:10.47610 2019-03-21T15:43:10.476Z 19788 TID-ouz6ebs6o WARN:  /// should_enque() 2019-03-21 15:43:10 UTC (Time/[Time, DateAndTime::Compatibility, DateAndTime::Calculati
ons, DateAndTime::Zones, Comparable, NakayoshiFork::Behavior, ActiveSupport::ToJsonWithActiveSupportEncoder, Object, GettextI18nRails::HtmlSafeTranslations, FastGettext::Translation, E
RB::Util, PP::ObjectMixin, ActiveSupport::Dependencies::Loadable, ActiveSupport::Tryable, JSON::Ext::Generator::GeneratorMethods::Object, Kernel, BasicObject]) / "ldap_group_sync_worke
r" / "10 * * * * *" / "2019-03-21 15:43:10 +0000" / now: 2019-03-21 15:43:10 UTC (d 0.227103581s) / let: 2019-03-21 15:42:30 UTC

jmettraux commented 5 years ago

@conet, do we agree that your "ldap_group_sync_worker" job triggers a few times, then goes into the loop?

conet commented 5 years ago

@conet, do we agree that your "ldap_group_sync_worker" job triggers a few times, then goes into the loop?

The job does fire a few times, but it's not the job that has the loop; it's the scheduler that calls previous_time, which belongs to this project. No matter how buggy the library's clients are, they shouldn't be able to trigger an endless loop in the library. It might be that there is no endless loop but many calls to previous_time, but I don't see that based on the logs and the code. Then again, I have no Ruby expertise, I just want GitLab working properly.

jmettraux commented 5 years ago

The job does fire a few times, but it's not the job that has the loop; it's the scheduler that calls previous_time, which belongs to this project. No matter how buggy the library's clients are, they shouldn't be able to trigger an endless loop in the library. It might be that there is no endless loop but many calls to previous_time, but I don't see that based on the logs and the code. Then again, I have no Ruby expertise, I just want GitLab working properly.

I do understand that the scheduler thread ends up in the loop, but the loop happens in a method in the job's cron instance. When the scheduler tries to determine whether it has to enqueue the job, it queries the job about that; it doesn't trigger the job.

What I am suspecting is #previous_time running OK most of the time, then choking on something. The choke probability seems to increase when you go from "5 * * * *" to "5 * * * * *", as your first few comments at GitLab seem to indicate.

I am trying to extract facts from your feedback, hence my questions and my poking. You have to understand that we both have to do our best to communicate.

I am no GitLab employee or contractor. I am just a freelancer who has spent 8 hours of a holiday debugging this issue because I care about my tiny open source library.

Shouldn't a log statement which prints everything you need to set this up in a minimal demo be enough? I assume that if we have all parameters involved in the previous_time call then it should be reproducible with a single statement, right? Could you help me with this log statement, I assumed that the first patch you sent did that but it seems like it didn't.

Indeed, something like that:

#require 'active_support/time'
#require 'active_support/time_with_zone'

p Time.ancestors
  #
  # On @conet's platform it looks like:
  #
  # [ Time, DateAndTime::Compatibility, DateAndTime::Calculations,
  #    DateAndTime::Zones, Comparable, NakayoshiFork::Behavior,
  #    ActiveSupport::ToJsonWithActiveSupportEncoder, Object,
  #    GettextI18nRails::HtmlSafeTranslations, FastGettext::Translation,
  #    ERB::Util, PP::ObjectMixin, ActiveSupport::Dependencies::Loadable,
  #    ActiveSupport::Tryable, JSON::Ext::Generator::GeneratorMethods::Object,
  #    Kernel, BasicObject ]

require 'fugit'; p Fugit::VERSION

cron = Fugit.parse('5 * * * * *')
#cron = Fugit.parse('10 * * * * *')

p cron.previous_time(Time.now.utc)

I'd love to know how Rails, Sidekiq and Sidekiq-cron do it to add all those mixin modules to the Time class. On my system it looks like [Class, Module, Object, ActiveSupport::Tryable, Kernel, BasicObject]

Tomorrow morning, I'll try to save some time from my day job to work on this. I will start with the latest data you gave me above and try to augment the extensions to simulate the Rails environment.

An interesting question would be to know whether the included class ancestors shown above are the same with GitLab 11.8.2 and 11.8.3.

I also would love to know if you can confirm that GitLab 11.8.3 really rids you of the endless loop.

The worst case scenario for me? Having to install GitLab and try for myself (by the way, what operating system are you working on?). That sounds like a lot of work. I would much prefer having just a simple script that proves that my code goes into an endless loop.

Now, it's time for me to go to bed. I hope the night and the rest will dissipate my confusion and provide me with inspiration.

Best regards.

Thanks to @godfat for his guidance and patience!

godfat commented 5 years ago

Thank both of you trying to get to the bottom of this!

This really looks bizarre. I did try to look at the relevant code mentioned in https://github.com/floraison/fugit/issues/15#issuecomment-475229163 and I almost wanted to ask if it keeps re-enqueuing the same thing over and over, but after looking at it more closely I don't see why that would happen, so in the end I didn't ask.

I actually started to think it might be a bug in sidekiq-cron, but I don't really have a clue yet. I am not really familiar with sidekiq either.

An interesting question would be to know if the included class ancestors shown above are the same with GitLab 11.8.2 and 11.8.3.

@jmettraux It should be the same. Here's the diff between 11.8.2 and 11.8.3; you can see that there's actually only one line changed, and that shouldn't have anything to do with this. I think this might be a very very rare edge case...

Here's when GitLab completely moved to Fugit 1.1.7. It was introduced in 11.8 which was released on 2019-02-22.

Or maybe it's really a bug in sidekiq-cron. I'll check the version updates in GitLab for sidekiq-cron tomorrow. Time to sleep for me, too.

conet commented 5 years ago

This is how it looks on 11.8.3:

sudo gitlab-rails console
-------------------------------------------------------------------------------------
 GitLab:       11.8.3-ee (822e412ad9a)
 GitLab Shell: 8.4.4
 postgresql:   9.6.11
-------------------------------------------------------------------------------------
Loading production environment (Rails 5.0.7.1)
irb(main):001:0> p Time.class.ancestors
[Class, StateMachines::MacroMethods, Module, ActiveSupport::Dependencies::ModuleConstMissing, Module::Concerning, NakayoshiFork::Behavior, ActiveSupport::ToJsonWithActiveSupportEncoder, Object, GettextI18nRails::HtmlSafeTranslations, FastGettext::Translation, ERB::Util, PP::ObjectMixin, ActiveSupport::Dependencies::Loadable, JSON::Ext::Generator::GeneratorMethods::Object, ActiveSupport::Tryable, Kernel, BasicObject]

conet commented 5 years ago

I have to tell you that I'm in almost the same position, devoting free time to this just to make sure others will not go through the same problems as I did, with the only difference that I don't maintain any of the involved projects.

Remember, this was solved for me with the upgrade to 11.8.3-ee, and based on the monitoring data it appeared around March 14th (I'm not sure whether that's because I upgraded to 11.8.2-ee from 11.8.1, or because I enabled LDAP sync), by changing the cron from 5 * * * * * to 5 * * * *. So if you think that this is an edge case or something like that, fine with me, it might really be so. But I can definitely say that it's there, and if it happened to me it probably happened to others; others just might not see it that fast. I was lucky to be running GitLab on a 1 CPU machine (I was developing infra as code).

About the OS: this is running on GCP using a fresh ubuntu-minimal-1804-bionic-v20190320 image with GitLab installed. When I got feedback this morning, all I had to do to reproduce it was to set up a new instance running 11.8.2-ee using one of the snapshots of our data disk. But I'm really not willing to spend more time digging into data.

conet commented 5 years ago

Please help me with some log statements to debug what is going on in cron.rb's previous_time: log iterations, dump state, whatever helps; there is no such thing as a loop that goes on forever without a reason. Better to look directly at the core of the problem than to scratch at the edges.

conet commented 5 years ago

I just updated my comment: it wasn't 11.8.3 that fixed things but changing the cron's precision; I was already running 11.8.3, see this comment. It might be that the scheduler does not support jobs which take longer than the scheduled interval.

jmettraux commented 5 years ago

OK, that's great, let's go on working together.

Remember, this was solved for me with the upgrade to 11.8.3-ee, and based on the monitoring data it appeared around March 14th (I'm not sure whether that's because I upgraded to 11.8.2-ee from 11.8.1, or because I enabled LDAP sync), by changing the cron from 5 * * * * * to 5 * * * *.

Sorry, I can only remember what is written here and at https://gitlab.com/gitlab-org/gitlab-ce/issues/59273. Thanks for clarifying that because I was working with the "5 then 4 or 5 stars" on 11.8.2-ee only case. Mostly because of this comment.

It might be that the scheduler does not support jobs which take longer than the scheduled interval.

That's a very important piece of information, I wish I had it earlier. So it seems your LDAP jobs took more than one minute.

Please help me with some log statements to debug what is going on in cron.rb's previous_time: log iterations, dump state, whatever helps; there is no such thing as a loop that goes on forever without a reason. Better to look directly at the core of the problem than to scratch at the edges.

OK, but please give us a clear, up-to-date picture.

jmettraux commented 5 years ago

Hello,

could you please use the following monkey patch and try again? https://gist.github.com/jmettraux/5e998fa0ed4981e4239243a034903bb0

I guess, @godfat will certainly correct me, that it can be placed at the end of https://gitlab.com/gitlab-org/gitlab-ee/blob/v11.8.3-ee/config/initializers/sidekiq.rb to take effect after restart.

It should emit a log warn entry every 10_000th iteration in the endless loop and tell us more about it.

Thanks in advance!
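The gist itself is external to this thread, but the `Module#prepend` technique such logging monkey patches typically rely on can be sketched dependency-free. The `Cron` stand-in class and all names below are illustrative only; the real patch targets `Fugit::Cron#previous_time`:

```ruby
# Stand-in for Fugit::Cron, to keep the sketch self-contained.
class Cron
  attr_reader :original

  def initialize(original)
    @original = original
  end

  def previous_time(from)
    from - 1  # placeholder computation
  end
end

# Prepending places this module BEFORE Cron in the method lookup chain,
# so the wrapper runs first and `super` invokes the original method.
module PreviousTimeLogging
  def previous_time(from)
    warn "previous_time from=#{from} cron=#{original.inspect}"
    super
  end
end

Cron.prepend(PreviousTimeLogging)

p Cron.new('10 * * * * *').previous_time(100)  # => 99, after logging to stderr
```

Placed in an initializer (as suggested above for sidekiq.rb), such a patch takes effect on the next restart.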

conet commented 5 years ago

Finally, here's the current end of the log, because it goes on and on:

Fugit#previous_time \ {:i=>180840000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>180850000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>180860000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>180870000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>180880000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>180890000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>180900000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>180910000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}

and the beginning:

Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"0 23 * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"50 00 * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"10 0 * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"50-59 22 * * 0", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"19 * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"*/1 * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"*/1 * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"*/1 * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"*/1 * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"0 0 * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"0 */2 * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"0 12 * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"0 * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"0 * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"0 12 * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"0 */1 * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"17 * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>10000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>20000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>30000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>40000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>50000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>60000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>70000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>80000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>90000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>100000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
conet commented 5 years ago

ti1 - ti seems to be different, I hope you can figure out why...

conet commented 5 years ago

Anyway, it seems like a circuit breaker based on the number of iterations would be a much safer idea.

jmettraux commented 5 years ago

Anyway, it seems like a circuit breaker based on the number of iterations would be a much safer idea.

Yes, I need to find the right max number (quick to reach and safely unrealistic).

conet commented 5 years ago

Anyway, it seems like a circuit breaker based on the number of iterations would be a much safer idea.

Yes, I need to find the right max number (quick to reach and safely unrealistic).

I'd go with a value that would limit the call to under a couple of seconds on an idle CPU; better to fail in some strange situation than to never return.

jmettraux commented 5 years ago

I'd go with a value that would limit the call to under a couple of seconds on an idle CPU; better to fail in some strange situation than to never return.

You're giving me another idea: a simple timeout, whatever the iteration count, whatever the distance in time, if it takes too long, let's fail.

Still I have to determine why -1 turns into -0.
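The timeout idea can be sketched with Ruby's standard library `Timeout` module. `with_time_budget` and its error message are made up for illustration, and `Timeout` has well-known caveats about interrupting arbitrary code:

```ruby
require 'timeout'

# Fail the whole computation if it exceeds a wall-clock budget,
# regardless of how many iterations it took.
def with_time_budget(budget = 2.0)
  Timeout.timeout(budget) { yield }
rescue Timeout::Error
  raise "previous_time took more than #{budget}s, giving up"
end

# A fast computation completes normally...
p with_time_budget(1.0) { 21 * 2 }  # => 42

# ...while an endless loop raises instead of spinning forever:
begin
  with_time_budget(0.1) { loop {} }
rescue RuntimeError => e
  p e.message
end
```

The downside, as noted in the next comment, is that a wall-clock limit translates into a load-dependent number of iterations.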

conet commented 5 years ago

You're giving me another idea: a simple timeout, whatever the iteration count, whatever the distance in time, if it takes too long, let's fail.

Yeah, but that has the disadvantage that it might mean more or fewer iterations depending on various factors, including current load, so it might actually make this more unpredictable. Even if you figure out why -1 turns into -0 (some type overflow? is that possible in Ruby?), I'd still add a fixed number of iterations. Something like that "41 years and a few days"... there won't be a next or a previous time beyond that, but expressed as the number of iterations that would be enough to reach that state under normal circumstances.
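A fixed iteration cap of the kind proposed here might look like the following minimal sketch. The loop, names, and limit are all illustrative and are not fugit's actual implementation:

```ruby
# Walk backwards one second at a time looking for a matching
# second-of-minute, but give up after a fixed number of iterations
# instead of spinning forever.
MAX_ITERATIONS = 10_000

def previous_second_matching(seconds, from)
  t = from.to_i
  MAX_ITERATIONS.times do
    candidate = Time.at(t).utc
    return candidate if seconds.include?(candidate.sec)
    t -= 1
  end
  raise RangeError,
    "no previous time found in #{MAX_ITERATIONS} iterations, giving up"
end

p previous_second_matching([10], Time.utc(2019, 3, 22, 5, 39, 15))
# => 2019-03-22 05:39:10 UTC
```

Unlike a wall-clock timeout, the cap fails after the same amount of work regardless of CPU load, which makes the failure mode predictable.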

jmettraux commented 5 years ago

OK, installing v11.8.3-ee on Debian 9.8.0. I'll use the .deb at https://about.gitlab.com/install/

Anything I should be aware of?

conet commented 5 years ago

Yeah, in order to avoid using EE-specific cron jobs which might need a license, I would add one of these jobs to the config with a cron definition like 10 * * * * *. After starting, follow this log file for cron-related logs:

/var/log/gitlab/sidekiq/current

If you apply your last patch the logs will show up in this file:

/var/log/gitlab/gitlab-rails/production.log

The endless loop should kick in anywhere between 5 and 60 minutes after start, so please be patient. I wouldn't wait for more than an hour; if it does not happen I will also try to reproduce it on an empty, shareable instance.

jmettraux commented 5 years ago

Thank you very much.