fabric8-services / fabric8-auth

Identity and Access Management for fabric8 services
https://auth.openshift.io/api/status
Apache License 2.0

worker routine to notify inactive users #812

Closed xcoulon closed 5 years ago

xcoulon commented 5 years ago

Using the pglock library to obtain a lock in the database, in order to avoid concurrent processing (one worker per pod). The lock is acquired at worker init and released by the call to its Stop() method, which MUST be within a deferred call in main.go, so that the pod that owns the lock can release it when it is shut down (i.e., during a deployment rollout).
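A minimal sketch of this pattern, using hypothetical Worker/Lock types as stand-ins for the pglock-backed code (names are illustrative, not the actual fabric8-auth API):

```go
package main

import (
	"fmt"
	"sync"
)

// Lock is a stand-in for the database lock obtained via pglock.
type Lock struct{ released bool }

func (l *Lock) Release() { l.released = true }

// Worker acquires the lock at init and holds it until Stop() is called.
type Worker struct {
	lock *Lock
	once sync.Once
}

func NewWorker() *Worker {
	// in the real code, the lock is acquired from the database here
	return &Worker{lock: &Lock{}}
}

// Stop releases the lock; it MUST be deferred in main.go so the lock is
// released when the pod shuts down (e.g., during a deployment rollout).
func (w *Worker) Stop() {
	w.once.Do(func() { w.lock.Release() })
}

func main() {
	w := NewWorker()
	defer w.Stop() // guarantees release on shutdown
	fmt.Println("worker running; lock held:", !w.lock.released)
}
```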

Fixes #ODC146

Signed-off-by: Xavier Coulon xcoulon@redhat.com

codecov[bot] commented 5 years ago

Codecov Report

Merging #812 into master will increase coverage by 0.07%. The diff coverage is 83.23%.


@@            Coverage Diff            @@
##           master    #812      +/-   ##
=========================================
+ Coverage   78.43%   78.5%   +0.07%     
=========================================
  Files          93      98       +5     
  Lines        8995    9157     +162     
=========================================
+ Hits         7055    7189     +134     
- Misses       1420    1438      +18     
- Partials      520     530      +10
Impacted Files Coverage Δ
authentication/account/service/user.go 76.84% <100%> (+0.46%) ↑
authorization/token/worker/token_cleanup_worker.go 82.6% <100%> (ø)
migration/migration.go 66.86% <100%> (+0.19%) ↑
...n/subscription/service/oso_subscription_service.go 78.94% <66.66%> (-0.34%) ↓
configuration/configuration.go 83.41% <77.77%> (-0.25%) ↓
worker/worker.go 79.31% <79.31%> (ø)
worker/lock_owner.go 80% <80%> (ø)
...nt/worker/user_deactivation_notification_worker.go 82.6% <82.6%> (ø)
...ication/account/worker/user_deactivation_worker.go 83.87% <83.87%> (ø)
worker/repository/worker_lock_repository.go 92% <92%> (ø)
... and 6 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 551a6e5...d78cdc9.

nurali-techie commented 5 years ago

@xcoulon I have two related points.

First, on the frequency (link): does this mean that the worker will do the work every 24 hrs from pod start?

For example, assuming the pod started and got the lock at 01-Jan 3:00 PM:
Run1 - 01-Jan 3:00 PM
Run2 - 02-Jan 3:00 PM

If that is the case, then I would rather prefer to run at some particular time during the day, for example every day at midnight.

For the same example, assuming the pod started and got the lock at 01-Jan 3:00 PM:
Run1 - 02-Jan 00:00 AM
Run2 - 03-Jan 00:00 AM

This will help during debugging and troubleshooting: whenever we have an issue with these workers, we always know to check the logs from around midnight.

Second, I understood that when a pod starts it tries to get the lock; assume it's successful. Now when that pod shuts down, it MUST release the lock, so graceful shutdown becomes very important (link). I also see that @MatousJobanek has mentioned an edge case for which he suggests using a lease duration.

Here, I would prefer that each time the work is about to start, all running pods compete for the lock and one of them gets it. That pod runs the work and releases the lock immediately after completing it. With this, we don't need to worry about graceful shutdown of pods and related scenarios.

In summary, the First and Second points go hand in hand to make sure that all running pods compete for the lock at the same time (for example at midnight daily) and one of them gets the lock to run the work for that particular cycle.

xcoulon commented 5 years ago

> @xcoulon I have two related points.
>
> First, on the frequency (link): does this mean that the worker will do the work every 24 hrs from pod start?
>
> For example, assuming the pod started and got the lock at 01-Jan 3:00 PM:
> Run1 - 01-Jan 3:00 PM
> Run2 - 02-Jan 3:00 PM
>
> If that is the case, then I would rather prefer to run at some particular time during the day, for example every day at midnight.
>
> For the same example, assuming the pod started and got the lock at 01-Jan 3:00 PM:
> Run1 - 02-Jan 00:00 AM
> Run2 - 03-Jan 00:00 AM

Correct: the worker runs the function every 24 hours, not at a specific time of the day, AFAIK.

> This will help during debugging and troubleshooting: whenever we have an issue with these workers, we always know to check the logs from around midnight.

Good point. I'll see if there's something closer to cron scheduling instead of a basic timer.

> Second, I understood that when a pod starts it tries to get the lock; assume it's successful. Now when that pod shuts down, it MUST release the lock, so graceful shutdown becomes very important (link). I also see that @MatousJobanek has mentioned an edge case for which he suggests using a lease duration.
>
> Here, I would prefer that each time the work is about to start, all running pods compete for the lock and one of them gets it. That pod runs the work and releases the lock immediately after completing it. With this, we don't need to worry about graceful shutdown of pods and related scenarios.

Right, that would mean the lock is in a released state most of the time. There's still the case where the pod is shut down while a worker is deactivating users, though. But I can easily make this change (acquire the lock before each cycle).

> In summary, the First and Second points go hand in hand to make sure that all running pods compete for the lock at the same time (for example at midnight daily) and one of them gets the lock to run the work for that particular cycle.
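The per-cycle variant discussed here could be sketched roughly as follows, with a hypothetical Locker interface standing in for the pglock-backed lock repository (the in-memory localLocker exists only to make the sketch runnable):

```go
package main

import "fmt"

// Lock and Locker are hypothetical stand-ins for the pglock-backed lock;
// names are illustrative, not the actual fabric8-auth API.
type Lock interface{ Release() }

type Locker interface {
	// Acquire returns ok=false when another pod currently holds the lock.
	Acquire(name string) (Lock, bool)
}

// localLocker is an in-memory Locker used here only for demonstration.
type localLocker struct{ held bool }

type localLock struct{ owner *localLocker }

func (l localLock) Release() { l.owner.held = false }

func (l *localLocker) Acquire(name string) (Lock, bool) {
	if l.held {
		return nil, false
	}
	l.held = true
	return localLock{owner: l}, true
}

// runCycle competes for the lock, does the work if it wins, and releases
// the lock immediately afterwards, so the lock is free between cycles.
func runCycle(locker Locker, work func()) bool {
	lock, ok := locker.Acquire("user-deactivation-notification")
	if !ok {
		return false // another pod won this cycle
	}
	defer lock.Release()
	work()
	return true
}

func main() {
	locker := &localLocker{}
	ran := runCycle(locker, func() { fmt.Println("notifying inactive users") })
	fmt.Println("ran this cycle:", ran) // the lock is already released again
}
```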

alexeykazakov commented 5 years ago

@nurali-techie @xcoulon I'm not sure what you are referring to when talking about midnight or another particular time of the day. Midnight for whom? What time zone? :)

nurali-techie commented 5 years ago

> @nurali-techie @xcoulon I'm not sure what you are referring to when talking about midnight or another particular time of the day. Midnight for whom? What time zone? :)

@alexeykazakov the solution has nothing to do with time zones. It is about running the work at the same time on every run, irrespective of time zone.

In the case of the 24-hr schedule, we mentioned running at midnight; for completeness we can say midnight in the pod/cluster time zone, but it can be any preferred time (with time zone) we decide.

Take another example: we also have an every-hour schedule for the cleanup worker (link). For this we should try to run the work at 1 PM, 2 PM, 3 PM, and so on, rather than at 1:10:30 PM, 2:10:30 PM, 3:10:30 PM (assuming the pod started and acquired the lock at 1:10:30 PM).

nurali-techie commented 5 years ago

> I'll see if there's something closer to cron scheduling instead of a basic timer.

@xcoulon we don't need a cron scheduling mechanism. We just need simple logic for one or two schedules. I tried this out to get the next schedule time for the 24-hr schedule.

Code: https://play.golang.org/p/MycTNawM6wo. Output on my system:

nowTime=2019-04-08 10:31:52.23250857 +0530 IST m=+0.000130172
todayRoundTime=2019-04-08 05:30:00 +0530 IST
tomorrowRoundTime=2019-04-09 05:30:00 +0530 IST

This is for the 24-hr schedule. For the every-hour schedule, change d to d := time.Hour and it will give the next hourly worker run time.

alexeykazakov commented 5 years ago

@nurali-techie see https://github.com/fabric8-services/fabric8-auth/pull/812#discussion_r272703493

We must be ready to deactivate 10000+ users per day. It's a full-time job, so it doesn't really matter when we start. Let's not bother with any scheduling: start the cycle as soon as it's ready.

xcoulon commented 5 years ago

@nurali-techie I changed the logic to acquire and release the lock on every cycle in e7b23af

xcoulon commented 5 years ago

@alexeykazakov as you suggested, I disabled the workers in main.go in 75c386c

alexeykazakov commented 5 years ago

@xcoulon also @sbryzak has reviewed https://github.com/cirello-io/pglock and approved it for use :) The heartbeat/lease logic looks pretty robust and should guarantee proper locking and unlocking in case of a hard crash. So your PR looks good in terms of locking/unlocking. We might want to relax the lease and heartbeat timeouts a bit, though: 3s and 1s look a bit too aggressive, but it's not critical. Maybe 3 and 1 minutes? But it's minor.

nurali-techie commented 5 years ago

@xcoulon one piece of feedback (an optional idea), in case it is relevant and helps.

Topic: releasing locks acquired by dead pods.

Assume this sequence of events:
T1 - pod1 acquires the lock
T2 - pod1 crashes
T3 - pod2 tries to acquire the lock

In the above scenario, pod1 can be considered a dead pod. We could add the mechanism below, implemented in our own code, in pglock, or in a combination of both.

At T3, pod2 tries to acquire the lock:

try (acquire_lock by pod2) {
    if not locked {
        give lock to pod2 and return
    }

    if locked {
        live_pods = get list of live pod names
        for each live_pod in live_pods {
            if live_pod.name == lock.owner_name {
                reject lock request by pod2 and return
            }
        }
        // the lock owner is a dead pod
        release lock
        give lock to pod2 and return
    }
}

I am not sure if we can get the list of live pods (pods with status=running for the auth service in the OpenShift cluster). But if we can, the solution becomes robust against locks held by dead pods, and there are many edge cases where this would apply.

alexeykazakov commented 5 years ago

> Topic: releasing locks acquired by dead pods.

@nurali-techie this is already handled properly by pglock. If pod1 crashes without explicitly releasing the lock, the heartbeat routine from that pod stops updating the lease, and pod2 will successfully grab the lock after the previous lock owner's (pod1's) lease has ended.

xcoulon commented 5 years ago

@alexeykazakov I've updated the PR in b527a98 to call the OSO Reg App to deactivate the users. I suggest I take care of refactoring the code in the user service to expose a single DeactivateUsers(context.Context, *time.Time) method in a subsequent PR, as this is more of a cosmetic concern (i.e., the business logic is covered).

xcoulon commented 5 years ago

[test]