Flagsmith / flagsmith-charts

Helm Charts for Flagsmith. Feature flagging and remote config service. Host yourself or use our hosted version.
https://www.flagsmith.com/

Add periodic LDAP group / users sync #119

Open matthewelwell opened 1 year ago

matthewelwell commented 1 year ago

In the Flagsmith application we have written a Django management command that synchronises LDAP groups / users with Flagsmith groups / users. It can be run via `python manage.py sync_ldap_users_and_groups`. We'd like a process for enterprise users to run this on a specific schedule.

My immediate suggestion is to use k8s cron jobs, something like the following.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: ldap-sync
spec:
  schedule: "0 * * * *"  # hourly, on the hour
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: ldap-sync
            image: flagsmith:${{ api.version }}  # image tag templated from the chart values
            imagePullPolicy: IfNotPresent
            command:
            - python
            - manage.py
            - sync_ldap_users_and_groups
          restartPolicy: OnFailure

However, there are perhaps some additional complications here (I think we might need to add it to the run-docker.sh bash script, for example). There might also be a simpler / better way to achieve this.

gz#262

matthewelwell commented 1 year ago

Relevant PRs:

https://github.com/Flagsmith/flagsmith-docs/pull/276
https://github.com/Flagsmith/flagsmith-ee/pull/81

plumdog commented 1 year ago

@matthewelwell Yes, I think k8s cronjobs are probably the right way. But here are some thoughts on the tradeoffs:

matthewelwell commented 1 year ago

> Once the schedule is set in the cronjob (by passing something to the chart values), a user within the web application can't then adjust it. A way around this would be for the database to store the schedule, which a user can adjust, and then have a background worker look at the schedule and run the import when needed. But this means writing cron yourself, which sucks and you should probably avoid if you can, which I guess you can.

I don't think we'd need to adjust it really, and I guess, if needed, it can be adjusted by updating and redeploying the chart?
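For illustration, a minimal sketch of how the chart could expose this, assuming a hypothetical ldapSync values block (none of these value names exist in the chart yet):

# values.yaml (hypothetical)
ldapSync:
  enabled: true
  schedule: "0 * * * *"

# templates/ldap-sync-cronjob.yaml (excerpt)
{{- if .Values.ldapSync.enabled }}
apiVersion: batch/v1
kind: CronJob
metadata:
  name: {{ .Release.Name }}-ldap-sync
spec:
  schedule: {{ .Values.ldapSync.schedule | quote }}
{{- end }}

Changing the schedule would then just be a helm upgrade with a new value.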

> How bad would it be if two instances of this command were running at the same time (e.g. if I set the schedule as * (i.e. every minute) and each run takes more than 1 minute, then the 12:01 run will start before the 12:00 run has finished)? CronJobs can accommodate this (there is a concurrencyPolicy that can be set to Forbid).

This shouldn't be an issue.
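Even so, guarding against overlap is a one-line addition to the CronJob spec sketched above (concurrencyPolicy is a standard batch/v1 field):

spec:
  schedule: "0 * * * *"
  concurrencyPolicy: Forbid  # skip a new run while a previous one is still running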

Yes, that's a good point about the error handling / success reporting, etc. At the moment, any failure (network, configuration, etc.) results in a non-zero exit code and no status is reported. @gagantrivedi I wonder, therefore, if we should add a field to the organisation (or a separate OrganisationStatisticsReporting model?) that records the datetime of the last successful run of this job. Perhaps we could include a previous_error field too?
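A rough sketch of what that model could look like, purely to make the idea concrete (the model and field names here are hypothetical, not existing Flagsmith code):

from django.db import models

class OrganisationStatisticsReporting(models.Model):
    # Hypothetical model: one row per organisation, updated by the LDAP sync job.
    organisation = models.OneToOneField(
        "organisations.Organisation",  # assumes the existing Organisation model
        on_delete=models.CASCADE,
        related_name="ldap_sync_status",
    )
    last_successful_sync = models.DateTimeField(null=True)
    previous_error = models.TextField(blank=True, default="")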

I don't know that we'd necessarily be able to differentiate between network blips and credential failures, however. The code looks something like the following, and it seems we just get None if the connection fails, with no information about the reason.

# Excerpt from the management command (imports added here for context;
# ldap.connection yields None if the bind fails, whatever the reason).
from django.conf import settings
from django.core.management.base import CommandError

auth_kwargs = {
    "username": settings.LDAP_SYNC_USER_USERNAME,
    "password": settings.LDAP_SYNC_USER_PASSWORD,
}
with ldap.connection(**auth_kwargs) as connection:
    if connection is None:
        raise CommandError("Could not connect to LDAP server")
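For completeness, a hypothetical sketch of how a wrapper could populate the reporting model sketched above, recording both success and the (coarse) error message:

from django.core.management import call_command
from django.utils import timezone

def run_ldap_sync_and_record(organisation):
    # Hypothetical glue code tying the command to the
    # OrganisationStatisticsReporting sketch above.
    report, _ = OrganisationStatisticsReporting.objects.get_or_create(
        organisation=organisation
    )
    try:
        call_command("sync_ldap_users_and_groups")
    except Exception as exc:
        # Only a coarse error is available here (e.g. the CommandError above),
        # so network blips and credential failures look the same.
        report.previous_error = str(exc)
    else:
        report.last_successful_sync = timezone.now()
        report.previous_error = ""
    report.save()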