Matgenix / jobflow-remote

jobflow-remote is a Python package to run jobflow workflows on remote resources.
https://matgenix.github.io/jobflow-remote/

Handling Multi-Factor Authentication for workers #58

Open davidwaroquiers opened 9 months ago

davidwaroquiers commented 9 months ago

Following a question from @JaGeo, I am opening an issue here on the MFA topic. Let's gather ideas, info, existing solutions, problems, ... related to the fact that clusters are slowly (or maybe rapidly?) moving to multi-factor authentication. @gpetretto did some tests (could you maybe summarize your insights here?)

ml-evs commented 9 months ago

One "simple" approach that might work here (and that I have seen suggested on some machines) is to use SSH multiplexing, so that once an authenticated connection is created (by the user, with TOTP or whatever), then the connection is kept open and all further connections within the session go through it. This is handled with a simple socket file, so Paramiko/Fabric etc. should just seamlessly work too. This would require some structure whereby JFR notifies the user that a new TOTP code is required to keep monitoring jobs, and will depend on how stringently machines enforce these timeouts (in practice, the timeout can be infinite...). The relevant SSH config parameters are ControlPersist and ControlMaster (see https://www.man7.org/linux/man-pages/man5/ssh_config.5.html and e.g. the suggestions in the Cambridge docs https://docs.hpc.cam.ac.uk/hpc/user-guide/mfa.html#reducing-the-effort-of-mfa-connection-sharing).
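For illustration, here is a minimal sketch of how the multiplexing options above could be passed on the command line from Python. The helper names, the socket directory and the persist time are assumptions made for this example and are not part of jobflow-remote; only the `ControlMaster`, `ControlPath` and `ControlPersist` options and `ssh -O check` come from OpenSSH itself.

```python
import os


def _control_path(control_dir):
    # %r, %h and %p are expanded by ssh to remote user, host and port,
    # so each (user, host, port) triple gets its own socket file.
    return os.path.join(os.path.expanduser(control_dir), "%r@%h:%p")


def ssh_multiplex_command(host, remote_command=None,
                          control_dir="~/.ssh/sockets", persist="8h"):
    """Build an ssh command that reuses (or creates) a shared master connection.

    Hypothetical helper: the first call opens the master connection and asks
    for the password/TOTP; later calls go through the same socket without a
    new prompt, for as long as the server keeps the connection alive.
    """
    cmd = [
        "ssh",
        "-o", "ControlMaster=auto",
        "-o", f"ControlPath={_control_path(control_dir)}",
        "-o", f"ControlPersist={persist}",
        host,
    ]
    if remote_command:
        cmd.append(remote_command)
    return cmd


def ssh_check_command(host, control_dir="~/.ssh/sockets"):
    """Build the command that asks ssh whether the master connection is alive."""
    return [
        "ssh", "-O", "check",
        "-o", f"ControlPath={_control_path(control_dir)}",
        host,
    ]
```

The resulting lists can be handed to `subprocess.run`. A non-zero exit status from the check command would be one possible signal that the user needs to authenticate again.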

ml-evs commented 9 months ago

Also related is the previous discussion at https://github.com/Matgenix/qtoolkit/issues/14

JaGeo commented 9 months ago

Would this also require a setup without password? We have both, password and MFA at the moment. Using a key pair does not help with the password.

gpetretto commented 9 months ago

I think in general it would not require a password (at least from my tests). The problem is that, unfortunately, multiplexing does not (yet) seem to be supported by paramiko: https://github.com/paramiko/paramiko/issues/852. Until support is added, I am afraid we will have to deal with this issue in some other way.

The fact that jobflow-remote should keep the connection with the host open would allow passing the OTP when the Runner starts. The connection can still be closed, and indeed it would be good if the user could be notified about that. However, I am not sure there is a convenient way of doing that.

ml-evs commented 9 months ago

As an aside, I've just pushed #60 which can be used as a test bed for some of these approaches (both by manually building and launching the MFA-enabled Slurm container and testing locally, and by the eventual full JFR automation...). For now we should at least add clear error messages and a docs page about this until we have a real solution.

ml-evs commented 9 months ago

Also, I'm going to assume this isn't the case for the supercomputers in question, ~but at least for the google-authenticator-libpam implementation I'm using, secrets are stored in plain text in the user directory (so e.g., I can manually define a new emergency backup code for next login once I'm logged in). I'm going to take a wild guess that they are encrypted in production uses (with decryption key probably depending on supercomputer... but might simply be the user's password), so depending on the machine we might have some joy writing new encrypted backup codes that jobflow remote (alone) can use (e.g., at the start of each jobflow-remote "remote tick", generate, write and store a new TOTP emergency backup code, then use it for the next tick's login). Long shot perhaps!~

Again, at least for Cambridge, resetting TOTP requires a video call where you show Government ID, which I assume we don't want to try to spoof :sweat_smile:

gpetretto commented 9 months ago

I have a solution that "works" under certain conditions:

These can be relatively strict, but given the above limitations, I have tested JFR with a simple VM with MFA based on google authenticator. Just setting an OTP as the password in the project configuration file and starting the Runner immediately worked fine. Of course, if this proves an effective solution, we can just add an option when starting the Runner to ease the process: e.g. jf runner start -otp 123456.

The limitation on having a single password prompted is not strictly a limitation for paramiko, but as far as I have seen it is not possible for fabric with built-in options. I would need a bit more time to check how to use the lower level paramiko machinery to properly set up the fabric Connection in that case.
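As a rough sketch of what the lower-level paramiko route might look like (this is not the implementation referred to in this thread), paramiko's `Transport.auth_interactive` accepts a handler that receives the server's prompts and returns one answer per prompt. The helper names and the prompt-matching rule below are assumptions; real servers word their prompts differently, and some ask for the password and the OTP in separate authentication steps.

```python
def make_interactive_handler(password, otp):
    """Return a keyboard-interactive handler answering password and OTP prompts.

    Assumption: any prompt containing the word "password" wants the password,
    and any other prompt wants the one-time code.
    """
    def handler(title, instructions, prompt_list):
        # prompt_list is a list of (prompt_text, echo) tuples sent by the server.
        answers = []
        for prompt, _echo in prompt_list:
            if "password" in prompt.lower():
                answers.append(password)
            else:
                answers.append(otp)
        return answers

    return handler


def connect_with_password_and_otp(host, username, password, otp, port=22):
    """Open a paramiko Transport authenticated via keyboard-interactive.

    Hypothetical helper. Host key verification is omitted here for brevity
    and would be needed in any real use.
    """
    import paramiko  # imported lazily so the handler can be tested without it

    transport = paramiko.Transport((host, port))
    transport.start_client()
    transport.auth_interactive(username, make_interactive_handler(password, otp))
    return transport
```

Wiring such a transport into a fabric Connection is the involved part and is not covered by this sketch.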

I agree that it would be better not to mess with the token generation. I suppose in some cases this could lead to a ban from the cluster.

gpetretto commented 9 months ago

As an update, I managed to create a fabric connection even with password+OTP. It is a bit involved, but it should be possible to implement it in jobflow-remote, if needed.

JaGeo commented 9 months ago

I am still testing with the cluster support to see if the key-pair connection could at least allow for a passwordless connection. They think it should work but, in practice, it does not work yet... I will keep you updated.

gpetretto commented 8 months ago

Update on this topic. I have managed to implement a solution to address this issue. In case anyone else is interested it can be found in this branch: https://github.com/Matgenix/jobflow-remote/tree/interactive. I will merge it after testing it more.

The idea is the following: if an OTP needs to be provided when the daemon is started, the CLI will let the user connect to the daemon process and interact with it (through supervisor's "foreground" option). In this specific case the Runner will immediately try to connect to the remote host and the user will be prompted for the password (if requested) and the OTP. This of course still has some of the limitations listed above:

I should add that the administrators of one computing center told us that storing the secret locally (even encrypted) is not considered an acceptable procedure for them. So I am afraid that the main limitations will remain for the moment.