GoogleCloudPlatform / scientific-computing-examples

Open Source examples using Google Cloud to solve various Scientific and Technical Computing problems.
Apache License 2.0

expose os login as a variable #57

Closed vsoch closed 1 year ago

vsoch commented 1 year ago

This addresses one part of #56 by exposing the OS Login setting as a variable. One design choice we might want to consider is scoping the "global" "enable_os_login" to the manager, since the others seem to come from the compute/login node specs. I'm also curious why this particular variable is the string "TRUE" instead of a boolean.
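On the string-versus-boolean question, one possible explanation (an assumption, not something confirmed in this thread) is that the value ends up in Compute Engine instance metadata, where values are strings and OS Login is toggled by the `enable-oslogin` key. A minimal Python sketch of that conversion, with the hypothetical helper name `oslogin_metadata`:

```python
from __future__ import annotations


def oslogin_metadata(enabled: bool) -> dict[str, str]:
    """Build the metadata entry that toggles OS Login (hypothetical helper).

    Instance metadata values are strings, so the boolean is rendered
    as "TRUE" or "FALSE" instead of being passed through unchanged.
    """
    return {"enable-oslogin": "TRUE" if enabled else "FALSE"}


if __name__ == "__main__":
    print(oslogin_metadata(True))  # {'enable-oslogin': 'TRUE'}
```

Accepting a boolean at the variable level and converting it at the point where the metadata is written would keep the string detail out of the user-facing configuration.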

But @wkharold we need to do a bit more debugging before this is good to go - I'm not able to log in anymore.

$ gcloud compute ssh gffw-login-001 --zone us-central1-a
External IP address was not found; defaulting to using IAP tunneling.
WARNING: 

To increase the performance of the tunnel, consider installing NumPy. For instructions,
please see https://cloud.google.com/iap/docs/using-tcp-forwarding#increasing_the_tcp_upload_bandwidth

myusername@compute.5038866934849735751: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

Recommendation: To check for possible causes of SSH connectivity issues and get
recommendations, rerun the ssh command with the --troubleshoot option.

gcloud compute ssh gffw-login-001 --project=llnl-flux --zone=us-central1-a --troubleshoot

Or, to investigate an IAP tunneling issue:

gcloud compute ssh gffw-login-001 --project=llnl-flux --zone=us-central1-a --troubleshoot --tunnel-through-iap

ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].

Let me know your thoughts! We basically need the username that shows up in the gid/uid maps to be the same. I don't care what it is.

wkharold commented 1 year ago

> I'm not allowed to log in anymore.

There may be a role/permission requirement for this to work. Looking into it.

wkharold commented 1 year ago

OK, finally. Apologies for taking so long to look at this.

Two things.

Thing one: we only need to set enable_os_login at the cluster level and so don't need to change the login_node_spec and compute_node_spec definitions. While it might be possible to enable it on some of the cluster nodes and not on others, I think the uid/gid configuration would get really complex and hard to manage.

Thing two: speaking of uid/gid management, turning off OS Login will almost certainly require some kind of synchronization with an external LDAP to keep the uids and gids straight. Like you, I had a hard time getting logged in, but the one time I did get a login my uid/gid was not consistent from node to node: on the login node I was wkh, but on the compute node I alloc'd I was rocky.

Thing three (bonus): the issue with logging in seems to have something to do with the way IAP is configured on the network, which is something I need to figure out.
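The node-to-node uid/gid mismatch from thing two can be checked mechanically by collecting the output of `id` on each node and comparing it. The sketch below assumes the usual `uid=N(name) gid=N(name) ...` output format; the compute node name and the numeric ids are made up for illustration, and only the user names wkh and rocky come from this thread.

```python
from __future__ import annotations

import re

# Matches the leading part of typical `id` output: uid=1001(wkh) gid=1001(wkh) ...
ID_PATTERN = re.compile(r"uid=(\d+)\(([^)]*)\) gid=(\d+)\(([^)]*)\)")

Identity = tuple[int, str, int, str]  # (uid, user name, gid, group name)


def parse_id(output: str) -> Identity:
    """Extract uid, user name, gid and group name from `id` output."""
    match = ID_PATTERN.search(output)
    if match is None:
        raise ValueError(f"unrecognized id output: {output!r}")
    uid, user, gid, group = match.groups()
    return int(uid), user, int(gid), group


def find_mismatches(id_by_node: dict[str, str]) -> dict[str, Identity]:
    """Return the nodes whose identity differs from the first node listed."""
    parsed = {node: parse_id(out) for node, out in id_by_node.items()}
    if not parsed:
        return {}
    reference = next(iter(parsed.values()))
    return {node: ident for node, ident in parsed.items() if ident != reference}


if __name__ == "__main__":
    # Hypothetical sample data: node name for the compute node and all ids are invented.
    sample = {
        "gffw-login-001": "uid=1001(wkh) gid=1001(wkh) groups=1001(wkh)",
        "gffw-compute-a-001": "uid=1000(rocky) gid=1000(rocky) groups=1000(rocky)",
    }
    for node, ident in find_mismatches(sample).items():
        print(f"{node}: {ident}")
```

An empty result means every node reported the same identity as the first one, which is the condition the cluster needs for consistent uid/gid maps.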

vsoch commented 1 year ago

Gotcha - thanks for the update @wkharold and no worries on the delay! Let's just let this PR sit and die; if you do more digging and I might be able to help, I'm happy to. In the meantime I'll try setting things up manually, using the original recipe. I tried this once last week and ran into another issue that I don't remember at the moment. We've been working on busting so my mind has been elsewhere, but I'll come back to test this again soon.

mmm commented 1 year ago

closing at submitter request... please re-open as nec

vsoch commented 1 year ago

To add an additional comment: when I made my own builds this issue seemed to go away (when I logged in I saw both my user name and the Google Cloud name), so there is some issue in the setup here. But it's not an explicit issue for me anymore because I'm not using these configs.

Thanks @mmm !