falldamagestudio / UE-Jenkins-BuildSystem

Build Unreal Engine & games with Jenkins on GKE/GCE
MIT License

Dynamic Linux agents sometimes fail to launch: `SEVERE: Failed to retrieve SSH keypair for instance: <name>` #49

Closed by Kalmalyzer 3 years ago

Kalmalyzer commented 3 years ago

This sometimes happens when setting up the new non-Docker SSH Linux VMs.

I think something is a bit off with the SSH key handling. Why are instance-specific keys generated in `InstanceConfiguration.instance()`, assigned to a class variable, and then used later in the flow? Why are these keys sometimes missing during the first launch of some instances?

Kalmalyzer commented 3 years ago

The reason for the SSH keypair member variables was convenience: someone needed to return two values from the same method, and chose to pass one via a member variable instead.
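To illustrate why that side channel is fragile, here is a minimal sketch in Python (chosen for brevity; it is not the plugin's code). All names (`SideChannelConfig`, `ExplicitConfig`, `KeyPair`, `Instance`) are hypothetical, and the string "keys" are placeholders rather than real SSH keys. The first class mirrors the pattern described above; the second returns both values explicitly, so a caller cannot obtain an instance without also getting its keypair.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class KeyPair:
    public_key: str
    private_key: str


@dataclass(frozen=True)
class Instance:
    name: str
    public_key: str


class SideChannelConfig:
    """Pattern described above: one value is returned, the other is stashed on self."""

    def __init__(self) -> None:
        # Stays None on any code path that never calls instance().
        self.ssh_key_pair: Optional[KeyPair] = None

    def instance(self, name: str) -> Instance:
        # Placeholder strings stand in for real key material.
        self.ssh_key_pair = KeyPair(f"pub-{name}", f"priv-{name}")
        return Instance(name, self.ssh_key_pair.public_key)


class ExplicitConfig:
    """Alternative: return both values together, with no hidden state."""

    def instance(self, name: str) -> Tuple[Instance, KeyPair]:
        key_pair = KeyPair(f"pub-{name}", f"priv-{name}")
        return Instance(name, key_pair.public_key), key_pair
```

With the side-channel version, any flow that skips `instance()` finds `ssh_key_pair` unset, which is the kind of gap that shows up as a missing keypair at launch time.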

The problem here was that the plugin had no place to persist keys once the agents were gone from Jenkins, and our new "wake up persisted instance" flow did not generate a new keypair and apply it to the instance.

https://github.com/falldamagestudio/UE-Jenkins-Images/commit/0228d23baaf1958771acde6feb9df337f758b6a0 makes the key handling on Linux support nodes that persist. For persisted instances, different API calls were needed to go out and modify the SSH keys (the old code path only worked when creating an instance from scratch).
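The shape of that fix can be sketched as follows. This is a rough Python illustration under stated assumptions, not the code from the commit: `FakeCloud`, `create_instance`, `set_ssh_key`, `generate_key_pair`, and `launch_agent` are all hypothetical names, the cloud is an in-memory stand-in, and random tokens stand in for real SSH keys. The point is only that every launch generates a fresh keypair, and that an already-existing (persisted) instance needs a separate "update keys" call instead of the creation path.

```python
import secrets
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class KeyPair:
    public_key: str
    private_key: str


def generate_key_pair() -> KeyPair:
    # Placeholder: random tokens stand in for a real SSH keypair.
    token = secrets.token_hex(8)
    return KeyPair(public_key=f"pub-{token}", private_key=f"priv-{token}")


class FakeCloud:
    """In-memory stand-in for the cloud API: instance name -> authorized public key."""

    def __init__(self) -> None:
        self.instances: Dict[str, str] = {}

    def create_instance(self, name: str, public_key: str) -> None:
        # Creation path: the key is supplied as part of the new instance.
        self.instances[name] = public_key

    def set_ssh_key(self, name: str, public_key: str) -> None:
        # Update path: replaces the key on an instance that already exists.
        if name not in self.instances:
            raise KeyError(name)
        self.instances[name] = public_key


def launch_agent(cloud: FakeCloud, name: str) -> KeyPair:
    """Always produce a fresh keypair, then pick the call that fits the instance state."""
    key_pair = generate_key_pair()
    if name in cloud.instances:
        cloud.set_ssh_key(name, key_pair.public_key)      # persisted instance woken up
    else:
        cloud.create_instance(name, key_pair.public_key)  # instance created from scratch
    return key_pair
```

Because the keypair is returned by `launch_agent` on both paths, nothing has to be persisted between launches: the caller always holds the private key matching whatever the instance currently authorizes.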