canonical / testflinger

https://testflinger.readthedocs.io/en/latest/
GNU General Public License v3.0

Support device name in addition to MAAS ID for MAAS device agents #127

Open rodwsmith opened 1 year ago

rodwsmith commented 1 year ago

Currently, we need to specify MAAS-deployed servers via their six-character MAAS device IDs (e.g., gwmhd6) in Testflinger configuration files. This can create problems if those device IDs change, as happens if a node must be re-enlisted. After re-enlisting a node, the relevant configuration files must be tracked down and changed to match.

It would be much better if Testflinger could start with a machine name and then extract the MAAS ID from a call to "maas admin machines read", for use in subsequent maas calls. The current method of directly using MAAS IDs would have to be maintained for backward compatibility, of course.
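A minimal sketch of the lookup proposed above, not Testflinger's actual implementation. It assumes that `maas admin machines read` prints a JSON list of machine objects carrying `hostname` and `system_id` fields, and that the CLI profile is named `admin` as in the command quoted above. The function names and the `node_name` configuration key are hypothetical; `node_id` stands in for whatever key currently holds the MAAS ID.

```python
import json
import subprocess


def find_system_id(machines_json: str, hostname: str) -> str:
    """Return the MAAS system ID of the one machine with this hostname."""
    machines = json.loads(machines_json)
    matches = [m["system_id"] for m in machines if m.get("hostname") == hostname]
    if not matches:
        raise LookupError(f"no MAAS machine named {hostname!r}")
    if len(matches) > 1:
        # Fail loudly rather than guess if the name is not unique.
        raise LookupError(f"hostname {hostname!r} is ambiguous: {matches}")
    return matches[0]


def resolve_node_id(config: dict, profile: str = "admin") -> str:
    """Use an explicit ID if configured (backward compatible), else look up by name."""
    if config.get("node_id"):
        return config["node_id"]
    # Assumption: the maas CLI is installed and logged in under `profile`.
    output = subprocess.run(
        ["maas", profile, "machines", "read"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return find_system_id(output, config["node_name"])
```

Because the lookup runs on every call, a re-enlisted node would pick up its new ID without any configuration change, at the cost of one extra `maas` invocation per provisioning run.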

plars commented 1 year ago

This should be possible; however, it will add some extra time and maas commands to every single provisioning run for that device. If you are saying that you frequently recommission your devices and the maas ID changes from time to time, then we could never trust a cache or anything and would need to re-request that ID on each run, unless maas now supports operations based on name.

The other obvious potential problem here is that if you ever change the name, even just a bit, then you're right back at the same problem. At some point, you must have a unique ID for a device, and I'm not sure whether maas actually enforces uniqueness of the device name. The ID is definitely unique though, and not really that obscure: it's in the URL when you go to configure the device.

I think this is doable, but comes with a few caveats if you choose to use it.

rodwsmith commented 1 year ago

Re-commissioning isn't the problem; it's re-enlisting the node that causes the ID to change. We do this infrequently, but when we do, it becomes a hassle to track down the relevant Testflinger files and change them; and if this step is forgotten, our deployments start failing, which ends up in a debugging queue. This can end up wasting hours of time. Taking a few extra seconds, or even a minute or two, for each deployment is a small price to pay to avoid these issues. (Our test runs normally take a day or so.) We're pretty careful about our machine names, and I think we'd be more likely to notice if one of them was wrong.

Thus, overall, I think we'd prefer using machine names as the only way to identify SUTs, rather than a mixture of machine names and MAAS IDs, as it is now, even given the caveats you've noted.

jocave commented 1 year ago

Is there some difference in typical certified server machine lifecycle that we need to understand to make a good decision here?

For example, why do you ever (outside of some kind of unexpected maintenance or something) need to re-enlist a device? Once it is added to MAAS and set up in testflinger, what events would mean it couldn't stay there ~forever?

rodwsmith commented 1 year ago

Usually a re-enlistment happens because we're experiencing severe problems with the server's MAAS page -- one node's page is misbehaving in some way, but all others are OK. These problems may be caused by MAAS bugs, of course, but rather than wait n months for a fix in MAAS, we usually remove the node's entry and re-create it. (We'll also file a MAAS bug report if we can pin it down well enough, of course.) Such problems might also happen because of database corruption. As I say, this is rare, but when it does happen, having to muck with Testflinger configurations just makes it harder. We just recently had this happen; a node was not deploying via Testflinger, and it took me a while to identify the cause as the changed MAAS ID code.

Another issue is that there are some device details that are extracted during enlistment but not when commissioning or deploying a node, and in the past, new MAAS versions have sometimes added to the available pool of information extracted during enlistment. When this happens, old nodes lack this new information. I don't recall precisely what's been added like this (maybe firmware versions?), but if we decided we'd need that information accessible from MAAS, we'd need to re-enlist nodes. As I say, I don't think we've ever re-enlisted a node for this reason, but it could happen.

There have also been one or two occasions in the past when we've had to completely re-create all a MAAS server's entries because of a catastrophic failure. This hasn't happened recently, thankfully, but it could happen in the future.

jocave commented 1 year ago

Do you have some testing commitments that mean you often need to interact with MAAS itself or check that some behaviour of it is working? I would expect that most consumers (people running test jobs) on Devices Cert lab hardware would never go anywhere near MAAS as all reservation, provisioning etc happens via testflinger.

rodwsmith commented 1 year ago

We always run actual certification test runs directly and manually, since doing it via Testflinger just adds another layer of complexity without providing any benefit. We use Testflinger for ongoing tests, as new kernels are released, etc.

We also use MAAS directly for initial setup (Testflinger isn't designed for that), for troubleshooting when a server starts to misbehave (removing Testflinger removes variables), when we need to test something manually (for instance, verify a bug report), when doing maintenance on servers (updating their firmware, for instance), etc. Also, sometimes people on other teams need access to our hardware, and that's generally done by giving them access to MAAS. (They often need to write new drivers or other software that relies on hardware we have in our possession.)

jocave commented 1 year ago

It does sound to me like currently there is an expectation that humans would be performing mediation of access to the devices in your lab, and this is somewhat at odds with the ethos that this should be handled by testflinger, i.e. its role is to time-share access to hardware resources continuously, whether that be for automated jobs or for access by individuals for experimentation, without requiring support from the owners of the lab.

I'm not advocating that we change that situation overnight, but I think it's something that we should keep in mind and discuss whether we can converge on.

rodwsmith commented 1 year ago

However people access servers in the Server Certification lab, the original point of this bug report remains valid: If the node must be re-enlisted (as occasionally happens), use of MAAS IDs rather than hostnames creates extra work. It would be better for Testflinger to use hostnames, which we can at least re-create when re-enlisting a node.