
Helix "run scripts" should not run in user-defined docker containers #1239

Open jkoritzinsky opened 1 year ago

jkoritzinsky commented 1 year ago

Today, the Helix run scripts execute in the same docker container as the user's tests: the container that the user requested. This causes a few different issues:

  1. All containers that run user tests must also have the prerequisites to run the Helix scripts.
  2. All Helix scripts must be able to run on all currently-used docker images.

We've hit an issue with this in the dotnet/runtime test infrastructure, and it will block a future idea I have for reducing intermittent infrastructure failures in our CI.

If Helix could move the run scripts into their own container that shares volumes with the user test container, the scripts would run in an environment with known dependencies, and users could use more "exotic" containers to run their tests more efficiently or reliably.
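
To make the proposal concrete, here is a minimal sketch of the two-container layout, driven from Python. All image names, volume names, paths, and commands here are hypothetical illustrations, not actual Helix infrastructure:

```python
import subprocess

# Hypothetical names; nothing below reflects real Helix images or scripts.
SHARED_VOLUME = "helix-workitem-payload"
USER_IMAGE = "contoso/exotic-test-image:latest"   # the "exotic" user container
HELIX_IMAGE = "helix/known-deps:latest"           # container with known script deps

def run(*args):
    subprocess.run(args, check=True)

# Create a volume both containers can see.
run("docker", "volume", "create", SHARED_VOLUME)

# Run the user's tests in their requested container, writing results
# into the shared volume.
run("docker", "run", "--rm",
    "-v", f"{SHARED_VOLUME}:/payload",
    USER_IMAGE,
    "sh", "-c", "run-tests --output /payload/results")

# Run the Helix "run scripts" (e.g. result upload) in a container whose
# dependencies Helix controls, reading from the same volume.
run("docker", "run", "--rm",
    "-v", f"{SHARED_VOLUME}:/payload",
    HELIX_IMAGE,
    "python", "/helix/scripts/upload_results.py", "/payload/results")

run("docker", "volume", "rm", SHARED_VOLUME)
```

The key property is that only HELIX_IMAGE needs the script prerequisites; the user image only needs whatever the tests themselves require.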


MattGal commented 1 year ago

It seems like there are several things being asked for here. My initial sense is that if you have an unusual docker-y scenario, you should just drive the containers yourself from a work item. All dockerized Helix queues allow non-Docker work items, and we do our best to manage containers between runs; if a user starts a bunch of them, we should be able to clean them up between work items, barring some kind of bug in the client.
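
As a rough illustration of that workaround (the image name, container name, and test command are hypothetical, not Helix APIs), a non-Docker work item could launch and own its container, then clean up afterward:

```python
import subprocess

USER_IMAGE = "contoso/exotic-test-image:latest"  # hypothetical
CONTAINER = "my-workitem-tests"                  # hypothetical

try:
    # The work item itself runs outside Docker; it starts and owns the
    # container only for the duration of the tests.
    subprocess.run(
        ["docker", "run", "--name", CONTAINER, USER_IMAGE, "run-tests"],
        check=True)
finally:
    # Remove the container so the machine is reusable for the next work
    # item, mirroring the between-run cleanup Helix already attempts.
    subprocess.run(["docker", "rm", "-f", CONTAINER], check=False)
```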

I'll comment first on these statements:

These are not actually "musts"; rather, they are something we strive to provide because our goal is for a given Helix work item to work the same both inside and outside of a docker container. Historically, we assumed that users would call our Python libraries directly for things like uploading results, but very few actual users import them. Presently the only thing that does is the packing test reporter, and if there were truly some need, it could be implemented to be entirely independent of Helix's dependencies.

We don't actually run the Helix client at all inside a docker container, and the only actively used import from the Helix scripts is here, where we reference DefaultTestReporter, AzureDevOpsReportingParameters, and PackingTestReporter. This is meant to guarantee that any fixes or improvements made in reporting are shared between work items, and it is something I'd prefer we not change.
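
To illustrate the pattern Matt wants preserved, here is a self-contained sketch using stand-in classes. Only the three class names come from the comment above; the module layout, constructors, and methods are invented for illustration and are not the actual Helix API:

```python
# Stand-ins for illustration only; NOT the real Helix types.

class AzureDevOpsReportingParameters:
    """Stand-in for reporting configuration (endpoints, tokens, etc.)."""
    def __init__(self, collection_uri="https://dev.azure.com/example"):
        self.collection_uri = collection_uri

class DefaultTestReporter:
    """Stand-in: formats and uploads results using shared logic."""
    def __init__(self, parameters):
        self.parameters = parameters
    def report(self, results):
        print(f"Uploading {len(results)} results to {self.parameters.collection_uri}")

class PackingTestReporter(DefaultTestReporter):
    """Stand-in for the one consumer that work items actually import."""
    def report(self, results):
        # Because work items import this shared code rather than copying
        # it, any reporting fix made here reaches every work item at once,
        # which is the property the comment above wants to keep.
        super().report(results)

PackingTestReporter(AzureDevOpsReportingParameters()).report(["test1", "test2"])
```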

Next, regarding the PyYAML comment: this had nothing to do with Docker containers, but rather with the broad variety of non-Docker Linuxes we support. I was unable to find a way of installing PyYAML that worked on all supported images. Once some of our older images (like Ubuntu 16.04 and SLES 12/15) age out, there's a very real possibility we can reevaluate this and use the library on all Helix agents again. It would be possible, though we decided it was undesirable, to fork what packages we install and only install PyYAML where it doesn't break VM image creation. This would, however, break your yaml-based test reporting on those systems.
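
One way to see the constraint: a script that degrades gracefully when PyYAML is absent (a generic sketch, not the actual Helix reporter code) still loses structured yaml reporting on any image where the package cannot be installed:

```python
try:
    import yaml  # PyYAML; not installable on some older supported images
    HAVE_YAML = True
except ImportError:
    HAVE_YAML = False

def report_results(path):
    if not HAVE_YAML:
        # On images without PyYAML, yaml-based test reporting is
        # effectively disabled; only a plain fallback remains.
        print(f"PyYAML unavailable; skipping structured report for {path}")
        return None
    with open(path) as f:
        return yaml.safe_load(f)
```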

For the WASI question, the link you shared literally says "We recommend that you do not use this feature in production environments", but I can note that in Ubuntu docker scenarios we're already using the containerd image store and regularly update our docker artifacts, so this might already be available to you today (via sending non-Docker work items and driving docker yourself).

Finally, given the comment "XHarness has some significant stability issues due to the inherent cross-process model it uses", I would point out that this type of unstable scenario is exactly what this issue is asking us to impose everywhere in Helix. That is definitely something I'd like to avoid absent a compelling reason.

markwilkie commented 1 year ago

My initial thought is to have a chat with @jkoritzinsky to see whether the shared test infra plans would work for your needs. Basically, Matt's suggestion that you drive the containers yourself makes sense to me too. However, the shared infra layer should give you additional tools/hooks/context to do that more sanely.

cc/ @riarenas, @alexperovich, @AlitzelMendez, @epananth