
Helix "run scripts" should not run in user-defined docker containers #1239

Open jkoritzinsky opened 1 year ago

jkoritzinsky commented 1 year ago

Today, the Helix run scripts execute in the same docker container as the user's tests: the container that the user requested. This causes a few different issues:

  1. All containers that run user tests must also have the prerequisites to run the Helix scripts.
  2. All Helix scripts must be able to run on all currently-used docker images.

We've hit an issue with this in the dotnet/runtime test infrastructure, and it will block a future idea I have for reducing intermittent infrastructure failures in our CI.

If Helix could move the run scripts into their own container that shares volumes with the user test container, the scripts would run in an environment with known dependencies, and users could use more "exotic" containers to run their tests more efficiently or reliably.
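
To make the proposal concrete, here is a minimal sketch of the two-container layout, driven from Python. All image names, volume names, paths, and commands here are hypothetical illustrations, not actual Helix infrastructure:

```python
import subprocess

# Hypothetical names; nothing below reflects real Helix images or scripts.
SHARED_VOLUME = "helix-workitem-payload"
USER_IMAGE = "contoso/exotic-test-image:latest"   # the "exotic" user container
HELIX_IMAGE = "helix/known-deps:latest"           # container with known script deps

def run(*args):
    subprocess.run(args, check=True)

# Create a volume both containers can see.
run("docker", "volume", "create", SHARED_VOLUME)

# Run the user's tests in their requested container, writing results
# into the shared volume.
run("docker", "run", "--rm",
    "-v", f"{SHARED_VOLUME}:/payload",
    USER_IMAGE,
    "sh", "-c", "run-tests --output /payload/results")

# Run the Helix "run scripts" (e.g. result upload) in a container whose
# dependencies Helix controls, reading from the same volume.
run("docker", "run", "--rm",
    "-v", f"{SHARED_VOLUME}:/payload",
    HELIX_IMAGE,
    "python", "/helix/scripts/upload_results.py", "/payload/results")

run("docker", "volume", "rm", SHARED_VOLUME)
```

The key property is that only HELIX_IMAGE needs the script prerequisites; the user image only needs whatever the tests themselves require.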


MattGal commented 1 year ago

It seems like there are several things being asked for here. My initial sense is that if you have an unusual docker-y scenario, you should just drive the containers yourself from a work item. All dockerized Helix queues allow non-Docker work items, and we do our best to manage containers between runs; if a user starts a bunch of them, we should be able to clean them up between work items, barring some kind of bug in the client.
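
As a rough illustration of that workaround (the image name, container name, and test command are hypothetical, not Helix APIs), a non-Docker work item could launch and own its container, then clean up afterward:

```python
import subprocess

USER_IMAGE = "contoso/exotic-test-image:latest"  # hypothetical
CONTAINER = "my-workitem-tests"                  # hypothetical

try:
    # The work item itself runs outside Docker; it starts and owns the
    # container only for the duration of the tests.
    subprocess.run(
        ["docker", "run", "--name", CONTAINER, USER_IMAGE, "run-tests"],
        check=True)
finally:
    # Remove the container so the machine is reusable for the next work
    # item, mirroring the between-run cleanup Helix already attempts.
    subprocess.run(["docker", "rm", "-f", CONTAINER], check=False)
```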

I'll comment first on these statements:

These are not actually "musts"; rather, they are something we strive to provide because our goal is for a given Helix work item to work the same both inside and outside of a docker container. Historically, we assumed that users would call our Python libraries directly for things like uploading results, but very few actual users import them. Presently the only thing that does is the packing test reporter, and if there were truly some need, it could be implemented to be entirely independent of Helix's dependencies.

We don't actually run the Helix client at all inside a docker container, and the only actively used import from the Helix scripts is here, where we reference DefaultTestReporter, AzureDevOpsReportingParameters, and PackingTestReporter. This is meant to guarantee that any fixes or improvements made in reporting are shared between work items, and it is something I'd prefer we not change.
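
To illustrate the pattern Matt wants preserved, here is a self-contained sketch using stand-in classes. Only the three class names come from the comment above; the module layout, constructors, and methods are invented for illustration and are not the actual Helix API:

```python
# Stand-ins for illustration only; NOT the real Helix types.

class AzureDevOpsReportingParameters:
    """Stand-in for reporting configuration (endpoints, tokens, etc.)."""
    def __init__(self, collection_uri="https://dev.azure.com/example"):
        self.collection_uri = collection_uri

class DefaultTestReporter:
    """Stand-in: formats and uploads results using shared logic."""
    def __init__(self, parameters):
        self.parameters = parameters
    def report(self, results):
        print(f"Uploading {len(results)} results to {self.parameters.collection_uri}")

class PackingTestReporter(DefaultTestReporter):
    """Stand-in for the one consumer that work items actually import."""
    def report(self, results):
        # Because work items import this shared code rather than copying
        # it, any reporting fix made here reaches every work item at once,
        # which is the property the comment above wants to keep.
        super().report(results)

PackingTestReporter(AzureDevOpsReportingParameters()).report(["test1", "test2"])
```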

Next, regarding the PyYAML comment: this had nothing to do with Docker containers, but rather with the broad variety of non-Docker Linuxes we support. I was unable to find a way of installing PyYAML that worked on all supported images. Once some of our older images (like Ubuntu 16.04 and SLES 12/15) age out, there's a very real possibility we can reevaluate this and use the library on all Helix agents again. It would be possible, though we decided it was undesirable, to fork what packages we install and only install PyYAML where it doesn't break VM image creation. This would, however, break your yaml-based test reporting on those systems.
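
One way to see the constraint: a script that degrades gracefully when PyYAML is absent (a generic sketch, not the actual Helix reporter code) still loses structured yaml reporting on any image where the package cannot be installed:

```python
try:
    import yaml  # PyYAML; not installable on some older supported images
    HAVE_YAML = True
except ImportError:
    HAVE_YAML = False

def report_results(path):
    if not HAVE_YAML:
        # On images without PyYAML, yaml-based test reporting is
        # effectively disabled; only a plain fallback remains.
        print(f"PyYAML unavailable; skipping structured report for {path}")
        return None
    with open(path) as f:
        return yaml.safe_load(f)
```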

For the WASI question, the link you shared literally says "We recommend that you do not use this feature in production environments", but I can note that in Ubuntu docker scenarios we're already using the containerd image store and regularly update our docker artifacts, so this might already be available to you today (via sending non-Docker work items and driving docker yourself).

Finally, given the comment "XHarness has some significant stability issues due to the inherent cross-process model it uses", I would point out that this type of unstable scenario is exactly what this issue is asking us to impose everywhere in Helix. That is definitely something I'd like to avoid absent a compelling reason.

markwilkie commented 1 year ago

My initial thought is to have a chat with @jkoritzinsky to see whether the shared test infra plans would work for your needs. Basically, Matt's suggestion that you drive the containers yourself makes sense to me too. However, the shared infra layer should give you additional tools/hooks/context to do that more sanely.

cc/ @riarenas, @alexperovich, @AlitzelMendez, @epananth