Open carlossanlop opened 11 months ago
on one hand, I'm not sure why we set `$HELIX_PYTHONPATH` and `$PYTHONPATH` or otherwise mess w/ Python inside Docker containers. @dotnet/dnceng does anyone remember why we do this in every "contained" work item❔ (MattGal's !23438 made a small change in this area but history starts w/ alexperovich's !3044.) my best guess is we decided to validate queues using Python even inside containers.
on the other hand, the recent runtime changes were back to a previously-used Alpine Docker image. seems like something changed w/ the various Python packages while they were using a different (newer) container. bottom line, we don't have much experience w/ the floating `pip` version we download and install in these containers.
"fix" choices may include
virtualenv
but it's not clear customers need anything more than some Python version in the container. our testing should also work fine w/o those packages installed if we stop checking the helix-scripts/ version.pip
version installed in our *-helix-*
containers. the ^^ Dockerfile just grabs the latest version at line 29cffi
version to one that has a usable .whl available for this platform. might be doable but could require some experimentation/cc @epananth and @davfost due to Docker implications /fyi @lbussell as well /cc @missymessa since doing something here seems urgent and you're on SSA /cc @ilyas1974 and @garath in case you instead want to handle this issue in Operations
> since doing something here seems urgent

Affects customers, so I believe the process is that SSA should triage and establish impact
set aside my general comments about why we need helix-scripts/ in Docker containers, or need to meet its Python requirements there. removing that everywhere could break some customers. we could add an issue to the Docker epic around finding out whether our customers need Python in containers and (maybe) removing it more broadly. a total removal wasn't intended to be my (1) choice.
rethinking the options above, we seem to have four choices. three involve updates to https://github.com/dotnet/dotnet-buildtools-prereqs-docker/blob/main/src/alpine/3.15/helix/arm32v7/Dockerfile and the runtime team picking up a new Docker tag; the last changes https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines?path=/resources/helix-scripts/runtime_python_requirements.txt and requires a helix-machines rollout (currently planned for the 29th). updating the Dockerfile likely involves reverting @hoyosjs' recent change which may have broken the runtime team (so that the next tag doesn't cause similar issues).
1. remove the `# Install Helix Dependencies` portion of this particular Dockerfile. the intersection of pipelines testing in this container w/ those using Python in their tests should be very small, if it exists at all. we don't use this container in DncEng testing and have no other requirements on our side for helix-scripts/ in any container
2. move the `# build MsQuic as we don't have Alpine packages (yet)` portion of the Dockerfile above the `# Install Helix Dependencies` portion. this at least has a chance of working if the former portion doesn't delete as much, leaving `gcc` and similar OS packages around for the `cffi` build
3. pin the `pip` version to 20.3.4 (changing the `python3 -m pip install --upgrade pip==22.2.2` line). this should automatically limit the `cffi` version (see below) because lower `pip` versions narrow what package versions are available
4. pin `cffi==1.15.1` in our `runtime_python_requirements.txt` file. Build #20231115.10 created a Mariner image, found this version of `cffi`, and got a compatible wheel i.e., it didn't need to compile anything. one caveat: that was a Mariner 2 image, not 3.15 but this shouldn't be relevant here

(1) sounds fastest but is also somewhat "dirty" in that it makes a `-helix-` container non-Helix and could be considered a breaking change. (1) gets cleaner if there's already a satisfactory MCR image or the edits create a new image w/o `-helix-` in its name.
@carlossanlop can you tell us what the overall impact of this error is for your PR?
I am unsure of the impact to my team; I was hoping dnceng could help initially determine if this is something to worry about or not. But maybe someone from @dotnet/runtime-infrastructure might know if this could affect runtime.
I only happened to notice it in an expected test failure in an unrelated PR when reading the console log, but it probably happens in all runs using the same queue; it just does not get reported as an error at all.
> rethinking the options above, we seem to have four choices. three involve updates to https://github.com/dotnet/dotnet-buildtools-prereqs-docker/blob/main/src/alpine/3.15/helix/arm32v7/Dockerfile and the runtime team picking up a new Docker tag
We unfortunately cannot update the tag at the moment, as we are blocked by an external issue (https://github.com/microsoft/msquic/issues/3958) which heavily impacts our CI (https://github.com/dotnet/runtime/issues/91757) -- this was the reason we pinned the tag in the first place (https://github.com/dotnet/runtime/pull/94609)
cc @wfurt @ManickaP
I wonder if this is a problem only 3.15 has? Because 3.15 is EOL now. We were looking into updating to 3.16 (but it is now blocked by the issue above as well)
Building and installing dependencies when the container starts is going to be expensive and unreliable IMHO. I was under the impression that the whole prereq repo exists to have everything ready to run the tests.
If the images are meant for testing, we should perhaps add a step to verify that Helix is able to run.
And I think we can temporarily disable Quic tests on Alpine arm32 @CarnaViire. Not great, but we should not block everything else IMHO. (or we can investigate and fix msquic ourselves, or move back to a released version instead of tracking current main)
@wfurt yeah, I was thinking about fixing the Arm32 alpine image on a specific dotnet/msquic commit instead of main, to be able to "unlock" the image for others -- let's continue this discussion in https://github.com/dotnet/runtime/issues/91757
> Building and installing dependencies when the container starts is going to be expensive and unreliable IMHO.
what's happening here is a quirk of how Helix runs commands in a Docker container. we execute `sudo $HELIX_PYTHONPATH -m pip install --disable-pip-version-check -r $HELIX_SCRIPT_ROOT/runtime_python_requirements.txt` prior to running whatever command(s) the Helix job specified. similar commands that run on startup of a Helix VM don't seem to be hitting similar issues.
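as a rough illustration, that pre-command can be modeled as below. the environment variable names come from the command above; the function, the example paths, and the act of only *building* the command (rather than running it) are my own sketch, not Helix's actual implementation:

```python
# Build (but don't run) the pip pre-command Helix executes inside a container
# before the work item's own commands. Variable names come from the comment
# above; everything else is illustrative.
def helix_pip_precommand(env):
    return [
        "sudo",
        env["HELIX_PYTHONPATH"],        # the Python interpreter Helix selected
        "-m", "pip", "install",
        "--disable-pip-version-check",
        "-r", env["HELIX_SCRIPT_ROOT"] + "/runtime_python_requirements.txt",
    ]

cmd = helix_pip_precommand({
    "HELIX_PYTHONPATH": "/usr/bin/python3",   # hypothetical paths
    "HELIX_SCRIPT_ROOT": "/helix/scripts",
})
print(" ".join(cmd))
```

a non-zero exit from this command is exactly what surfaces in the console logs as the `cffi` wheel-build failure.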
as I hinted above, I'm not quite sure of the overall reasoning except that we run `pylint` on our helix-scripts/ code even in Docker containers.
> let's continue this discussion in dotnet/runtime#91757
please let us know whether we need to help w/ anything once you have a Docker tag containing the "right" `msquic`. I suspect that'll mostly enable a change to the Dockerfile allowing the problematic command to succeed (see my suggestions a few comments up)
@carlossanlop, @CarnaViire, @wfurt I see runtime builds continue to hit this issue but I'm not sure why it remains in this repo. could we please move it to dotnet/runtime since we aren't helping❔
@dougbu we didn't update runtime yet to consume the rolling docker image (that's why it still shows up), but I will do that shortly.
But I didn't make any changes in the docker image besides fixing the msquic version. Would just a rolling image be enough, or are any additional changes needed?
> are any additional changes needed?
I don't know what'll happen next. it's possible the `msquic` problem will block observation of the `cffi` issue. if not, the solution would be the additional Dockerfile changes I described above in a fair amount of detail.
> it's possible the msquic problem will block observation of the cffi issue
But there will not (should not :D) be a problem with msquic anymore... it is mitigated within the dockerfile in https://github.com/dotnet/dotnet-buildtools-prereqs-docker/pull/933. @wfurt can you please help with the additional changes?
this is more complicated and goes beyond Alpine IMHO. @lbussell disabled builds for Debian 11 on ARM32 as it was failing in a similar way. I spent the last few days fiddling with ARM32 builds and here is what I have found.
There are no binary wheels for ARM32. Forcing a new `pip` will not bring the expected cryptography or other packages. There are binaries available from 3rd-party providers, but that would mean we would depend on untrusted sources, and I personally feel that is not desirable.
So building the dependencies from source is the only option IMHO. That can happen either during image creation (like Debian 11 in the prereq repo, or when the Helix queue is set up) or during test execution (like this particular case).
I would strongly suggest focusing on the first one, as doing it on every run is going to be expensive and unreliable IMHO. I know the current behavior was put in with the idea that Helix can fix itself and make things work, but that does not seem to be the practical outcome. It would also require us to put build tools into test images - making them bigger, and we would pay the cost over and over again.
Now, the problem with Debian is that the new `cryptography` requires a newer Rust. And here the story repeats.
We can fetch binaries from 3rd parties or use the rustup.rs script. Both options have problems with trust and authenticity IMHO. We could build Rust from source like we did clang & gcc for CentOS 6, but that is going to have a high maintenance cost.
My recommendation would be to relax the package version requirement and perhaps disable the runtime attempts to install what it wants. I feel it would be better to fail clearly and prompt action instead of consuming resources and dragging unnecessary baggage along. And all this is difficult to test; e.g. we see cases like this where something suddenly starts failing without an obvious reason.
I know there was an argument about security for the update. But as far as I can tell the cryptography 3.2.1 from Debian does not have any known flaw, and #929 was working fine AFAIK. While we may use newer packages on mainstream images, having a viable option on Alpine & Debian is also important. The security patches come from the OS vendor - just like anything else in the OS image.
So I think there is action for the Helix/infra team @dougbu. The MsQuic regression is out of the picture now, as @CarnaViire mentioned. But we still need to figure out packages and dependencies for ARM32 - unless we can convince PMs and leadership to drop support for it.
Long-term we should probably also improve test coverage for Helix. The .NET supported OS matrix should be the guiding document, and we need some solution for the odd-balls - like CentOS 6 and Ubuntu 16 were until recently.
I'm happy to share more details and/or share my dev environment - Machine from Helix dev pool and my personal Raspberry.
@wfurt I'm on vacation but am wondering about options here. We can't generally lower the requirements on our side but we may be able to special-case the problematic Docker containers. What specific change are you requesting❔
Backing up a bit… Is the issue here about `cffi`, `cryptography`, or both Python packages❔ I'm a bit confused by your most recent message @wfurt.
I suspect the solution will be to special-case an upper bound for the `cryptography` and / or `cffi` package using a `platform_machine` dependency specifier in our Python requirements. However I'm not positive how Python identifies the problematic ARM32 machines.
@wfurt (or anyone w/ access to an ARM32 Alpine 3.15 or Debian 11 box) could you please run the following short Python script on your test machines and let me know the results (I'm hoping the OSes are consistent)❔ I suspect `armv6l` is a potential value but that's just a suspicion.

```python
import platform
print(platform.machine())
```
I could likely create PRs somewhere to test but it should be faster if someone already has a box or VM available.
@dougbu JFYI: @wfurt is OOF today and tomorrow (and I don't have anything like this)
I get `armv7l` on my Raspberry Pi. For the Jetson machines I get `aarch64`. The CPU is arm64, but we run 32-bit containers on top of it; it cannot run and link 64-bit code, so the `machine` value may be misleading. Support for `armv6l` is community driven AFAIK. We have one pipeline AFAIK but that should not show up during "normal" runs.
I think it'll be pretty easy to special case `armv6l` and `armv7l`, but…

> For the Jetson machines I get aarch64.

that's not good b/c `aarch64` wheels do exist and we should use them for safety wherever possible.
however, `sys.platform`, `os.name`, `platform.release()`, and `platform.version()` all have corresponding dependency specifiers. could you check which of those should clearly indicate the Docker images that `armv6l` and `armv7l` won't help with please❔
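to compare the candidates in one go, a small stdlib-only script could print every value that has a corresponding dependency specifier (the dictionary and its layout are my own sketch):

```python
import os
import sys
import platform

# Each key is a PEP 508 environment marker name; each value is the Python
# call that produces it, so the output shows exactly what a requirements-file
# condition would see on this machine or container.
markers = {
    "os_name": os.name,
    "sys_platform": sys.platform,
    "platform_machine": platform.machine(),
    "platform_release": platform.release(),
    "platform_version": platform.version(),
}
for name, value in markers.items():
    print(f'{name} = "{value}"')
```

running this inside the problematic containers and on a normal queue would show which marker (if any) actually differs between the two.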
I'm not sure what the comment means @dougbu. Can you use `architecture` together with `machine`?
This is for the 32-bit docker image

```python
>>> print(platform.machine())
aarch64
>>> print(platform.architecture())
('32bit', '')
```

the base machine shows 64-bit as expected

```python
>>> print(platform.machine())
aarch64
>>> print(platform.architecture())
('64bit', '')
```
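combining the two calls as suggested could look like this sketch (the machine names reflect the values reported above; the helper function and its name are hypothetical):

```python
import platform

def is_problematic_arm32(machine, bits):
    # Native ARM32 reports armv6l/armv7l directly. The Jetson case reports
    # machine() == "aarch64" even though the userland is 32-bit, so we also
    # check the interpreter's bit width from platform.architecture().
    return machine in ("armv6l", "armv7l") or (
        machine == "aarch64" and bits == "32bit"
    )

bits, _linkage = platform.architecture()
print(is_problematic_arm32(platform.machine(), bits))
```

note the caveat that `platform.architecture()` inspects the running interpreter, so this only works when the check runs inside the same 32-bit container.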
@wfurt I'm looking for a way to specify a `cffi` and / or `cryptography` requirement specifically for the problematic platforms. something like `cryptography>=39.0.1,<=41.0.0; platform_machine == "armv6l" or platform_machine == "armv7l" or ...`. the https://packaging.python.org/en/latest/specifications/dependency-specifiers/#environment-markers table shows all of the environment markers available and the corresponding Python calls, e.g. `import sys; print(sys.platform)` for the `sys_platform` environment marker.
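to sanity-check such a marker without a rollout, here is a tiny, illustrative evaluator limited to the `platform_machine == "…" or …` form used above. real tools evaluate markers with the vendored `packaging.markers` module instead, and the version bounds here are placeholders, not the final pin:

```python
import re

# Hypothetical conditional requirement line, mirroring the PEP 508 syntax:
REQUIREMENT = (
    'cryptography>=39.0.1,<=41.0.0; '
    'platform_machine == "armv6l" or platform_machine == "armv7l"'
)

def marker_matches(requirement, platform_machine):
    # Split off the marker after ';' and collect every quoted machine name
    # compared against platform_machine. Only handles this one marker shape.
    _, _, marker = requirement.partition(";")
    allowed = re.findall(r'platform_machine\s*==\s*"([^"]+)"', marker)
    return platform_machine in allowed

print(marker_matches(REQUIREMENT, "armv7l"))   # True: the pin applies on ARM32
print(marker_matches(REQUIREMENT, "x86_64"))   # False: skipped elsewhere
```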
> … new Cryptography requires newer Rust. And here the story repeats.

After checking the old available wheels, I suspect this is the crux of the problem. There are no ARM32 wheels listed there, likely meaning the Docker container has always been building `cryptography` and perhaps other Python packages from source. `cryptography`'s changelog indicates they've bumped their Rust requirements fairly frequently.
With that in mind, we likely need to downgrade `cryptography` (and more) to the point where it's buildable again. What is the minimum Rust version used in these containers❔ Better, what is the minimum of the maximum Rust versions available for installation in the problematic containers❔
There are alternatives but none of them are quick:
Note that resolving the open questions about identifying the problematic systems in a way we can use in our requirements file, and choosing the `cryptography` version to use there, will not resolve anything this year. Changes will require a helix-machines rollout and we're not staffed to do that until next year.
/cc @ilyas1974 and @markwilkie ^^ for awareness of the state here 😦
I checked the requirement semantics @dougbu and it is going to be difficult IMHO. The version fields report the kernel version, not the actual user-mode setup. In the case of containers the kernel version comes from the base OS (like Ubuntu 18/20), and I did not figure out how to get any reasonable info about the actual container.
As far as the cryptography: it seems like prior to Cryptography 3.5, Rust was optional and the container builds were configured as such. The new documentation suggests that this is no longer the case, i.e. building `cryptography` will always need Rust as well. Or we would need to depend on binaries provided and maintained by the distribution owners.
We ditched Alpine 3.15; 3.16 provides
- Rust 1.60
- py3-cryptography-3.4.8

And Debian 11
- Rust 1.48
- python3-cryptography 3.3.2-1
> I did not figure out how to get any reasonable info about the actual container.

Unfortunately, none of the Python "environment variable" options provide real information about the container. I see https://github.com/dotnet/dotnet-buildtools-prereqs-docker/blob/7d0f9b0c308f54dc80bf275b69dc41d1b1856109/src/debian/11/helix/arm32v7/Dockerfile#L4C5-L4C26 sets `$_PYTHON_HOST_PLATFORM`. Does that change any of the supported settings❔
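a quick way to see whether that build-time variable is even visible at run time (this assumes it's exported into the container environment, which isn't guaranteed by the `ENV` line alone):

```python
import os

# _PYTHON_HOST_PLATFORM is set in the linked Dockerfile at image build time;
# if it survives into the runtime environment it could identify the container.
print(os.environ.get("_PYTHON_HOST_PLATFORM", "<unset>"))
```

on the problematic images this would presumably print something like the arm platform string; printing `<unset>` would rule this option out.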
> It seems like prior to Cryptography 3.5 rust was optional

Right. We might however be able to muddle through w/ Rust 1.48 in the relevant containers (while installing a compatible `cryptography`) and other workarounds. That would not leave Rust in the containers (unless it's already there) if we took the same approach as the Alpine 3.16 Dockerfile currently does for e.g. `cargo`.
Must admit however that we're talking about layering hacks on hacks. I prefer the alternatives I mentioned, even if they take longer to implement or get buy-in for.
didn't forget this issue during the holiday break. current status is I'm waiting for DDFun to set up a Raspberry Pi system for me to test things out on. that at least will allow me to experiment w/ an ARM32 system and get things working again, hopefully with near-current `cffi` and `cryptography` packages.
on the other hand, the "real" fixes here are for just Alpine 3.17 and Debian 11 on ARM32, correct❔ want to make sure I'm not missing another platform combo that's hitting what looks like this issue.
it is a rolling battle - but I think Alpine & Debian 11 are the tail at this moment @dougbu
/fyi this remains very much on our radar, especially because we provide Raspbian queues likely affected in a similar fashion. any summary of how (or if) people get around these issues will help w/ VMs as well as Docker containers.
our resource constraints and urgent requirements elsewhere mean I'm not actively investigating at the moment
I just found out about this issue which is surprisingly similar to the one I'm hitting in a completely different scenario: I'm maintaining python packages for the SynoCommunity using Synology Linux DSM for multiple archs (armv5-7-8, i686, ppc, x86_64).
Somehow, I can't tell when exactly, `cryptography` stopped building correctly due to a (?newer?) `cffi`. This comment from @dougbu https://github.com/dotnet/dnceng/issues/1423#issuecomment-1815783936 caught my attention. I am experimenting with an ugly hack https://github.com/SynoCommunity/spksrc/pull/6200/commits/0bb1f00a01ba9702e63d38331ece10413db1534e that makes use of that part of my PR https://github.com/SynoCommunity/spksrc/pull/6200 when generating my crossenv:
1) severely downgrade `pip`|`setuptools`|`wheel`
2) install `cffi` + `cryptography` (+ an older `msgpack`, as it fails on arm later on when installing `maturin` and friends)
3) re-upgrade `pip` and friends
4) pursue the installation of my other prerequisites into my virtualenv
Totally not elegant, but it looks like it's working?! Further, this problem seems specific to Python 3.10, as everything builds ok with 3.11. Just saying that I'll keep an eye on this thread, and I'll let you know if I find a reasonable fix or workaround. Cheers!
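for the record, the four steps above could be scripted roughly as follows. the specific version pins are illustrative guesses rather than the values from the linked PR, and this only builds the commands without running them:

```python
import sys

def pip_cmd(*args):
    # Each returned list is a pip invocation for the current interpreter.
    return [sys.executable, "-m", "pip", "install", *args]

steps = [
    # 1) severely downgrade the packaging toolchain (versions illustrative)
    pip_cmd("pip==20.3.4", "setuptools==59.8.0", "wheel==0.37.1"),
    # 2) build/install the troublesome packages under the old toolchain
    pip_cmd("cffi", "cryptography", "msgpack"),
    # 3) re-upgrade pip and friends
    pip_cmd("--upgrade", "pip", "setuptools", "wheel"),
    # 4) continue with the remaining prerequisites (placeholder file name)
    pip_cmd("-r", "requirements.txt"),
]
for step in steps:
    print(" ".join(step))
```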
Build
https://dev.azure.com/dnceng-public/public/_build/results?buildId=469944
Build leg reported
https://dev.azure.com/dnceng-public/public/_build/results?buildId=469944&view=logs&j=fd49b2ea-3ee3-5298-d793-bdb9fd631e7e&t=26d44400-a1b8-5ed1-d719-0cc35b38251b
Pull Request
https://github.com/dotnet/runtime/pull/93906
Known issue core information
Fill out the known issue JSON section by following the step by step documentation on how to create a known issue
@dotnet/dnceng
Release Note Category
Release Note Description
Additional information about the issue reported
Happens in (ubuntu.1804.armarch.open) using docker image mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.15-helix-arm32v7-20230807201723-7ea784e
The repo had some recent PRs modifying the alpine images:
My PR is not making any changes that could be causing this.
This error is only visible when there's an actual test error happening on these machines; otherwise, the CI leg passes.
Log file with the error: https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-93906-merge-9e1b570d2b3441d698/System.IO.Compression.Brotli.Tests/1/console.b000a501.log?helixlogtype=result
Known issue validation
Build: :mag_right: https://dev.azure.com/dnceng-public/public/_build/results?buildId=469944 Error message validated:
```
Building wheel for cffi (pyproject.toml) did not run successfully.
```
Result validation: :white_check_mark: Known issue matched with the provided build. Validation performed at: 11/15/2023 12:17:17 AM UTC

Report

Summary