Closed rbeucher closed 1 year ago
@rbeucher haven't seen this failure mode before. Can you put the squashfs somewhere I can see it so that I can have a look?
I think it's in the staging folder in the admin folder of the xp65 project. You need to be part of xp65_w to have access. I can give you access if you send me a request. Thanks a lot @dsroberts!
@rbeucher Instead of joining xp65_w where I can potentially break things, could you copy it into some place I can read it? Send me the path on slack once the copy is complete.
Hi @dsroberts ,
Did you get a chance to look into this? I tried again today and I got the same problem.
R
Hi @rbeucher
Sorry for the delay, between the ACCESS workshop, kids sick at home and other projects this slipped off my radar.
I got back to it today and I've figured out the issue. UCX is being brought in to your conda environments, which conflicts with the system installation of UCX that OpenMPI uses. Remember that in these environments the OpenMPI installation from conda-forge is replaced by the system OpenMPI installation. `mamba repoquery` suggests that `pyarrow` version 12 is bringing UCX in. `analysis3-unstable` is currently using `pyarrow` 11, which is why we haven't seen this. We also don't have `esmvalcore` installed. The import that is causing this failure is coming from somewhere inside that.
Our containerised environment does have `ucx` installed, but we pin the version to the latest available on Gadi (1.14.0) and add that to the `replace_from_apps` array in `install_config.sh`. That may work for you, but if it doesn't, the other option is to pin `libarrow<12` in `environment.yml`.
Dale
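For anyone wanting to confirm which installed packages pull UCX into an existing environment, a minimal sketch follows. It assumes the standard conda layout, where each installed package has a JSON record under `conda-meta/` with `name`, `version` and `depends` fields. The function name `find_dependents` is hypothetical, and this is not the check that was actually run in this thread (that was `mamba repoquery`).

```python
import json
from pathlib import Path


def find_dependents(prefix, target="ucx"):
    """Map 'name-version' to the dependency specs that mention `target`.

    `prefix` is the root of a conda environment (assumed layout:
    <prefix>/conda-meta/*.json, one record per installed package).
    """
    dependents = {}
    for meta_file in sorted(Path(prefix, "conda-meta").glob("*.json")):
        record = json.loads(meta_file.read_text())
        # A spec looks like "name", "name version" or "name version build";
        # compare only the leading package name.
        hits = [spec for spec in record.get("depends", [])
                if spec.split()[:1] == [target]]
        if hits:
            dependents[f"{record['name']}-{record['version']}"] = hits
    return dependents
```

Calling `find_dependents` with the environment prefix returns a dictionary of the packages that list `ucx` as a direct dependency, which should make it easier to decide whether to pin that package or to replace UCX with the system build.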
Thanks Dale. I'm gonna try that
Just to clarify, you get UCX from ucx-py, right? And then you replace UCX with the system one using the replace_from_apps array? ucx-py (and thus UCX) does not seem to be pinned.
Yes, however, we host our own version on the `coecms` conda channel. The package metadata for that has ucx pinned to 1.14.0. The `ucx-py` available through `conda-forge` is ancient, I'm not sure how you'll go with that.
Thanks a lot @dsroberts. It works now.
Hi @dsroberts
I have merged your changes and I am trying to deploy a new version.
Tests are timing out, so I might need to increase the walltime, but I also get this in the error log: