ACCESS-NRI / MED-condaenv

A repository for the squashfs'd MED conda environments
Apache License 2.0

Merge Upstream #57

Closed rbeucher closed 1 year ago

rbeucher commented 1 year ago

Hi @dsroberts

I have merged your changes and I am trying to deploy a new version.

Tests are timing out, so I might need to increase the walltime, but I also get this in the error log:

No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      gadi-cpu-clx-1614
  Framework: pml

dsroberts commented 1 year ago

@rbeucher haven't seen this failure mode before. Can you put the squashfs somewhere I can see it so that I can have a look?

rbeucher commented 1 year ago

I think it's in the staging folder in the admin folder of the xp65 project. You need to be part of xp65_w to have access. I can give you access if you send me a request. Thanks a lot @dsroberts!

dsroberts commented 1 year ago

@rbeucher Instead of me joining xp65_w, where I could potentially break things, could you copy it somewhere I can read it? Send me the path on Slack once the copy is complete.

rbeucher commented 1 year ago

Hi @dsroberts ,

Did you get a chance to look into this? I tried again today and I got the same problem.

R

dsroberts commented 1 year ago

Hi @rbeucher

Sorry for the delay, between the ACCESS workshop, kids sick at home and other projects this slipped off my radar.

I got back to it today and I've figured out the issue. UCX is being brought into your conda environments, where it conflicts with the system installation of UCX used by OpenMPI. Remember that in these environments the OpenMPI installation from conda-forge is replaced by the system OpenMPI installation. mamba repoquery suggests that pyarrow version 12 is bringing UCX in. analysis3-unstable is currently using pyarrow 11, which is why we haven't seen this. We also don't have esmvalcore installed, and the import that is causing this failure is coming from somewhere inside that package.
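For anyone wanting to reproduce the "who brings UCX in" check against an already-built environment, here is a minimal Python sketch. It assumes the environment follows the usual conda layout, where each installed package has a JSON record under `conda-meta/` with `name` and `depends` fields. The function names `load_records` and `who_needs` and the sample records are hypothetical and for illustration only; this is not what `mamba repoquery` does internally.

```python
import json
from pathlib import Path


def load_records(prefix):
    """Read every package record from <prefix>/conda-meta/*.json."""
    meta_dir = Path(prefix) / "conda-meta"
    return [json.loads(p.read_text()) for p in sorted(meta_dir.glob("*.json"))]


def who_needs(records, target):
    """Return the sorted names of packages that list `target` as a dependency."""
    needed_by = set()
    for rec in records:
        for spec in rec.get("depends", []):
            # A dependency spec looks like "name [version] [build]";
            # the first whitespace-separated token is the package name.
            parts = spec.split()
            if parts and parts[0] == target:
                needed_by.add(rec["name"])
    return sorted(needed_by)


# Hypothetical records, made up for illustration (not real package metadata).
sample = [
    {"name": "pkg-a", "depends": ["ucx >=1.0", "python"]},
    {"name": "pkg-b", "depends": ["numpy"]},
]
print(who_needs(sample, "ucx"))  # prints ['pkg-a']
```

On a real environment, `who_needs(load_records("/path/to/env"), "ucx")` would list the packages that directly depend on `ucx`, with the path being a placeholder for the environment prefix.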

Our containerised environment does have ucx installed, but we pin the version to the latest available on Gadi (1.14.0) and add that to the replace_from_apps array in install_config.sh. That may work for you, but if it doesn't, the other option is to pin libarrow<12 in environment.yml.
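Whichever of the two options is used, it may help to verify the result after the environment is built. The sketch below checks a list of installed-package records against the two constraints mentioned above: `ucx` exactly at 1.14.0, or `libarrow` below major version 12. The function name `find_pin_violations`, its parameters, and the sample versions are assumptions made for illustration; the records are expected to carry `name` and `version` keys, as conda package records normally do.

```python
def find_pin_violations(records, exact_pins=None, max_major=None):
    """Return messages for installed packages that violate the given pins.

    exact_pins: {name: version} that must match exactly.
    max_major:  {name: N}, meaning the major version must be below N.
    """
    exact_pins = exact_pins or {}
    max_major = max_major or {}
    problems = []
    for rec in records:
        name, version = rec["name"], rec["version"]
        if name in exact_pins and version != exact_pins[name]:
            problems.append(f"{name} {version} != pinned {exact_pins[name]}")
        if name in max_major:
            # Assumes a plain dotted version such as "12.0.1" (no epoch prefix).
            major = int(version.split(".")[0])
            if major >= max_major[name]:
                problems.append(f"{name} {version} is not < {max_major[name]}")
    return problems


# Hypothetical installed versions, for illustration only.
installed = [
    {"name": "ucx", "version": "1.15.0"},
    {"name": "libarrow", "version": "12.0.1"},
]
for message in find_pin_violations(
    installed, exact_pins={"ucx": "1.14.0"}, max_major={"libarrow": 12}
):
    print(message)
```

An empty result means neither constraint is violated; only one of the two pins needs to be passed, depending on which option was chosen.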

Dale

rbeucher commented 1 year ago

Thanks Dale. I'm gonna try that

rbeucher commented 1 year ago

Just to clarify, you get UCX from ucx-py, right? And then you replace UCX with the system one using the replace_from_apps array? ucx-py (and thus UCX) does not seem to be pinned.

dsroberts commented 1 year ago

> Just to clarify, you get UCX from ucx-py, right? And then you replace UCX with the system one using the replace_from_apps array? ucx-py (and thus UCX) does not seem to be pinned.

Yes; however, we host our own version on the coecms conda channel. The package metadata for that has ucx pinned to 1.14.0. The ucx-py available through conda-forge is ancient, so I'm not sure how you'll go with that.

rbeucher commented 1 year ago

Thanks a lot @dsroberts. It works now.