cybergis / cybergis-compute-core


[Bug] All Anvil models need updating #118

Open · fbaig opened this issue 1 month ago

fbaig commented 1 month ago

Due to a backend change on Anvil, any Anvil model specifying the mem_per_cpu option will fail with the following (or similar) error:

```
srun: fatal: cpus_per_task set by two different environment variables SLURM_CPUS_PER_TASK=2 != SLURM_TRES_PER_TASK=cpu:1
```

According to the updated Anvil configuration, memory is now assigned automatically at 2 GB/core, so the mem_per_cpu option is redundant and causes job submissions to fail. See https://github.com/I-GUIDE/container_images/issues/11 for further details.

Possible Resolution

Remove the mem_per_cpu option from the manifest.json of affected models (but see the discussion below for a caveat with multi-HPC models).
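To make the change concrete, here is a hedged before/after sketch with the manifest inlined as a TypeScript object. The field names (supported_hpc, slurm_input_rules, num_of_task) and the values are assumptions drawn from this thread, not a verified manifest schema; the only substantive change is dropping mem_per_cpu.

```typescript
// Hedged before/after sketch of the proposed manifest change. Field
// names and values are assumptions based on this thread, not the real
// manifest.json schema; the point is only the removal of mem_per_cpu,
// which Anvil now sets automatically at 2 GB/core.
const manifestBefore = {
  supported_hpc: ["anvil_community"],
  slurm_input_rules: {
    num_of_task: { max: 4, min: 1, default_value: 1 },
    // This entry is what now breaks Anvil submissions:
    mem_per_cpu: { max: "4GB", min: "1GB", default_value: "2GB" },
  },
};

const manifestAfter = {
  supported_hpc: ["anvil_community"],
  slurm_input_rules: {
    num_of_task: { max: 4, min: 1, default_value: 1 },
    // mem_per_cpu removed; Anvil allocates memory automatically.
  },
};
```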

alexandermichels commented 1 month ago

Added an announcement to the UI to notify users: [screenshot of the UI announcement]

fbaig commented 1 month ago

Great, thanks. Is there a way to identify which models are configured to use Anvil on our end?

alexandermichels commented 1 month ago

There isn't a good way, no. I guess the best approach would be to go through this page (https://cgjobsup.cigi.illinois.edu/v2/git), Ctrl+F for "anvil_community", and then create an issue on each repo?
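If it helps, a small script could automate that Ctrl+F. This is a hypothetical sketch assuming Node 18+ (global fetch) and that the /v2/git endpoint returns JSON whose entries embed each model's metadata; the response shape is an assumption, so the parsing may need adjusting to the real payload.

```typescript
// Hypothetical helper: list registered models that mention
// "anvil_community" anywhere in their metadata, automating the manual
// Ctrl+F suggested above. The response shape of /v2/git is assumed.
async function findAnvilModels(): Promise<string[]> {
  const res = await fetch("https://cgjobsup.cigi.illinois.edu/v2/git");
  const body = (await res.json()) as Record<string, unknown>;
  return Object.entries(body)
    .filter(([, entry]) => JSON.stringify(entry).includes("anvil_community"))
    .map(([id]) => id);
}

findAnvilModels().then((ids) => console.log(ids.join("\n")));
```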

fbaig commented 1 month ago

The proposed solution above only works for models that use Anvil exclusively. If a model supports more than one HPC, removing the mem_per_cpu option altogether may cause unexpected behavior on the non-Anvil HPCs.

Is there a way to provide conditional configurations in manifest.json? If not, I think it may be easier to update cybergis-compute-core to ignore this parameter when submitting jobs to Anvil.
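As a rough illustration of the core-side option, here is a minimal sketch of a submission-time filter. The type, function name, and the HPC id "anvil_community" are assumptions based on this thread, not the actual cybergis-compute-core internals:

```typescript
// Minimal sketch of the core-side workaround: drop mem_per_cpu from the
// SLURM options when the target HPC is Anvil. Names are assumptions
// based on this thread, not the actual cybergis-compute-core code.
interface SlurmConfig {
  mem_per_cpu?: string;
  [option: string]: unknown;
}

function sanitizeForHpc(hpc: string, config: SlurmConfig): SlurmConfig {
  if (hpc === "anvil_community") {
    // Anvil assigns 2 GB/core automatically; a user-supplied mem_per_cpu
    // now triggers the SLURM_CPUS_PER_TASK / SLURM_TRES_PER_TASK mismatch.
    const { mem_per_cpu, ...rest } = config;
    return rest;
  }
  return config;
}

// Usage: config = sanitizeForHpc(job.hpc, config) just before submission.
```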

alexandermichels commented 1 month ago

I don't think it's a big deal. Globus on Keeling is broken, I don't think we currently have credits on Bridges or Expanse, and ACES requires per-user approval, so Anvil is the main HPC in use at the moment.

We don't currently have a way to remove configs on a per-HPC basis, only to add them. If Anvil can't fix this issue and modifying the manifests won't work, I can hack together a patch tomorrow and put it in production, but a longer-term solution might take a while because the code isn't set up for it.
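For the longer term, one possible shape for per-HPC removals is sketched below as a hypothetical schema extension plus a resolver. Nothing like slurm_input_rules_overrides exists in the current manifest format; this is purely an illustration of the idea:

```typescript
// Illustrative design sketch of per-HPC manifest overrides: a
// hypothetical slurm_input_rules_overrides field that can remove or
// replace rules for a specific HPC before submission.
interface SlurmRules {
  [rule: string]: unknown;
}

interface Manifest {
  slurm_input_rules: SlurmRules;
  slurm_input_rules_overrides?: {
    [hpcId: string]: { remove?: string[]; set?: SlurmRules };
  };
}

function resolveRules(manifest: Manifest, hpc: string): SlurmRules {
  const rules: SlurmRules = { ...manifest.slurm_input_rules };
  const override = manifest.slurm_input_rules_overrides?.[hpc];
  for (const key of override?.remove ?? []) delete rules[key];
  return { ...rules, ...override?.set };
}

// A manifest could then keep mem_per_cpu for other HPCs and drop it on
// Anvil with: { anvil_community: { remove: ["mem_per_cpu"] } }
```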