Ocean-Intelligent-Forecasting / XiHe-GlobalOceanForecasting


Running XiHe on an HPC cluster hits slurmstepd: error #3

Open aiwuyouxi opened 3 months ago

aiwuyouxi commented 3 months ago


Hello author, thank you for sharing the pre-trained model and code.

My procedure for running the code follows the setup in README.md. Specifically:

1. Download input_data, models, output_data, src, and pycdo.tar.gz, and lay them out according to the recommended folder structure;
2. Extract pycdo.tar.gz into the corresponding folder and run source bin/activate to get the correct Python environment (a sketch of these two steps follows the script below);
3. Edit the job submission script xihe.slurm for the HPC cluster:

#!/bin/bash
#SBATCH --job-name=inference_job
#SBATCH --output=inference_job.out
#SBATCH --error=inference_job.err
#SBATCH --time=01:00:00
#SBATCH --partition=kshcnormal
#SBATCH --nodes=5  
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=5  
#SBATCH --mem=8G  

cd work/Xihe/src

python inference.py --lead_day 1 --save_path output_data
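
For reference, steps 1 and 2 looked roughly like the following (the exact location of the pycdo directory under work/Xihe is my assumption based on the recommended layout):

cd work/Xihe                    # assumed project root from the recommended folder structure
mkdir -p pycdo
tar -xzf pycdo.tar.gz -C pycdo  # unpack the packaged Python environment
source pycdo/bin/activate       # activate it before submitting the job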

Submitting the job then produces the following error:

INFO:root:Loading input data from work/Xihe/input_data/input_surface_data/input_surface_20190101.npy
/opt/gridview/slurm/spool_slurmd/job70169366/slurm_script: line 14: 21016 Killed                  python inference.py --lead_day 1 --save_path output_data
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=70169366.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
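
Assuming Slurm accounting (sacct) is available on this cluster, I plan to check how much memory the step actually used before it was killed, with the job ID taken from the log above:

sacct -j 70169366 --format=JobID,State,ReqMem,MaxRSS,Elapsed
# a MaxRSS at or above the requested 8G would confirm the cgroup OOM kill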

In addition, I tried the following to resolve the problem:

1. Adjusting the batch size (as shown in the attached script), which still reports the same error: inference.txt.txt

2. Increasing the mem value, which fails at submission with: sbatch: error: Batch job submission failed: Job submission failed because too much memory was requested relative to the number of CPUs requested. The requested memory:CPU should be kept no more than DefMemPerCPU
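
What I intend to try next is raising cpus-per-task together with the memory request, so that the memory:CPU ratio stays within DefMemPerCPU. A sketch of the revised script (the 4G-per-CPU figure is only a guess, since I do not know the actual DefMemPerCPU on the kshcnormal partition):

#!/bin/bash
#SBATCH --job-name=inference_job
#SBATCH --output=inference_job.out
#SBATCH --error=inference_job.err
#SBATCH --time=01:00:00
#SBATCH --partition=kshcnormal
#SBATCH --nodes=1                # inference.py runs as a single process, so one node should suffice
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16       # raise CPUs together with memory
#SBATCH --mem-per-cpu=4G         # guessed value; total memory = 16 x 4G = 64G

cd work/Xihe/src
source ../pycdo/bin/activate     # assumed path to the extracted environment
python inference.py --lead_day 1 --save_path output_data

Using mem-per-cpu instead of mem keeps the memory:CPU ratio fixed no matter how many CPUs are requested.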

I hope the author can offer some help. Many thanks!