Ocean-Intelligent-Forecasting / XiHe-GlobalOceanForecasting


Running XiHe on an HPC cluster hits slurmstepd: error #3

Open aiwuyouxi opened 3 months ago

aiwuyouxi commented 3 months ago


Hello author, thank you for sharing the pre-trained model and code.

My procedure for running the code follows the setup in README.md. Specifically:

1. Download input_data, models, output_data, src, and pycdo.tar.gz, and lay them out according to the recommended folder structure;
2. Extract pycdo.tar.gz into the corresponding folder and run source bin/activate to get the correct Python environment (a sketch of these two steps follows the script below);
3. Edit the job submission script xihe.slurm for the HPC cluster:

#!/bin/bash
#SBATCH --job-name=inference_job
#SBATCH --output=inference_job.out
#SBATCH --error=inference_job.err
#SBATCH --time=01:00:00
#SBATCH --partition=kshcnormal
#SBATCH --nodes=5  
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=5  
#SBATCH --mem=8G  

cd work/Xihe/src

python inference.py --lead_day 1 --save_path output_data
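
For reference, steps 1 and 2 looked roughly like the following (the exact location of the pycdo directory under work/Xihe is my assumption based on the recommended layout):

cd work/Xihe                    # assumed project root from the recommended folder structure
mkdir -p pycdo
tar -xzf pycdo.tar.gz -C pycdo  # unpack the packaged Python environment
source pycdo/bin/activate       # activate it before submitting the job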

Submitting the job then produces the following error:

INFO:root:Loading input data from work/Xihe/input_data/input_surface_data/input_surface_20190101.npy
/opt/gridview/slurm/spool_slurmd/job70169366/slurm_script: line 14: 21016 Killed                  python inference.py --lead_day 1 --save_path output_data
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=70169366.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
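
Assuming Slurm accounting (sacct) is available on this cluster, I plan to check how much memory the step actually used before it was killed, with the job ID taken from the log above:

sacct -j 70169366 --format=JobID,State,ReqMem,MaxRSS,Elapsed
# a MaxRSS at or above the requested 8G would confirm the cgroup OOM kill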

In addition, I tried the following to resolve the problem:

1. Adjusting the batch size (as shown in the attached script), which still reports the same error: inference.txt.txt

2. Increasing the mem value, which fails at submission with: sbatch: error: Batch job submission failed: Job submission failed because too much memory was requested relative to the number of CPUs requested. The requested memory:CPU should be kept no more than DefMemPerCPU
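
What I intend to try next is raising cpus-per-task together with the memory request, so that the memory:CPU ratio stays within DefMemPerCPU. A sketch of the revised script (the 4G-per-CPU figure is only a guess, since I do not know the actual DefMemPerCPU on the kshcnormal partition):

#!/bin/bash
#SBATCH --job-name=inference_job
#SBATCH --output=inference_job.out
#SBATCH --error=inference_job.err
#SBATCH --time=01:00:00
#SBATCH --partition=kshcnormal
#SBATCH --nodes=1                # inference.py runs as a single process, so one node should suffice
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16       # raise CPUs together with memory
#SBATCH --mem-per-cpu=4G         # guessed value; total memory = 16 x 4G = 64G

cd work/Xihe/src
source ../pycdo/bin/activate     # assumed path to the extracted environment
python inference.py --lead_day 1 --save_path output_data

Using mem-per-cpu instead of mem keeps the memory:CPU ratio fixed no matter how many CPUs are requested.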

I hope the author can offer some help. Many thanks!