slurmstepd: error when running Xihe on an HPC cluster
Hello, and thank you for sharing the pretrained models and code.
My run followed the setup in README.md. The concrete steps were:
1. Downloaded input_data, models, output_data, src, and pycdo.tar.gz, and arranged them in the recommended folder structure;
2. Extracted pycdo.tar.gz into the corresponding folder and ran source bin/activate to get the correct Python environment;
3. Edited the job submission script xihe.slurm for the HPC cluster:
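The script follows the usual sbatch pattern sketched below. The partition name and resource values are placeholders standing in for my real settings; only the environment activation and the inference call are exactly as in my script:

```bash
#!/bin/bash
#SBATCH --job-name=xihe
#SBATCH --partition=gpu          # placeholder partition name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4        # the memory cap scales with this (see below)
#SBATCH --mem=16G                # placeholder; this is the value I tried raising later
#SBATCH --output=xihe_%j.log

# Activate the Python environment extracted from pycdo.tar.gz
source bin/activate

# The command that gets OOM-killed in the log below
python inference.py --lead_day 1 --save_path output_data
```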
Submitting the job then produces the following error:

INFO:root:Loading input data from work/Xihe/input_data/input_surface_data/input_surface_20190101.npy
/opt/gridview/slurm/spool_slurmd/job70169366/slurm_script: line 14: 21016 Killed python inference.py --lead_day 1 --save_path output_data
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=70169366.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
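If it helps with diagnosis: on clusters where Slurm accounting is enabled, the step's peak memory can be compared against the request with standard sacct usage (whether MaxRSS is populated depends on the cluster's accounting configuration):

```bash
sacct -j 70169366 --format=JobID,State,ReqMem,MaxRSS,Elapsed
```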
In addition, I tried the following to work around the problem:
1. Adjusted the batch size (script attached: inference.txt.txt); the same error still occurs.
2. Increased the value of --mem, after which submission itself fails (the memory:CPU constraint is illustrated after this list):
sbatch: error: Batch job submission failed: Job submission failed because too much memory was requested relative to the number of CPUs requested. The requested memory:CPU should be kept no more than DefMemPerCPU
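If I understand the submission limit correctly, this is general Slurm behavior rather than anything specific to Xihe: the requested memory may not exceed cpus-per-task × DefMemPerCPU, so raising --mem alone eventually trips the check, and --cpus-per-task has to grow with it. A sketch of the arithmetic, assuming DefMemPerCPU = 4096 MB purely for illustration (I have not confirmed the actual value on this cluster):

```bash
# Read the cluster's per-CPU memory cap
scontrol show config | grep DefMemPerCPU

# Assuming DefMemPerCPU = 4096 MB:
#   --cpus-per-task=4   --mem=16G  ->  16G <= 4 * 4096M, accepted
#   --cpus-per-task=4   --mem=64G  ->  64G >  4 * 4096M, rejected at submission
#   --cpus-per-task=16  --mem=64G  ->  64G <= 16 * 4096M, accepted
```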
I would be very grateful for any help with this. Thank you!