Open JFA-Mbule opened 2 years ago
Without knowing the error, it's difficult to know what went wrong. There are a few things, though, that I can guess:
layout = x,y
where x*y*6 = atmos_ncores
and for ice and ocean, layout = x,y
where x*y = ocean_ncores
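The arithmetic above can be sketched as two tiny helper functions; this is an illustration only, and the layout values passed in are example numbers, not the model's defaults:

```python
# Sketch of the core-count rules described above.

def atmos_ncores(x, y):
    # Atmosphere (and land) use the cubed-sphere: 6 faces,
    # each decomposed into an x-by-y grid of MPI ranks.
    return x * y * 6

def ocean_ncores(x, y):
    # Ocean and ice are on a single grid, so no factor of 6.
    return x * y

print(atmos_ncores(12, 24))  # 1728
print(ocean_ncores(12, 24))  # 288
```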
Hi Thomas, thank you for your quick answer!
Initially, the error was about the "Partition Nodes Limit", that is,
*** JOB 10594601 CANCELLED AT 2022-07-31T16:00 DUE TO (PartitionNodeLimit) => Requested 66 Nodes on the Partition cptec, which have limits of ***
when I try to use the default configuration. And when I change this to 30 nodes or fewer, the run simply breaks. So, when I run the "sacct -j 10597390 --format=Jobname,partition,time,start,end,nnodes,state,nodelist,ncpus" command, I get this output:
For 30 nodes:
runTest-E+ cptec 1-00:00:00 2022-08-04T08:01:12 2022-08-04T08:01:13 30 FAILED sdumont[6201,6+ 1440
batch 2022-08-04T08:01:12 2022-08-04T08:01:13 1 FAILED sdumont6201 48
For 10 nodes:
runTest-E+ cptec 1-00:00:00 2022-08-04T00:10:13 2022-08-04T00:10:16 10 FAILED sdumont[6266-6+ 480
batch 2022-08-04T00:10:13 2022-08-04T00:10:16 1 FAILED sdumont6266 48
For only 1 node:
runTest-E+ cptec 1-00:00:00 2022-08-04T00:25:10 2022-08-04T00:25:12 1 FAILED sdumont6240 48
batch 2022-08-04T00:25:10 2022-08-04T00:25:12 1 FAILED sdumont6240 48
However, based on your answer, maybe the problem is that I forgot to change ocean_ncores in another namelist besides the one in ./ESM4_rundir/input.nml. I'll try to find the other namelist and change it to the correct ncores.
Thomas, in the first point of your answer, are y and x the number of nodes (nnode, --nodes=y) and the number of cores per node (ncore_node, --ntasks-per-node=x), or the other way around?
In the input.nml file, there are variables called layout. The layout is an array of two integers. The integers are related to the number of cores, not nodes.
Again, the best/easiest strategy for running the model is to run it with the prescribed number of cores. It's difficult to change the number of cores, especially for the ocean.
I don't really understand the information you are showing me. It looks like you are asking me how many nodes you need, and I have never worked with your specific computer. I think you need help from someone local to figure out how to get the model running and how many nodes you need.
Ok, I'll go back to the default settings. So, what do the x, y, and 6 numbers mean?
In the input.nml file there is &fv_core_nml with layout = 12,24. In this case, is 12 the number of cores for the atmosphere and 24 for the ocean?
The fv_core_nml is referring to the atmosphere only.
layout = 12,24 means that you have 12*24*6 = 1728 cores
for the atmosphere. You multiply by 6 because the atmosphere is using the cubed-sphere, so there are 6 faces of the sphere. The layout will be the same for the land.
For Ice and Ocean, you will have a different layout. For the ocean and ice layout, you multiply the numbers together to get the number of cores. These are not on the cubed-sphere, so you don't multiply by 6. There should be an atmos_pes and ocean_pes that show you the number of cores for the atmosphere and ocean.
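A quick consistency check of those two rules in Python (the atmosphere layout 12,24 is from this thread; the ocean layout here is a hypothetical placeholder, since the thread never quotes one):

```python
# Verify that a namelist layout matches the *_pes core counts.
# atmos layout 12,24 comes from the thread; ocean layout is invented.
atmos_layout = (12, 24)
atmos_pes = 1728

# Cubed-sphere atmosphere: 6 faces times x*y ranks per face.
assert atmos_layout[0] * atmos_layout[1] * 6 == atmos_pes

# Ocean/ice: just x*y ranks, no factor of 6.
ocean_layout = (30, 20)  # hypothetical example, not the default
ocean_pes = ocean_layout[0] * ocean_layout[1]
print(ocean_pes)  # 600 ranks for this example layout
```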
Oh, thank you for your answer, Thomas!
Yes, there are atmos_pes and ocean_pes, which are 1728 and 1437, respectively.
I'm trying some slurm configurations, using the standard ESM4 cores configurations. If it runs, I will tell you.
Hi, it's me again, Jaime. I managed to compile the model on our machine, as I mentioned in the previous issue. Now I'm struggling to run it: I'm having some problems running the model with the default (existing) settings on our machine (doing the first test runs).
The number of nodes/cores needed to run the model is large, and these quantities are not always available on our machine. That is, the run needs more than 3100 cores (actually 3165: atmos_npes = 1728, i.e. 1728 cores for the atmospheric model, and ocean_npes = 1437, i.e. 1437 for the ocean), as configured in the model's namelist (in the folder ./ESM4_rundir/input.nml) and run script (folder ./run/).
In the partition I have access to, each node has 48 cores, and in total those core counts do exist (there are about 90 nodes in the partition I can use), but I'm trying to test with just a few nodes (about 10, just to see how the model behaves). It seems to me that 66 of the 90 nodes would be a sufficient number to run the model with the default settings; however, the logistics of that are quite difficult. For that reason, I'm trying fewer nodes, just as a test, to see if the model will run and how it behaves. However, when I test with a configuration of 10 nodes or even fewer (I've also tested more than 10, in this case 30), the run breaks before it even starts. Could someone help me understand why?
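For what it's worth, the 66-node figure is consistent with the default core counts; a rough check, assuming one MPI rank per core and 48 cores per node:

```python
import math

# Default ESM4 core counts quoted in this thread.
atmos_npes = 1728
ocean_npes = 1437
total_cores = atmos_npes + ocean_npes   # 3165

cores_per_node = 48                     # per the partition described above
nodes_needed = math.ceil(total_cores / cores_per_node)
print(nodes_needed)  # 66
```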
Below is the bash script used to run the model:
Please, looking at these settings, can anyone see any errors that could be causing the test runs to break?
Another thing, when I change the default values (atm_cores=1728 and ocn_cores=1437), should I also change the values in the namelist (in ./ESM4_rundir/input.nml)? I made this change. I don't know if this is what is causing the crash.
Thanks.