elodie-kendall closed 2 years ago
I don't quite understand the problem. Can you describe exactly what you do?
Are you stopping and restarting the computation after a certain amount of time? It looks like you run from 900 to 966 or 967 and the code aborts before you restart again at 900. Can you increase the walltime? Or checkpoint every 50 timesteps?
Hi Timo. I am only running the model once. It runs to time step 966/967, and then I see in the log file that it resets to time step 900. The job keeps the same name and doesn't cancel or fail; it just keeps repeating this loop.
There is no place in the code that jumps from step 967 to step 900. Your output also shows that the code is resuming from a snapshot, which happens when you run ASPECT again. To me it looks like you are running it several times and the code crashes, hangs, or runs out of wall time in between.
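One way to confirm this from a log is to look for places where the time step number goes backwards, since each such place marks a fresh run that resumed from an older snapshot. The following is only a minimal sketch: the function name `find_restarts` is hypothetical, the pattern assumes that log lines contain the text `Timestep <number>`, and the sample lines are made up for illustration rather than taken from the attached log.

```python
import re

# Assumed line format: any log line containing "Timestep <number>".
TIMESTEP_RE = re.compile(r"Timestep (\d+)")


def find_restarts(log_lines):
    """Return (last_step_before, first_step_after) pairs for every place
    where the time step number decreases from one match to the next."""
    restarts = []
    previous = None
    for line in log_lines:
        match = TIMESTEP_RE.search(line)
        if match is None:
            continue  # not a time step line
        step = int(match.group(1))
        if previous is not None and step < previous:
            restarts.append((previous, step))
        previous = step
    return restarts


# Made-up sample lines in the assumed format (not real log output).
sample = [
    "Timestep 965",
    "Timestep 966",
    "Timestep 900",
    "Timestep 901",
]
print(find_restarts(sample))  # [(966, 900)]
```

To apply it to a real file, pass an open file handle instead of the sample list, for example `find_restarts(open("log.txt"))`, after adjusting the pattern to the actual log format.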
Sorry, my mistake. You're right, Timo: it was just incredibly slow and not checkpointing often enough before a timeout. Thanks!
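The loop can be reproduced with a little arithmetic. The sketch below is a toy model, not ASPECT code: it assumes that checkpoints are written at multiples of a fixed step interval, that every job is killed after the same number of steps, and that each new job resumes from the last checkpoint. The function name `simulate_jobs` and the interval of 100 steps are assumptions chosen to match the 900 and 966 seen in this thread; the interval of 50 is the one suggested above.

```python
def simulate_jobs(start_step, steps_per_job, checkpoint_every, n_jobs):
    """Return the step each successive job resumes from.

    Toy model: checkpoints land on multiples of `checkpoint_every`, and
    every job is killed after advancing `steps_per_job` steps.
    """
    resume_points = []
    step = start_step
    for _ in range(n_jobs):
        resume_points.append(step)
        reached = step + steps_per_job
        # Last checkpoint written at or before the step where the job died.
        last_checkpoint = (reached // checkpoint_every) * checkpoint_every
        # A job can never fall behind the checkpoint it started from.
        step = max(step, last_checkpoint)
    return resume_points


# 66 steps per job, checkpoint every 100 steps (assumed): stuck at 900.
print(simulate_jobs(900, 66, 100, 4))  # [900, 900, 900, 900]

# Same job length, checkpoint every 50 steps: each job makes progress.
print(simulate_jobs(900, 66, 50, 4))   # [900, 950, 1000, 1050]
```

In this model a job only makes progress if it can complete at least one full checkpoint interval before it is killed, which is why either a longer wall time or more frequent checkpoints breaks the loop.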
Hi,
I am running some big global 3D shell models (which ran fine in 2D). They run fine until a certain time step (966), and then the code resumes from the time step 900 snapshot. This happens continually and the model can't progress. All parameters look fine, so maybe the output files are becoming too large again. I am using the latest deal.II version (10.0-pre) with the latest ASPECT branch, which should have fixed the >40GB issue raised earlier. The largest file in my output folder is restart.mesh_fixed.data, at 43GB. I attach my prm and log file.
Thanks, Elodie

Attached: log.txt