AndreWeiner / ml-cfd-lecture

Lecture material for machine learning applied to computational fluid mechanics

Exercise 3: Error with ./Allrun in renumberMesh #33

Closed hyzahw closed 9 months ago

hyzahw commented 9 months ago

Hi,

I followed the steps in tutorial 3 to run the cylinder case in OpenFOAM. The ./Allrun script ran blockMesh and decomposePar fine but failed with an error in renumberMesh and consequently in pimpleFoam.

The renumberMesh log file shows the following:

INFO:    squashfuse not found, will not be able to mount SIF or other squashfs files
INFO:    fuse2fs not found, will not be able to mount EXT3 filesystems
INFO:    gocryptfs not found, will not be able to use gocryptfs
INFO:    Converting SIF file to temporary sandbox...
INFO:    squashfuse not found, will not be able to mount SIF or other squashfs files
INFO:    fuse2fs not found, will not be able to mount EXT3 filesystems
INFO:    gocryptfs not found, will not be able to use gocryptfs
INFO:    Converting SIF file to temporary sandbox...
FATAL:   while extracting /home/hyzahw/MLCFD/ml-cfd-lecture/of2206-py1.12.1-cpu.sif: root filesystem extraction failed: extract command failed:
WARNING: passwd file doesn't exist in container, not updating
WARNING: group file doesn't exist in container, not updating
WARNING: Skipping mount /etc/hosts [binds]: /etc/hosts doesn't exist in container
WARNING: Skipping mount /etc/localtime [binds]: /etc/localtime doesn't exist in container
WARNING: Skipping mount /usr/local/var/apptainer/mnt/session/tmp [tmp]: /tmp doesn't exist in container
WARNING: Skipping mount /usr/local/var/apptainer/mnt/session/var/tmp [tmp]: /var/tmp doesn't exist in container
WARNING: Skipping mount /usr/local/var/apptainer/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
FATAL ERROR:write_file: failed to create file /image/root/opt/libtorch/include/ATen/ops/combinations_compositeimplicitautograd_dispatch.h, because Too many open files
Parallel unsquashfs: Using 8 processors
74903 inodes (86780 blocks) to write
: exit status 1
FATAL:   while extracting /home/hyzahw/MLCFD/ml-cfd-lecture/of2206-py1.12.1-cpu.sif: root filesystem extraction failed: extract command failed:
WARNING: passwd file doesn't exist in container, not updating
WARNING: group file doesn't exist in container, not updating
WARNING: Skipping mount /etc/hosts [binds]: /etc/hosts doesn't exist in container
WARNING: Skipping mount /etc/localtime [binds]: /etc/localtime doesn't exist in container
WARNING: Skipping mount /usr/local/var/apptainer/mnt/session/tmp [tmp]: /tmp doesn't exist in container
WARNING: Skipping mount /usr/local/var/apptainer/mnt/session/var/tmp [tmp]: /var/tmp doesn't exist in container
WARNING: Skipping mount /usr/local/var/apptainer/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
FATAL ERROR:write_file: failed to create file /image/root/opt/libtorch/include/ATen/ops/as_strided_copy.h, because Too many open files
Parallel unsquashfs: Using 8 processors
74903 inodes (86780 blocks) to write
: exit status 1


The first four INFO lines also appear in my log files for blockMesh and decomposePar. I am not sure whether this is normal, but I believe it is related to the fatal errors further down in the log.
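
In case it is relevant, the "Too many open files" part made me wonder about the open-file limit. A quick check I could try (the value 65536 is just an example, not something from the lecture material):

ulimit -n           # show the current soft limit on open file descriptors
ulimit -n 65536     # temporarily raise it for this shell session
./Allrun            # re-run the case afterwards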

Thank you in advance!

JanisGeise commented 9 months ago

Hi @hyzahw,

It looks to me like an issue with the Apptainer container. To help you better, can you check the following (a few quick checks are sketched below the list):

  1. are you using Linux as your native OS, or are you using WSL (Windows Subsystem for Linux)?
  2. did you follow the setup explained in exercise 2, especially the part regarding Apptainer and building the container?
  3. is the container located at the top-level of the repository?
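
For points 1 and 3, a few quick terminal checks along these lines should be enough (the image name is taken from your log output):

uname -r                          # a kernel string containing "microsoft" indicates WSL
apptainer --version               # confirms that Apptainer itself is installed
ls -lh of2206-py1.12.1-cpu.sif    # run from the repository top level; the file should be listed here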

Regards, Janis

hyzahw commented 9 months ago

Hi Janis,

Thanks for your answer. I am using Ubuntu as a native OS. I went through the Apptainer installation steps again, and the errors concerning squashfuse and fuse2fs are gone now, so I guess the Apptainer problem is solved.

I now get this error when the runParallel renumberMesh step in ./Allrun starts:

--> FOAM FATAL ERROR: (openfoam-2206)
attempt to run parallel on 1 processor

    From static bool Foam::UPstream::init(int&, char**&, bool)
    in file UPstream.C at line 286.

FOAM aborting

I tried the ./Allrun command in another tutorial that also runs in parallel. As before, decomposePar was executed normally, and the same error appeared once the runParallel step was reached. Do you have any recommendations?

JanisGeise commented 9 months ago

This may be an issue with MPI. Can you check which version is installed by executing the command mpiexec --version in a terminal? You can further check whether the simulation runs when executed on only one CPU. To do so, just change the Allrun script from:

# decompose and run case
runApplication decomposePar
runParallel renumberMesh -overwrite
runParallel $(getApplication)

to:

# decompose and run case
# runApplication decomposePar
# runParallel renumberMesh -overwrite
runApplication $(getApplication)
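
For background, runParallel is a helper from OpenFOAM's RunFunctions that essentially wraps the application in an mpirun call, roughly like the sketch below (not the exact command). The "attempt to run parallel on 1 processor" error therefore suggests that only a single MPI rank is actually being started:

# rough sketch of what runParallel renumberMesh -overwrite expands to for 2 subdomains
mpirun -np 2 renumberMesh -parallel -overwrite > log.renumberMesh 2>&1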

Edit: I get this error when numberOfSubdomains in system/decomposeParDict is set to 1. Can you check whether this parameter is set to 2 in your case? It should look like: numberOfSubdomains 2;

Regards Janis

hyzahw commented 9 months ago

I was using MPI version 3.3.2 when I encountered the problem. I have now upgraded to 5.0.1 and still get the same error. Can you mention the version that you are using? Should OF2206 be compatible with a specific MPI version?

The case runs normally if I don't decompose.

The decomposeParDict is similar to the original file in test_cases:

numberOfSubdomains  2;

method              hierarchical;

coeffs
{
    n               (2 1 1);
}

JanisGeise commented 9 months ago

Hi @hyzahw,

I have MPI version 4.0.3 installed; as far as I know, the Apptainer container uses MPI version 4.1.2. In the past, I have encountered these issues with executing simulations when the difference between the MPI versions is too large.

If it is not too much trouble, you could try to install the MPI version used in the container; you can check which version it uses by executing mpiexec --version inside the container, for example as sketched below. Alternatively, maybe @AndreWeiner has some additional ideas or tips that may help you with this issue.
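
For comparing the two versions, something like this should work (assuming the image sits at the top level of the repository):

mpiexec --version                                          # MPI version on the host
apptainer exec of2206-py1.12.1-cpu.sif mpiexec --version   # MPI version inside the container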

Regards Janis

hyzahw commented 9 months ago

After installing MPI version 4.1.2, the simulation runs in parallel normally.
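
For reference, roughly what I did to get a matching Open MPI (just a sketch of a source build; the download URL follows Open MPI's usual release layout and the install prefix may differ on other systems):

# build and install Open MPI 4.1.2 from source (sketch; URL and prefix may need adjusting)
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.2.tar.gz
tar -xzf openmpi-4.1.2.tar.gz
cd openmpi-4.1.2
./configure --prefix=/usr/local
make -j 4
sudo make install
sudo ldconfig               # refresh the shared-library cache
mpiexec --version           # should now report 4.1.2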

Thanks @JanisGeise! ;)