NCAR / container-dtc-nwp

End-to-end NWP systems in containers.
https://dtcenter.org/community-code/numerical-weather-prediction-nwp-containers

Build error on wps_wrf image #46

Closed fossell closed 2 years ago

fossell commented 2 years ago

Building the wps_wrf container fails with the error below. The failing step is the ls command in the Dockerfile that checks whether the executables were built. Because that command fails, we know the executables were not built, and the image build stops there. (Note: this is a fresh clone of the main branch of the repo, with no modifications.)

=> ERROR [ 7/12] RUN ls /comsoftware/wrf/WRF-4.1.3/main/real.exe /comsoftware/wrf/WRF-4.1.3/main/wrf.exe 0.3s


[ 7/12] RUN ls /comsoftware/wrf/WRF-4.1.3/main/real.exe /comsoftware/wrf/WRF-4.1.3/main/wrf.exe:

10 0.235 ls: cannot access /comsoftware/wrf/WRF-4.1.3/main/real.exe: No such file or directory

10 0.235 ls: cannot access /comsoftware/wrf/WRF-4.1.3/main/wrf.exe: No such file or directory


I added a redirect ( |tee >& /comsoftware/wrf/log.out ) to the Dockerfile so that docker build would not fail on that RUN command, which let me open a bash shell in a container from the resulting image and see what was going on. Interestingly, the WRF main libraries do seem to get built, so when the Dockerfile moves on to WPS, the WPS executable builds successfully. From the shell inside the container, the first error in the WRF compile log is:
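The general idea of that workaround (let the check run, capture its output, and keep the build going) can be sketched as below. This is only an illustration and not the exact Dockerfile change: the function name check_executables and the log path passed to it are hypothetical, and it forces a zero exit status explicitly instead of relying on the tee redirect.

```shell
#!/bin/sh
# Hypothetical helper: list the expected executables, record the result in a
# log file, and always return 0 so that a surrounding build step would not
# abort when the files are missing.
check_executables() {
    log_file="$1"
    shift
    if ls "$@" > "$log_file" 2>&1; then
        echo "all executables present"
    else
        echo "some executables missing, see $log_file"
    fi
    return 0
}
```

In a Dockerfile, the same effect could be had by ending the RUN command with || true after redirecting the ls output to a log file. Either way, the workaround should be removed once debugging is done, since it hides a genuine build failure.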

time mpif90 -o nl_get_0_routines.o -c -O0 -w -ffree-form -ffree-line-length-none -fconvert=big-endian -frecord-marker=4 -I../dyn_em -I../dyn_nmm -I/comsoftware/wrf/WRF-4.1.3/external/esmf_time_f90 -I/comsoftware/wrf/WRF-4.1.3/main -I/comsoftware/wrf/WRF-4.1.3/external/io_netcdf -I/comsoftware/wrf/WRF-4.1.3/external/io_int -I/comsoftware/wrf/WRF-4.1.3/frame -I/comsoftware/wrf/WRF-4.1.3/share -I/comsoftware/wrf/WRF-4.1.3/phys -I/comsoftware/wrf/WRF-4.1.3/wrftladj -I/comsoftware/wrf/WRF-4.1.3/chem -I/comsoftware/wrf/WRF-4.1.3/inc -I/comsoftware/libs/netcdf/include yy0.f90

gfortran: fatal error: Killed signal terminated program f951

compilation terminated.
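A "Killed signal terminated program" message from gfortran means the compiler process itself was killed from outside, which commonly happens when the system runs out of memory. A minimal sketch for locating the first such line in a compile log follows; the function name first_killed_line is hypothetical, and the log path is whatever file the compile output was written to.

```shell
#!/bin/sh
# Print the first line (prefixed with its line number) of a compile log that
# reports a compiler process being killed. Prints nothing and returns a
# non-zero status if the log contains no such line.
first_killed_line() {
    grep -n -m 1 'Killed signal terminated program' "$1"
}
```

The command just before the reported line in the log is usually the compile step that was running when the process was killed.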

I then compiled WRF manually inside the container, and the executables built successfully. So compiling inside a running container succeeds, while compiling during docker build fails. I can also build the upp image successfully, so this seems to be specific to wps_wrf (Jamie reports that all other containers build successfully).

I have also tried a completely clean build from scratch: wiping my qcow, running docker system prune, removing all images and layers, and making a fresh git clone of the repo. The behavior is the same. I also checked my disk space, and there is plenty available, so that is not the issue.

fossell commented 2 years ago

Summary of system and version specs:

Kate's specs: Mojave 10.14.6 and Docker Desktop 3.5.2 - FAIL

Jamie's specs: Catalina and Docker Desktop 3.5.2 - FAIL

Michelle's specs: Big Sur and Docker Desktop 3.5.1 - SUCCESS

When Michelle upgraded her Docker Desktop to v3.5.2, the wps_wrf build failed. (All used Docker Engine v20.10.7.)

ISSUE: Upgrading to Docker Desktop v3.5.2. Need to submit an issue to the Docker repo.

fossell commented 2 years ago

Also fails with the same error on Big Sur with Docker Desktop v3.3.3 and Docker Engine 20.10.6.

fossell commented 2 years ago

Recent tests indicate this could just be a memory issue. Increasing the memory to at least 10 GB has been successful across a number of tests and retests of the main branch and other feature branches. More testing is needed to confirm.
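A quick way to check the memory before starting a build might look like the sketch below. It assumes the memory in question is the amount allocated to Docker (on Docker Desktop, the memory setting under Resources), and the function name has_enough_memory is hypothetical. The threshold is taken from the 10 GB figure above.

```shell
#!/bin/sh
# Return 0 if a byte count is at least the given number of GiB, 1 otherwise.
# awk does the comparison to avoid overflow in shells with 32-bit arithmetic.
has_enough_memory() {
    total_bytes="$1"
    min_gib="$2"
    awk -v b="$total_bytes" -v g="$min_gib" \
        'BEGIN { exit !(b >= g * 1024 * 1024 * 1024) }'
}

# Example (needs a running Docker daemon; MemTotal is reported in bytes):
#   has_enough_memory "$(docker info --format '{{.MemTotal}}')" 10 \
#       || echo "Docker has less than 10 GiB of memory; the wps_wrf build may fail"
```

The total reported by docker info can come out slightly below the configured value, so a somewhat lower threshold may be needed in practice.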

fossell commented 2 years ago

All team members now appear to get repeatable successful builds of the wps_wrf image on macOS after increasing the memory to at least 10 GB. We assume this was the issue and that it is now resolved. Closing this issue.