NVIDIA / modulus

Open-source deep-learning framework for building, training, and fine-tuning deep learning models using state-of-the-art Physics-ML methods
https://developer.nvidia.com/modulus
Apache License 2.0

🐛[BUG]: Too many open files error in ahmed_body_mgn example #551

Open naruhikot opened 2 weeks ago

naruhikot commented 2 weeks ago

Version

24.04

On which installation method(s) does this occur?

Docker

Describe the issue

I converted the Docker container to an Apptainer image. When I ran the training script, I got a "Too many open files" error. See the trimmed log below.

Minimum reproducible example

apptainer run --nv --bind `pwd`,$USER_DIR $CONTAINER_IMAGE bash -c "python train.py"

Relevant log output

...
OSError: [Errno 24] Too many open files: '/usr/local/lib/python3.10/dist-packages/pandas/core/api.py'
OSError: [Errno 24] Too many open files: '/usr/local/lib/python3.10/dist-packages/pandas/core/algorithms.py'
OSError: [Errno 24] Too many open files: '/usr/local/lib/python3.10/dist-packages/pandas/core/config_init.py'
...
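
For readers triaging the log above: on Linux, errno 24 is EMFILE, the per-process descriptor limit, as opposed to ENFILE (errno 23), the system-wide file table limit. The snippet below is a minimal sketch for decoding the number on the machine at hand; `describe_oserror` is a hypothetical helper name and is not part of Modulus or the ahmed_body_mgn example.

```python
import errno
import os


def describe_oserror(code):
    """Map a numeric errno to its symbolic name and message on this platform."""
    # errno.errorcode maps numbers to names such as "EMFILE"; fall back if unknown.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# 24 is the number reported in the traceback above.
print(describe_oserror(24))
# For comparison, the system-wide variant of the error.
print(describe_oserror(errno.ENFILE))
```

If the first line prints EMFILE, the limit being hit is the one that applies to the training process itself (and is inherited by any worker processes it starts), not a node-wide one.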

Environment details

Environment location: TSUBAME4.0 (https://www.t4.gsic.titech.ac.jp/en)
ulimit -n is set to 16384 on this system
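
The value reported by `ulimit -n` in a login shell may differ from what the Python process actually sees inside the container, so it can help to check from within the process. The following is a rough diagnostic sketch, assuming a Linux host with `/proc` mounted; `fd_report` is a hypothetical helper, not something provided by Modulus or Apptainer.

```python
import os
import resource


def fd_report():
    """Return (soft_limit, hard_limit, open_count) for the current process."""
    # RLIMIT_NOFILE is the per-process cap on open file descriptors.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # On Linux, /proc/self/fd has one entry per descriptor currently open.
        open_count = len(os.listdir("/proc/self/fd"))
    except OSError:
        # /proc is not available (non-Linux host or restricted container).
        open_count = None
    return soft, hard, open_count


if __name__ == "__main__":
    soft, hard, open_count = fd_report()
    print(f"RLIMIT_NOFILE soft={soft} hard={hard} open_fds={open_count}")
```

Running this with the same `apptainer run ... bash -c "python ..."` invocation as the training script would show whether the 16384 limit carries over into the container, and calling `fd_report()` periodically from the training loop would show how quickly descriptors accumulate.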