NVlabs / nvdiffrec

Official code for the CVPR 2022 (oral) paper "Extracting Triangular 3D Models, Materials, and Lighting From Images".

docker running problem #52

Closed sadexcavator closed 1 year ago

sadexcavator commented 2 years ago

Hello, when I use Docker to run train.py, the training process only runs one pass, but when I run it on Colab it runs two passes. To be more specific, the Colab result contains two parts, dmtet_mesh and mesh, but there is only dmtet_mesh when using Docker. Sorry, I haven't looked all the way through your code yet, and I wonder what caused this problem. Thank you for your time.

jmunkberg commented 2 years ago

It is hard to say without seeing the log from your run. After the first pass, we call xatlas to generate UV coordinates, then run a second optimization pass with fixed topology to fine tune the shape and materials. I speculate that the xatlas pass failed for your example. We have tested the provided docker setup https://github.com/NVlabs/nvdiffrec/#server-usage-through-docker extensively, so it should work fine, at least on the provided examples.
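For reference, the UV step between the two passes is an xatlas parametrization of the extracted base mesh. Below is a minimal, standalone sketch of that call using toy data; it is only an illustration of the step that can fail, not the exact code in train.py:

    # Toy example of the xatlas UV-unwrapping step that runs between the two passes.
    # Requires the "xatlas" Python package; the quad below is placeholder data.
    import numpy as np
    import xatlas

    positions = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=np.float32)
    faces = np.array([[0, 1, 2], [0, 2, 3]], dtype=np.uint32)

    # parametrize returns, for each output vertex, an index into the original vertex
    # array, plus the re-indexed triangles and the generated per-vertex UVs.
    vmapping, indices, uvs = xatlas.parametrize(positions, faces)
    print(vmapping.shape, indices.shape, uvs.shape)

If this step fails, only the dmtet_mesh output from the first pass is written, which would match what you describe.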

sadexcavator commented 2 years ago

> It is hard to say without seeing the log from your run. After the first pass, we call xatlas to generate UV coordinates, then run a second optimization pass with fixed topology to fine tune the shape and materials. I speculate that the xatlas pass failed for your example. We have tested the provided docker setup https://github.com/NVlabs/nvdiffrec/#server-usage-through-docker extensively, so it should work fine, at least on the provided examples.


Hi, this is my running log for the bob example (I changed the number of iterations in the config file just to see the final result sooner; I don't think that matters for this problem). Please see below.

(base) wsr@cvlab-3:~/workspace/nvdiffrec-main$ docker run --gpus device=5 -v /home/wsr/workspace/nvdiffrec-main:/usr/src/rongtest -w /usr/src/rongtest nvdiffrec:v1 python train.py --config configs/bob.json

=============
== PyTorch ==
=============

NVIDIA Release 22.01 (build 31424411)
PyTorch Version 1.11.0a0+bfe5ad2

Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2022 Facebook Inc. Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert) Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu) Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu) Copyright (c) 2011-2013 NYU (Clement Farabet) Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston) Copyright (c) 2006 Idiap Research Institute (Samy Bengio) Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz) Copyright (c) 2015 Google Inc. Copyright (c) 2015 Yangqing Jia Copyright (c) 2013-2016 The Caffe contributors All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License. By pulling and using the container, you accept the terms and conditions of this license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: MOFED driver for multi-node communication was not detected. Multi-node communication performance may be reduced.

NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be insufficient for PyTorch. NVIDIA recommends the use of the following flags: docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

Config / Flags: config configs/bob.json iter 500 batch 4 spp 1 layers 1 train_res [512, 512] display_res [512, 512] texture_res [1024, 1024] display_interval 0 save_interval 100 learning_rate [0.03, 0.003] min_roughness 0.08 custom_mip False random_textures True background white loss logl1 out_dir out3/bob ref_mesh data/bob/bob_tri.obj base_mesh None validate False mtl_override None dmtet_grid 64 mesh_scale 2.1 env_scale 2.0 envmap data/irrmaps/aerodynamics_workshop_2k.hdr display None camera_space_light False lock_light False lock_pos False sdf_regularizer 0.2 laplace relative laplace_scale 10000.0 pre_load True kd_min [0.0, 0.0, 0.0, 0.0] kd_max [1.0, 1.0, 1.0, 1.0] ks_min [0, 0.25, 0] ks_max [1.0, 1.0, 1.0] nrm_min [-1.0, -1.0, 0.0] nrm_max [1.0, 1.0, 1.0] cam_near_far [0.1, 1000.0] learn_light True local_rank 0 multi_gpu False
DatasetMesh: ref mesh has 10688 triangles and 5344 vertices
---> WARNING: Picked a texture resolution lower than the reference mesh [1024, 1024] < [2048, 2048]
Using /root/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu116/renderutils_plugin/build.ninja...
Building extension module renderutils_plugin...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/8] ... [8/8] c++ / nvcc compile and link commands for the renderutils_plugin extension (garbled in the paste, omitted)
Loading extension module renderutils_plugin...
DatasetMesh: ref mesh has 10688 triangles and 5344 vertices
---> WARNING: Picked a texture resolution lower than the reference mesh [1024, 1024] < [2048, 2048]
Encoder output: 32 dims
iter= 0, img_loss=0.421821, reg_loss=0.334144, lr=0.02999, time=265.7 ms, rem=2.21 m
iter= 10, img_loss=0.166641, reg_loss=0.314939, lr=0.02985, time=273.5 ms, rem=2.23 m
iter= 20, img_loss=0.071345, reg_loss=0.279273, lr=0.02971, time=273.3 ms, rem=2.19 m
iter= 30, img_loss=0.053433, reg_loss=0.243493, lr=0.02957, time=274.9 ms, rem=2.15 m
iter= 40, img_loss=0.038217, reg_loss=0.218297, lr=0.02944, time=275.4 ms, rem=2.11 m
iter= 50, img_loss=0.025294, reg_loss=0.194842, lr=0.02930, time=276.4 ms, rem=2.07 m
iter= 60, img_loss=0.022160, reg_loss=0.171530, lr=0.02917, time=277.3 ms, rem=2.03 m
iter= 70, img_loss=0.018654, reg_loss=0.148734, lr=0.02903, time=279.0 ms, rem=2.00 m
iter= 80, img_loss=0.017355, reg_loss=0.126169, lr=0.02890, time=279.2 ms, rem=1.95 m
iter= 90, img_loss=0.015939, reg_loss=0.103857, lr=0.02877, time=279.3 ms, rem=1.91 m
iter= 100, img_loss=0.015873, reg_loss=0.081553, lr=0.02864, time=279.0 ms, rem=1.86 m
iter= 110, img_loss=0.014372, reg_loss=0.059365, lr=0.02851, time=279.8 ms, rem=1.82 m
iter= 120, img_loss=0.012976, reg_loss=0.037142, lr=0.02837, time=281.6 ms, rem=1.78 m
iter= 130, img_loss=0.011797, reg_loss=0.018180, lr=0.02824, time=282.1 ms, rem=1.74 m
iter= 140, img_loss=0.010529, reg_loss=0.015976, lr=0.02811, time=283.2 ms, rem=1.70 m
iter= 150, img_loss=0.009927, reg_loss=0.016029, lr=0.02798, time=282.8 ms, rem=1.65 m
iter= 160, img_loss=0.008845, reg_loss=0.016047, lr=0.02786, time=283.9 ms, rem=1.61 m
iter= 170, img_loss=0.008995, reg_loss=0.016048, lr=0.02773, time=282.7 ms, rem=1.55 m
iter= 180, img_loss=0.008096, reg_loss=0.016034, lr=0.02760, time=283.5 ms, rem=1.51 m
iter= 190, img_loss=0.007980, reg_loss=0.016021, lr=0.02747, time=282.8 ms, rem=1.46 m
iter= 200, img_loss=0.007826, reg_loss=0.015982, lr=0.02735, time=283.2 ms, rem=1.42 m
iter= 210, img_loss=0.007159, reg_loss=0.015964, lr=0.02722, time=282.3 ms, rem=1.36 m
iter= 220, img_loss=0.007160, reg_loss=0.015953, lr=0.02710, time=281.6 ms, rem=1.31 m
iter= 230, img_loss=0.006890, reg_loss=0.015927, lr=0.02697, time=282.9 ms, rem=1.27 m
iter= 240, img_loss=0.006535, reg_loss=0.015904, lr=0.02685, time=282.7 ms, rem=1.23 m
iter= 250, img_loss=0.006186, reg_loss=0.015873, lr=0.02673, time=282.9 ms, rem=1.18 m
iter= 260, img_loss=0.006303, reg_loss=0.015866, lr=0.02660, time=282.1 ms, rem=1.13 m
iter= 270, img_loss=0.006325, reg_loss=0.015859, lr=0.02648, time=281.9 ms, rem=1.08 m
iter= 280, img_loss=0.006111, reg_loss=0.015851, lr=0.02636, time=284.1 ms, rem=1.04 m
iter= 290, img_loss=0.005941, reg_loss=0.015837, lr=0.02624, time=284.0 ms, rem=59.63 s
iter= 300, img_loss=0.006388, reg_loss=0.015840, lr=0.02612, time=283.1 ms, rem=56.62 s
iter= 310, img_loss=0.006840, reg_loss=0.015851, lr=0.02600, time=284.1 ms, rem=53.98 s
iter= 320, img_loss=0.006213, reg_loss=0.015852, lr=0.02588, time=284.2 ms, rem=51.16 s
iter= 330, img_loss=0.005708, reg_loss=0.015842, lr=0.02576, time=284.7 ms, rem=48.39 s
iter= 340, img_loss=0.005781, reg_loss=0.015855, lr=0.02564, time=284.2 ms, rem=45.46 s
iter= 350, img_loss=0.005499, reg_loss=0.015855, lr=0.02552, time=283.8 ms, rem=42.58 s
iter= 360, img_loss=0.005205, reg_loss=0.015831, lr=0.02541, time=285.0 ms, rem=39.90 s
iter= 370, img_loss=0.005158, reg_loss=0.015843, lr=0.02529, time=284.5 ms, rem=36.99 s
iter= 380, img_loss=0.005272, reg_loss=0.015843, lr=0.02517, time=283.8 ms, rem=34.05 s
iter= 390, img_loss=0.005078, reg_loss=0.015829, lr=0.02506, time=284.3 ms, rem=31.28 s
iter= 400, img_loss=0.005323, reg_loss=0.015844, lr=0.02494, time=283.8 ms, rem=28.38 s
iter= 410, img_loss=0.005477, reg_loss=0.015849, lr=0.02483, time=283.7 ms, rem=25.53 s
iter= 420, img_loss=0.005470, reg_loss=0.015850, lr=0.02471, time=284.2 ms, rem=22.74 s
iter= 430, img_loss=0.005166, reg_loss=0.015857, lr=0.02460, time=284.3 ms, rem=19.90 s
iter= 440, img_loss=0.005321, reg_loss=0.015870, lr=0.02449, time=284.1 ms, rem=17.04 s
iter= 450, img_loss=0.005204, reg_loss=0.015876, lr=0.02437, time=284.5 ms, rem=14.23 s
iter= 460, img_loss=0.004758, reg_loss=0.015886, lr=0.02426, time=285.3 ms, rem=11.41 s
iter= 470, img_loss=0.004572, reg_loss=0.015879, lr=0.02415, time=284.1 ms, rem=8.52 s
iter= 480, img_loss=0.004710, reg_loss=0.015875, lr=0.02404, time=284.3 ms, rem=5.69 s
iter= 490, img_loss=0.004733, reg_loss=0.015862, lr=0.02393, time=284.3 ms, rem=2.84 s
iter= 500, img_loss=0.004931, reg_loss=0.015849, lr=0.02382, time=283.8 ms, rem=0.00 s
Base mesh has 10148 triangles and 5074 vertices.
Writing mesh: out3/bob/dmtet_mesh/mesh.obj
writing 5074 vertices
writing 6216 texcoords
writing 5074 normals
writing 10148 faces
Writing material: out3/bob/dmtet_mesh/mesh.mtl
Done exporting mesh
(base) wsr@cvlab-3:~/workspace/nvdiffrec-main$

As you can see, the program just stopped; no error was reported. I appreciate your kind reply.

jmunkberg commented 2 years ago

Ok, so this is not due to xatlas, as you successfully saved the model from the first pass. I suspect there may be a memory issue.

What GPU is this running on?

Some things to try:

  1. As suggested in the log:

    NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
    insufficient for PyTorch. NVIDIA recommends the use of the following flags:
    docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...
  2. Add the flag --shm-size 16G to the docker run command.

  3. For debugging with less memory usage, change to batch 1 and texture_res [64, 64] in the config

  4. Add some prints around line 620 here https://github.com/NVlabs/nvdiffrec/blob/main/train.py#L620 and in the optimize_mesh function to see what it is that crashes.
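To illustrate point 4, here is a sketch of the kind of prints that help: bracket each step of the second-pass loop with flushed prints so that the last message printed before the crash identifies the failing step. The loop below is a self-contained stand-in, not the actual optimize_mesh code; flush=True matters because a hard crash in native code can otherwise swallow buffered output.

    # Stand-in loop showing the print pattern; in train.py the same prints would go
    # inside optimize_mesh around the second (fixed-topology) pass.
    def optimize_mesh_debug(dataloader_train, pass_name="mesh_pass"):
        print("%s: entering training loop" % pass_name, flush=True)
        for it, target in enumerate(dataloader_train):
            print("%s: iter %d: got batch" % (pass_name, it), flush=True)
            # ... prepare_batch / render / loss / backward would go here ...
            print("%s: iter %d: step done" % (pass_name, it), flush=True)

    # Runs on its own with a dummy "dataloader":
    optimize_mesh_debug([0, 1, 2])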

sadexcavator commented 1 year ago

> Ok, so this is not due to xatlas, as you successfully saved the model from the first pass. I suspect there may be a memory issue.
>
> What GPU is this running on?
>
> Some things to try:
>
> 1. As suggested in the log:
>
>        NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
>        insufficient for PyTorch. NVIDIA recommends the use of the following flags:
>        docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...
>
> 2. Add the flag --shm-size 16G to the docker run command.
> 3. For debugging with less memory usage, change to batch 1 and texture_res [64, 64] in the config
> 4. Add some prints around line 620 here https://github.com/NVlabs/nvdiffrec/blob/main/train.py#L620 and in the optimize_mesh function to see what it is that crashes.

Hi, sorry it's been a while; I've been looking at your code recently. I tried adding the flag --shm-size 16G as you suggested, also tried '--ipc=host --ulimit memlock=-1 --ulimit stack=67108864' as recommended in the log, and even tried them combined, but none of that solved the problem. I then added two prints in the optimize_mesh function, around the start of the for loop at https://github.com/NVlabs/nvdiffrec/blob/main/train.py#L381, like this:

    ...
    print("%s begin trainning..." % (pass_name))
    for it, target in enumerate(dataloader_train):

        # Mix randomized background into dataset image
        target = prepare_batch(target, 'random')
        print("test print")
    ...

Then I found that in the second pass the 'test print' line is never printed, and I get this in the log:

    mesh_pass begin trainning...
    terminate called after throwing an instance of 'std::runtime_error'
      what():  Attempted to free arena memory that was not allocated.
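For reference, a C++ "terminate called" abort like this kills the process without a Python traceback. One way to still see which Python call triggered it is the standard-library faulthandler module, which dumps the Python traceback when the process receives a fatal signal (including the SIGABRT behind this message). A minimal sketch; placing it near the top of train.py is just a suggestion, not something from the run above:

    # faulthandler (Python standard library) dumps the Python traceback of all threads
    # on SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL, so the last Python frames before
    # a native abort become visible in the log.
    import faulthandler
    faulthandler.enable()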

Do you have any idea how to fix this? I really appreciate your help!