Closed nitthilan closed 4 years ago
Hi, have you run "python setup.py build_ext --inplace"? Is there any error?
I ran it and it did not throw any error. I get the following output:
root@684ba01c705d:/nitthilan/source_code/multiview_rendering/NSVF# python setup.py build_ext --inplace
running build_ext
copying build/lib.linux-x86_64-3.6/fairnr/clib/_ext.cpython-36m-x86_64-linux-gnu.so -> fairnr/clib
It's not able to recognize clib, even though this builds _ext. I am running the training from the NSVF folder, where the fairnr folder is present. Is this fine?
Thanking you.
Regards, K. J. Nitthilan
Yes, you should run the training under the NSVF folder. After compiling, do you find a file named "_ext.*.so" under "fairnr/clib"?
Yes. The file is present.
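The same check can be scripted. A minimal sketch, assuming the layout described in this thread (a "fairnr/clib" directory that receives the compiled _ext file); the helper name find_built_extensions is made up for illustration:

```python
from pathlib import Path
from typing import List


def find_built_extensions(clib_dir: str = "fairnr/clib") -> List[str]:
    """Return the names of compiled _ext shared objects found in clib_dir."""
    directory = Path(clib_dir)
    if not directory.is_dir():
        return []
    # build_ext --inplace copies a file named like
    # _ext.cpython-36m-x86_64-linux-gnu.so next to the package sources.
    return sorted(p.name for p in directory.glob("_ext*.so"))
```

Called from the NSVF folder, find_built_extensions() should return a non-empty list once the build has succeeded.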
I was able to bypass this error by moving the file one step higher and then importing
import fairnr._ext as _ext
This then caused another issue with import fairnr.criterions.utils as utils. I copied the code locally instead of importing it.
It seems the problem is a cyclic import, i.e. a module is referring to itself by its full name while it is still being imported. With these two changes, I was able to run the code. However, it now throws a runtime error.
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr_cli/train.py", line 336, in distributed_main
    main(args, init_distributed=True)
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr_cli/train.py", line 102, in main
    should_end_training = train(args, trainer, task, epoch_itr)
  File "/opt/conda/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr_cli/train.py", line 179, in train
    log_output = trainer.train_step(samples)
  File "/opt/conda/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.6/site-packages/fairseq/trainer.py", line 572, in train_step
    logging_outputs, sample_size, grad_norm,
  File "/opt/conda/lib/python3.6/site-packages/fairseq/trainer.py", line 935, in _reduce_and_log_stats
    self.task.reduce_metrics(logging_outputs, self.get_criterion())
  File "/opt/conda/lib/python3.6/site-packages/fairseq/tasks/fairseq_task.py", line 416, in reduce_metrics
    criterion.__class__.reduce_metrics(logging_outputs)
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/criterions/rendering_loss.py", line 107, in reduce_metrics
    for w in logging_outputs[0]
IndexError: list index out of range
Say you are under the "NSVF" main folder and open a python command line. Are you able to "import fairnr.clib"? There is an __init__.py file under "fairnr/clib", so it should be able to find it.
Do you see out of memory when running the code?
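For context, the IndexError in the traceback comes from indexing logging_outputs[0], which fails whenever that list is empty. One plausible cause is that every batch in the step was skipped, for example after an OOM. A defensive sketch of the pattern follows; collect_weight_keys and the "_weight" suffix are hypothetical and are not the actual code in rendering_loss.py:

```python
from typing import Dict, List


def collect_weight_keys(logging_outputs: List[Dict[str, float]]) -> List[str]:
    """Return the weight-like keys of the first logging output, or [] if none."""
    if not logging_outputs:
        # Nothing was logged for this step (for example, all batches were
        # skipped), so there is no first element to inspect.
        return []
    return [key for key in logging_outputs[0] if key.endswith("_weight")]
```

Guarding like this only hides the symptom; if the list is empty because of OOM, the memory settings still need to change.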
It works without the "import fairnr.clib._ext as _ext" line in the __init__.py file. However, when that line is added, it fails with the following error.
Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) [GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import fairnr.clib
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/__init__.py", line 11, in <module>
    from . import data, tasks, clib, modules, criterions, models
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/clib/__init__.py", line 28, in <module>
    import fairnr.clib._ext as _ext
AttributeError: module 'fairnr' has no attribute 'clib'
Nope
Hmm, strange. It shows fine on my side:
Our python versions are different. Looks like there is a fix for this in 3.7. I am using python 3.6. https://stackoverflow.com/questions/49051706/whats-new-in-python-3-7-about-circular-imports
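On Python 3.6 the failure can likely be sidestepped without upgrading by using a relative import inside the package's __init__.py, since "from . import _ext" does not look up clib as an attribute on the partially initialised fairnr module. A minimal sketch that builds a throwaway package to show the pattern; demo_pkg, sub, and the pure-Python _ext.py are stand-ins, not NSVF code:

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path) -> None:
    """Create demo_pkg/sub with a plain-Python stand-in for the compiled _ext."""
    sub = root / "demo_pkg" / "sub"
    sub.mkdir(parents=True)
    (root / "demo_pkg" / "__init__.py").write_text("from . import sub\n")
    # Relative form: avoids resolving `demo_pkg.sub` as an attribute while
    # the parent package is still being initialised.
    (sub / "__init__.py").write_text("from . import _ext\n")
    (sub / "_ext.py").write_text("ANSWER = 42\n")


def import_demo() -> int:
    """Import the throwaway package and read a value through sub._ext."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        build_demo_package(root)
        sys.path.insert(0, str(root))
        try:
            importlib.invalidate_caches()
            pkg = importlib.import_module("demo_pkg")
            return pkg.sub._ext.ANSWER
        finally:
            # Leave the interpreter state as we found it.
            sys.path.remove(str(root))
            for name in [n for n in sys.modules if n.split(".")[0] == "demo_pkg"]:
                del sys.modules[name]
```

Applied to this thread, that would mean replacing "import fairnr.clib._ext as _ext" in fairnr/clib/__init__.py with "from . import _ext"; this is a suggestion based on the traceback, not a change confirmed by the maintainers.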
Also, are you running your code on a multi-node GPU system? My setup is a single RTX 2080ti setup. It's a single node with two GPUs. So probably the logging need not do aggregation across nodes??
It supports both multi-node or single node systems with multiple GPUs. Aggregation across nodes will be automatically handled.
I use a single RTX 2080Ti and it runs out of memory (OOM) after 25k steps (I run train_wineholder.sh without any modification).
2020-10-12 14:54:59 | WARNING | fairseq.trainer | attempting to recover from OOM in forward/backward pass
2020-10-12 14:55:00 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 1.29 GiB (GPU 0; 10.76 GiB total capacity; 7.07 GiB already allocated; 1.29 GiB free; 8.10 GiB reserved in total by PyTorch)
2020-10-12 14:55:00 | WARNING | fairseq.trainer | |===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 9768 | cudaMalloc retries: 9773 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 6576 MB | 8200 MB | 679762 GB | 679755 GB |
| from large pool | 6567 MB | 8183 MB | 675427 GB | 675420 GB |
| from small pool | 8 MB | 165 MB | 4335 GB | 4335 GB |
|---------------------------------------------------------------------------|
| Active memory | 6576 MB | 8200 MB | 679762 GB | 679755 GB |
| from large pool | 6567 MB | 8183 MB | 675427 GB | 675420 GB |
| from small pool | 8 MB | 165 MB | 4335 GB | 4335 GB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 8298 MB | 9626 MB | 37506 MB | 29208 MB |
| from large pool | 8274 MB | 9500 MB | 36962 MB | 28688 MB |
| from small pool | 24 MB | 170 MB | 544 MB | 520 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory | 1721 MB | 7918 MB | 555122 GB | 555120 GB |
| from large pool | 1706 MB | 7892 MB | 550588 GB | 550587 GB |
| from small pool | 15 MB | 46 MB | 4533 GB | 4533 GB |
|---------------------------------------------------------------------------|
| Allocations | 190 | 9044 | 94757 K | 94757 K |
| from large pool | 46 | 180 | 17255 K | 17255 K |
| from small pool | 144 | 9028 | 77501 K | 77501 K |
|---------------------------------------------------------------------------|
| Active allocs | 190 | 9044 | 94757 K | 94757 K |
| from large pool | 46 | 180 | 17255 K | 17255 K |
| from small pool | 144 | 9028 | 77501 K | 77501 K |
|---------------------------------------------------------------------------|
| GPU reserved segments | 21 | 176 | 516 | 495 |
| from large pool | 9 | 113 | 244 | 235 |
| from small pool | 12 | 85 | 272 | 260 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 40 | 330 | 53488 K | 53488 K |
| from large pool | 9 | 73 | 7639 K | 7639 K |
| from small pool | 31 | 319 | 45848 K | 45848 K |
|===========================================================================|
So maybe the batch size needs to be lowered, e.g. --view-per-batch 2 --pixel-per-view 2048 (are these the two relevant options?), if your GPU has no more memory than an RTX 2080Ti.
Yes, you need more GPU memory, or you can change the batch size to 1. At 25k steps, it splits the voxels and halves the step size for the next round of progressive training, so it needs more GPU memory.
Our basic experiment was done on 32GB GPUs, so it is necessary to adjust the batch sizes (--view-per-batch and --pixel-per-view) or the chunking size (--chunk-size/--valid-chunk-size) accordingly. For example, you can set --chunk-size 32 to reduce the size of each forward pass.
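As a rough guide to how these flags interact, the number of rays sampled per step grows with the product of --view-per-batch and --pixel-per-view, and each halving of the step size roughly doubles the samples taken along a ray. A back-of-the-envelope sketch under those assumptions; the helper names are illustrative, the milestones are the --reduce-step-size-at values from the training command in this thread, and actual memory use also depends on voxel pruning and on how many voxels each ray hits:

```python
def rays_per_step(view_per_batch: int, pixel_per_view: int, max_sentences: int = 1) -> int:
    """Rays sampled per training step, assuming one ray per sampled pixel."""
    return max_sentences * view_per_batch * pixel_per_view


def step_size_factor(update: int, reduce_at=(5000, 25000, 75000)) -> int:
    """Relative samples per ray after the scheduled step-size halvings.

    Assumes a halving takes effect once `update` reaches each milestone.
    """
    halvings = sum(1 for milestone in reduce_at if update >= milestone)
    return 2 ** halvings


default_rays = rays_per_step(4, 2048)  # --view-per-batch 4 --pixel-per-view 2048
reduced_rays = rays_per_step(2, 2048)  # --view-per-batch 2 --pixel-per-view 2048
```

Under these assumptions, halving --view-per-batch offsets one step-size halving, which is why a run that fits before 25k steps may stop fitting afterwards.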
Hi, when I am training the network I get the following import error. It's probably a minor issue. Am I missing some step?
root@684ba01c705d:/nitthilan/source_code/multiview_rendering/NSVF# python -u train.py ${DATASET} --user-dir fairnr --task single_object_rendering --train-views "0..100" --view-resolution "800x800" --max-sentences 1 --view-per-batch 4 --pixel-per-view 2048 --no-preload --sampling-on-mask 1.0 --no-sampling-at-reader --valid-views "100..200" --valid-view-resolution "400x400" --valid-view-per-batch 1 --transparent-background "1.0,1.0,1.0" --background-stop-gradient --arch nsvf_base --initial-boundingbox ${DATASET}/bbox.txt --use-octree --raymarching-stepsize-ratio 0.125 --discrete-regularization --color-weight 128.0 --alpha-weight 1.0 --optimizer "adam" --adam-betas "(0.9, 0.999)" --lr 0.001 --lr-scheduler "polynomial_decay" --total-num-update 150000 --criterion "srn_loss" --clip-norm 0.0 --num-workers 0 --seed 2 --save-interval-updates 500 --max-update 150000 --virtual-epoch-steps 5000 --save-interval 1 --half-voxel-size-at "5000,25000,75000" --reduce-step-size-at "5000,25000,75000" --pruning-every-steps 2500 --keep-interval-updates 5 --keep-last-epochs 5 --log-format simple --log-interval 1 --save-dir ${SAVE} --tensorboard-logdir ${SAVE}/tensorboard | tee -a $SAVE/train.log
Traceback (most recent call last):
  File "train.py", line 7, in <module>
    from fairnr_cli.train import cli_main
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr_cli/train.py", line 26, in <module>
    from fairnr import ResetTrainerException
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/__init__.py", line 11, in <module>
    from . import data, tasks, models, modules, criterions, clib
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/models/__init__.py", line 15, in <module>
    module = importlib.import_module('fairnr.models.' + model_name)
  File "/opt/conda/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/models/multi_nsvf.py", line 15, in <module>
    from fairnr.models.nsvf import NSVFModel, base_architecture
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/models/nsvf.py", line 24, in <module>
    from fairnr.models.fairnr_model import BaseModel
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/models/fairnr_model.py", line 22, in <module>
    from fairnr.modules.encoder import get_encoder
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/modules/__init__.py", line 15, in <module>
    module = importlib.import_module('fairnr.modules.' + model_name)
  File "/opt/conda/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/modules/encoder.py", line 23, in <module>
    from fairnr.clib import (
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/clib/__init__.py", line 28, in <module>
    import fairnr.clib._ext as _ext
AttributeError: module 'fairnr' has no attribute 'clib'
root@684ba01c705d:/nitthilan/source_code/multiview_rendering/NSVF#