facebookresearch / NSVF

Open source code for the paper of Neural Sparse Voxel Fields.
MIT License

AttributeError: module 'fairnr' has no attribute 'clib' #6

Closed nitthilan closed 4 years ago

nitthilan commented 4 years ago

Hi, when I train the network I get the following import error. It's probably a minor issue; am I missing a step?

root@684ba01c705d:/nitthilan/source_code/multiview_rendering/NSVF# python -u train.py ${DATASET} --user-dir fairnr --task single_object_rendering --train-views "0..100" --view-resolution "800x800" --max-sentences 1 --view-per-batch 4 --pixel-per-view 2048 --no-preload --sampling-on-mask 1.0 --no-sampling-at-reader --valid-views "100..200" --valid-view-resolution "400x400" --valid-view-per-batch 1 --transparent-background "1.0,1.0,1.0" --background-stop-gradient --arch nsvf_base --initial-boundingbox ${DATASET}/bbox.txt --use-octree --raymarching-stepsize-ratio 0.125 --discrete-regularization --color-weight 128.0 --alpha-weight 1.0 --optimizer "adam" --adam-betas "(0.9, 0.999)" --lr 0.001 --lr-scheduler "polynomial_decay" --total-num-update 150000 --criterion "srn_loss" --clip-norm 0.0 --num-workers 0 --seed 2 --save-interval-updates 500 --max-update 150000 --virtual-epoch-steps 5000 --save-interval 1 --half-voxel-size-at "5000,25000,75000" --reduce-step-size-at "5000,25000,75000" --pruning-every-steps 2500 --keep-interval-updates 5 --keep-last-epochs 5 --log-format simple --log-interval 1 --save-dir ${SAVE} --tensorboard-logdir ${SAVE}/tensorboard | tee -a $SAVE/train.log

Traceback (most recent call last):
  File "train.py", line 7, in <module>
    from fairnr_cli.train import cli_main
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr_cli/train.py", line 26, in <module>
    from fairnr import ResetTrainerException
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/__init__.py", line 11, in <module>
    from . import data, tasks, models, modules, criterions, clib
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/models/__init__.py", line 15, in <module>
    module = importlib.import_module('fairnr.models.' + model_name)
  File "/opt/conda/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/models/multi_nsvf.py", line 15, in <module>
    from fairnr.models.nsvf import NSVFModel, base_architecture
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/models/nsvf.py", line 24, in <module>
    from fairnr.models.fairnr_model import BaseModel
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/models/fairnr_model.py", line 22, in <module>
    from fairnr.modules.encoder import get_encoder
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/modules/__init__.py", line 15, in <module>
    module = importlib.import_module('fairnr.modules.' + model_name)
  File "/opt/conda/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/modules/encoder.py", line 23, in <module>
    from fairnr.clib import (
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/clib/__init__.py", line 28, in <module>
    import fairnr.clib._ext as _ext
AttributeError: module 'fairnr' has no attribute 'clib'
root@684ba01c705d:/nitthilan/source_code/multiview_rendering/NSVF#

lingjie0206 commented 4 years ago

Hi, have you run "python setup.py build_ext --inplace"? Is there any error?

nitthilan commented 4 years ago

I ran it and it did not throw any error. I got the following output:

root@684ba01c705d:/nitthilan/source_code/multiview_rendering/NSVF# python setup.py build_ext --inplace
running build_ext
copying build/lib.linux-x86_64-3.6/fairnr/clib/_ext.cpython-36m-x86_64-linux-gnu.so -> fairnr/clib

So it builds _ext, but it's still not able to recognize clib. I am running the training from the NSVF folder, where the fairnr folder is present. Is this fine?

Thanking you.

Regards, K. J. Nitthilan

MultiPath commented 4 years ago

Yes, you should run the training under the NSVF folder. After compiling, do you find a file named "_ext.*.so" under "fairnr/clib"?

nitthilan commented 4 years ago

Yes. The file is present.

nitthilan commented 4 years ago

I was able to bypass this error by moving the file one step higher and then importing

import fairnr._ext as _ext

This then caused another issue with "import fairnr.criterions.utils as utils", so I copied that code locally instead of importing it.

It seems the problem is circular importing, i.e. a module inside the package imports the package itself while it is still being initialized. With these two changes I was able to run the code. However, it now throws a runtime error:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr_cli/train.py", line 336, in distributed_main
    main(args, init_distributed=True)
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr_cli/train.py", line 102, in main
    should_end_training = train(args, trainer, task, epoch_itr)
  File "/opt/conda/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr_cli/train.py", line 179, in train
    log_output = trainer.train_step(samples)
  File "/opt/conda/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.6/site-packages/fairseq/trainer.py", line 572, in train_step
    logging_outputs, sample_size, grad_norm,
  File "/opt/conda/lib/python3.6/site-packages/fairseq/trainer.py", line 935, in _reduce_and_log_stats
    self.task.reduce_metrics(logging_outputs, self.get_criterion())
  File "/opt/conda/lib/python3.6/site-packages/fairseq/tasks/fairseq_task.py", line 416, in reduce_metrics
    criterion.__class__.reduce_metrics(logging_outputs)
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/criterions/rendering_loss.py", line 107, in reduce_metrics
    for w in logging_outputs[0]
IndexError: list index out of range

MultiPath commented 4 years ago

Say you are under "NSVF" main folder and open a python command line. Are you able to "import fairnr.clib" ? There is an __init__.py file under "fairnr/clib", it should be able to find it though.
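
For example, a quick check along these lines should work from the NSVF root after compiling (this is only an illustrative sketch; the exact .so filename depends on your platform and Python version, which is why a glob pattern is used):

import glob

# The compiled extension should have been copied here by
# "python setup.py build_ext --inplace"; the filename varies by platform.
print(glob.glob("fairnr/clib/_ext*.so"))

# If the file is listed, importing the clib subpackage should also succeed.
import fairnr.clib
print(fairnr.clib.__file__)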

MultiPath commented 4 years ago

Do you see out of memory when running the code?

nitthilan commented 4 years ago

Say you are under "NSVF" main folder and open a python command line. Are you able to "import fairnr.clib" ? There is an __init__.py file under "fairnr/clib", it should be able to find it though.

It works without the "import fairnr.clib._ext as _ext" line in the __init__.py file. However, when that line is added, it fails with the following error.

Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) [GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import fairnr.clib
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/__init__.py", line 11, in <module>
    from . import data, tasks, clib, modules, criterions, models
  File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/clib/__init__.py", line 28, in <module>
    import fairnr.clib._ext as _ext
AttributeError: module 'fairnr' has no attribute 'clib'

Nope, I don't see any out-of-memory errors.

MultiPath commented 4 years ago

Hmm, strange. It shows fine on my side: [screenshot]

nitthilan commented 4 years ago

Our Python versions are different. It looks like there is a fix for this in 3.7; I am using Python 3.6. https://stackoverflow.com/questions/49051706/whats-new-in-python-3-7-about-circular-imports
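
For reference, here is a minimal sketch of the failure mode (package and file names below are hypothetical, not the actual fairnr layout). On Python 3.6, "import pkg.clib._ext as _ext" binds the name through an attribute lookup on the partially initialized parent package, so it fails while "pkg" is still being imported; Python 3.7 falls back to sys.modules, and a relative import works on both:

# pkg/__init__.py -- imports submodules; models comes before clib
from . import models, clib

# pkg/models.py -- triggers the clib package import while pkg is still initializing
from pkg.clib import helper

# pkg/clib/__init__.py
import pkg.clib._ext as _ext   # AttributeError on Python 3.6: module 'pkg' has no attribute 'clib'
# from . import _ext           # relative import: works on both 3.6 and 3.7

def helper():
    return _ext.VALUE

# pkg/clib/_ext.py -- stands in for the compiled extension
VALUE = 42

# Running "import pkg" reproduces the error on Python 3.6 and succeeds on 3.7+.

If circular importing is indeed the cause, switching line 28 of fairnr/clib/__init__.py to the relative form ("from . import _ext") might avoid the problem without moving the .so file; this is only a suggestion based on the example above.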

nitthilan commented 4 years ago

Also, are you running your code on a multi-node GPU system? My setup is a single node with two RTX 2080 Ti GPUs. So probably the logging need not aggregate across nodes?

MultiPath commented 4 years ago

It supports both multi-node and single-node systems with multiple GPUs. Aggregation across nodes is handled automatically.

kwea123 commented 4 years ago

I use a single RTX 2080Ti and it yields OOM after 25k steps (I run train_wineholder.sh without any modification)

2020-10-12 14:54:59 | WARNING | fairseq.trainer | attempting to recover from OOM in forward/backward pass
2020-10-12 14:55:00 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 1.29 GiB (GPU 0; 10.76 GiB total capacity; 7.07 GiB already allocated; 1.29 GiB free; 8.10 GiB reserved in total by PyTorch)
2020-10-12 14:55:00 | WARNING | fairseq.trainer | |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 9768         |        cudaMalloc retries: 9773      |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    6576 MB |    8200 MB |  679762 GB |  679755 GB |
|       from large pool |    6567 MB |    8183 MB |  675427 GB |  675420 GB |
|       from small pool |       8 MB |     165 MB |    4335 GB |    4335 GB |
|---------------------------------------------------------------------------|
| Active memory         |    6576 MB |    8200 MB |  679762 GB |  679755 GB |
|       from large pool |    6567 MB |    8183 MB |  675427 GB |  675420 GB |
|       from small pool |       8 MB |     165 MB |    4335 GB |    4335 GB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    8298 MB |    9626 MB |   37506 MB |   29208 MB |
|       from large pool |    8274 MB |    9500 MB |   36962 MB |   28688 MB |
|       from small pool |      24 MB |     170 MB |     544 MB |     520 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory |    1721 MB |    7918 MB |  555122 GB |  555120 GB |
|       from large pool |    1706 MB |    7892 MB |  550588 GB |  550587 GB |
|       from small pool |      15 MB |      46 MB |    4533 GB |    4533 GB |
|---------------------------------------------------------------------------|
| Allocations           |     190    |    9044    |   94757 K  |   94757 K  |
|       from large pool |      46    |     180    |   17255 K  |   17255 K  |
|       from small pool |     144    |    9028    |   77501 K  |   77501 K  |
|---------------------------------------------------------------------------|
| Active allocs         |     190    |    9044    |   94757 K  |   94757 K  |
|       from large pool |      46    |     180    |   17255 K  |   17255 K  |
|       from small pool |     144    |    9028    |   77501 K  |   77501 K  |
|---------------------------------------------------------------------------|
| GPU reserved segments |      21    |     176    |     516    |     495    |
|       from large pool |       9    |     113    |     244    |     235    |
|       from small pool |      12    |      85    |     272    |     260    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      40    |     330    |   53488 K  |   53488 K  |
|       from large pool |       9    |      73    |    7639 K  |    7639 K  |
|       from small pool |      31    |     319    |   45848 K  |   45848 K  |
|===========================================================================|

So maybe the batch size needs to be lowered, i.e. --view-per-batch 2 --pixel-per-view 2048 (these two?), if your GPU has memory <= that of an RTX 2080 Ti.

lingjie0206 commented 4 years ago

Yes, you need more GPU memory or need to reduce the batch size to 1. At 25k steps it splits the voxels and halves the step size for the next round of progressive training, so it needs more GPU memory.
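
As a rough illustration of why memory jumps at those steps (the numbers below are hypothetical, and in practice pruning removes empty voxels, so real growth is well below this worst case):

# Each voxel-halving step can split a retained voxel into up to 8 octants,
# so the voxel count (and the memory for per-voxel embeddings) can grow
# sharply at steps 5000, 25000 and 75000.
voxels = 1000  # hypothetical count after initial pruning
for step in (5000, 25000, 75000):
    voxels *= 8  # upper bound: every voxel splits into 8 children
    print("after step", step, "up to", voxels, "voxels")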

MultiPath commented 4 years ago

Our basic experiments were done on 32GB GPUs, so it is necessary to adjust the batch sizes (--view-per-batch and --pixel-per-view) or the chunk size (--chunk-size/--valid-chunk-size) accordingly. For example, you can set --chunk-size 32 to reduce the memory used in the forward pass.