drprojects / DeepViewAgg

[CVPR'22 Best Paper Finalist] Official PyTorch implementation of the method presented in "Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation"
219 stars 24 forks

mm_files folder problem. #25

Closed okssi0291 closed 1 year ago

okssi0291 commented 1 year ago

Hi, I am trying to run scripts/train_kitti360.py

I ran into the error shown below in [Error Log]. I actually hit this error before when trying to run kitti360_inference.ipynb, and resolved it by creating a folder called "mm_files"; the failure happened because that folder did not exist. But now, running scripts/train_kitti360.py, the issue pops up again even with an mm_files folder created. I don't know where to create this folder for this script, or whether I did something wrong. I have almost finished my work with the code for reviewing your paper except for this issue, so please give me any advice to resolve it. It occurs while executing this part of the following code:

class APIModel(BaseModel):
    def forward(self, *args, **kwargs):
        features = self.backbone(self.input).x  # <==== here

[Error Log]


....
{'mm_time': 0.001059}
{'mm_time': 0.000477}
{'mm_time': 0.000126}
{'mm_time': 0.001301}
Saving mm_files/in_feat_0.pt and mm_files/kernel_16_0.pt
terminate called after throwing an instance of 'c10::Error'
  what():  [enforce fail at inline_container.cc:380] . PytorchStreamWriter failed writing file version: file write failed
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x47 (0x7f03a85196a7 in /home/okssi/anaconda3/envs/deep_view_aggregation_rev2/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamWriter::valid(char const*, char const*) + 0xa2 (0x7f039139dc72 in /home/okssi/anaconda3/envs/deep_view_aggregation_rev2/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: caffe2::serialize::PyTorchStreamWriter::writeRecord(std::string const&, void const*, unsigned long, bool) + 0xbf (0x7f039139e61f in /home/okssi/anaconda3/envs/deep_view_aggregation_rev2/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: caffe2::serialize::PyTorchStreamWriter::writeEndOfFile() + 0xe1 (0x7f039139f141 in /home/okssi/anaconda3/envs/deep_view_aggregation_rev2/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamWriter::~PyTorchStreamWriter() + 0x115 (0x7f039139f935 in /home/okssi/anaconda3/envs/deep_view_aggregation_rev2/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x3132245 (0x7f0392836245 in /home/okssi/anaconda3/envs/deep_view_aggregation_rev2/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: torch::jit::ExportModule(torch::jit::Module const&, std::string const&, std::unordered_map<std::string, std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&, bool, bool) + 0x374 (0x7f0392835114 in /home/okssi/anaconda3/envs/deep_view_aggregation_rev2/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
.....

drprojects commented 1 year ago

Hi @okssi0291, thanks for using this project.

Could you please provide a bit more context about what you are trying to do? There is no reference to an mm_files folder nor to any mm_file in the project. Is it something you have created? Is it an error you ran into when using the code as I provided it, or after making some modifications? Are you trying to save a torch object to disk?

okssi0291 commented 1 year ago

I am using a podman container + anaconda + pip to build an environment for DeepViewAgg. The source code is exactly the same as what you provide via the git repository; the commit is cbb69cd7c628f535514cf58b46e22d17e1d5e7dd.

I was not able to install this environment in one go, so I built it step by step, resolving issues along the way. I then tested it by running the notebooks: I have successfully run synthetic_multimodal_dataset.ipynb, kitti360_visualization.ipynb, and kitti360_inference.ipynb with some modifications, like creating the mm_files folder and copying missing datasets such as 2013_05_28_drive_0008_sync into the data_2d_raw folder.

The configuration in train_kitti360.sh is as follows:

# Select your GPU
I_GPU=0

DATA_ROOT="/home/okssi/deep_learning/DeepViewAgg/dataset_nvme0n1p2/train_dataset"  # set your dataset root directory, where the data was/will be downloaded
EXP_NAME="My_awesome_KITTI-360_experiment"  # whatever suits your needs
TASK="segmentation"
MODELS_CONFIG="${TASK}/multimodal/sparseconv3d"  # family of multimodal models using the sparseconv3d backbone
MODEL_NAME="Res16UNet34-PointPyramid-early-cityscapes-interpolate"  # specific model name
DATASET_CONFIG="${TASK}/multimodal/kitti360-sparse"
TRAINING="kitti360_benchmark/sparseconv3d_rgb-pretrained-pyramid-0"  # training configuration for discriminative learning rate on the model
EPOCHS=60
CYLINDERS_PER_EPOCH=12000  # roughly speaking, 40 cylinders per window
TRAINVAL=False  # True to train on Train+Val (e.g. before submission)
MINI=False  # True to train on a mini version of KITTI-360 (e.g. to debug)
BATCH_SIZE=4  # 4 fits in a 32G V100; can be increased at inference time, of course
WORKERS=1  # adapt to your machine
BASE_LR=0.1  # initial learning rate
LR_SCHEDULER='multi_step_kitti360'  # learning rate scheduler for 60 epochs
EVAL_FREQUENCY=5  # frequency at which metrics are computed on Val; lower is faster but gives fewer points on your validation curves
SUBMISSION=False  # True to generate files for a submission to the KITTI-360 3D semantic segmentation benchmark
CHECKPOINT_DIR=""  # optional path to an already-existing checkpoint; if provided, training will resume where it left off

export SPARSE_BACKEND=torchsparse

okssi0291 commented 1 year ago

I forgot to answer your questions.

Could you please provide a bit more context about what you are trying to do?
==> I am trying to reproduce your paper, that's it. Based on that, I will look for a new idea or some improvements.

There is no reference to an mm_files folder nor mm_file in the project. Is it something you have created?
==> It's weird. The code seems to save something in that folder. I didn't do anything besides create the folder. The following log is from executing notebooks/kitti360_inference.ipynb:

....
{'mm_time': 0.000399}
{'mm_time': 0.000120}
{'mm_time': 0.000781}
Saving mm_files/in_feat_0.pt and mm_files/kernel_16_0.pt
Saving mm_files/in_feat_1.pt and mm_files/kernel_17_1.pt
Saving mm_files/in_feat_2.pt and mm_files/kernel_18_2.pt
Saving mm_files/in_feat_3.pt and mm_files/kernel_19_3.pt
....

==> This indeed saves those files in the mm_files folder.

Is it an error you ran into when using the code as I provided it or after making some modifications?
==> I am using the code AS IS.

Are you trying to save a torch object to disk?
==> I didn't intend to, but it seems the framework or the torch_points3d package tries to save something.

drprojects commented 1 year ago

That's strange, I don't know where these mm_files come from.

I've run successfully synthetic_multimodal_dataset.ipynb, kitti360_visualization.ipynb, and kitti360_inference.ipynb with some modification like creating mm_files folder

Can you share the full traceback of the error, please? More precisely, I would need to know which part of the code is trying to save to disk. For example, I don't see where in the project the following prints could come from:

{'mm_time': 0.000126}
Saving mm_files/in_feat_0.pt and mm_files/kernel_16_0.pt
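
If you want to hunt down the origin of such a print, one option is to scan the installed packages for the string. This is only a sketch (it assumes a standard site-packages layout, and scanning the large .so files can take a while):

import pathlib
import site

# Sketch: search .py sources and compiled .so extensions of every
# installed package for the "mm_files" string printed in the log,
# to find which library emits it.
needle = b"mm_files"
for root in site.getsitepackages():
    for path in pathlib.Path(root).rglob("*"):
        if path.suffix not in {".py", ".so"} or not path.is_file():
            continue
        try:
            if needle in path.read_bytes():
                print(path)
        except OSError:
            pass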

Additionally, can you please share the list of installed packages in your environment with the following:

pip list
okssi0291 commented 1 year ago

This is the list

Package Version
------- -------
absl-py 1.4.0
aiofiles 22.1.0
aiosqlite 0.18.0
ansi2html 1.8.0
antlr4-python3-runtime 4.8
anyio 3.6.2
appdirs 1.4.4
argon2-cffi 20.1.0
ase 3.22.1
async-generator 1.10
attrs 20.2.0
Babel 2.12.1
backcall 0.2.0
backports.cached-property 1.0.2
backports.functools-lru-cache 1.6.1
bleach 3.1.5
brotlipy 0.7.0
cachetools 5.3.0
certifi 2022.12.7
cffi 1.14.2
chardet 3.0.4
charset-normalizer 3.1.0
click 8.1.3
cryptography 3.1
cycler 0.11.0
dash 2.8.1
dash-core-components 2.0.0
dash-html-components 2.0.0
dash-table 5.0.0
decorator 4.4.2
defusedxml 0.6.0
docker-pycreds 0.4.0
entrypoints 0.3
faiss-gpu 1.6.5
filelock 3.0.12
Flask 2.2.3
fonttools 4.38.0
gdown 3.12.2
gitdb 4.0.10
GitPython 3.1.31
google-auth 2.16.2
google-auth-oauthlib 0.4.6
googledrivedownloader 0.4
grpcio 1.51.3
h5py 3.8.0
hydra-core 1.1.0
idna 2.10
imageio 2.26.0
importlib-metadata 6.0.0
importlib-resources 5.12.0
ipykernel 5.3.4
ipython 7.18.1
ipython-genutils 0.2.0
ipywidgets 8.0.4
isodate 0.6.1
itsdangerous 2.1.2
jedi 0.15.2
Jinja2 3.1.2
joblib 1.2.0
json5 0.9.4
jsonpatch 1.32
jsonpointer 2.3
jsonschema 4.17.3
jupyter-client 6.1.7
jupyter-core 4.6.3
jupyter-dash 0.4.2
jupyter-events 0.6.3
jupyter-server 1.23.6
jupyter-server-fileid 0.8.0
jupyter-server-ydoc 0.6.1
jupyter-ydoc 0.2.2
jupyterlab 3.6.1
jupyterlab-pygments 0.1.1
jupyterlab-server 2.20.0
jupyterlab-widgets 3.0.5
kiwisolver 1.4.4
llvmlite 0.39.1
Markdown 3.4.1
MarkupSafe 2.1.2
matplotlib 3.5.3
MinkowskiEngine 0.4.3
mistune 0.8.4
nb-conda-kernels 2.2.4
nbclassic 0.5.3
nbclient 0.5.0
nbconvert 6.0.1
nbformat 5.0.7
nest-asyncio 1.5.6
networkx 2.6.3
notebook 6.1.4
notebook-shim 0.2.2
numba 0.56.4
numpy 1.21.6
oauthlib 3.2.2
omegaconf 2.1.2
opencv-python 4.7.0.72
packaging 20.4
pandas 1.3.5
pandocfilters 1.4.2
parso 0.5.2
pathtools 0.1.2
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.4.0
pip 20.2.2
pkgutil-resolve-name 1.3.10
plotly 5.4.0
plyfile 0.7.4
prometheus-client 0.8.0
prompt-toolkit 3.0.7
protobuf 3.20.3
psutil 5.9.4
ptyprocess 0.6.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycparser 2.20
Pygments 2.6.1
pykeops 1.4.2
pyOpenSSL 19.1.0
pyparsing 2.4.7
pypng 0.20220715.0
pyrsistent 0.19.3
PySocks 1.7.1
python-dateutil 2.8.1
python-json-logger 2.0.7
python-louvain 0.16
pytorch-metric-learning 2.0.1
pytz 2022.7.1
PyYAML 6.0
pyzmq 20.0.0
rdflib 6.2.0
requests 2.28.2
requests-oauthlib 1.3.1
retrying 1.3.4
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rsa 4.9
scikit-learn 1.0.2
scipy 1.7.3
seaborn 0.12.2
Send2Trash 1.5.0
sentry-sdk 1.16.0
setproctitle 1.3.2
setuptools 49.6.0.post20200814
six 1.15.0
smmap 5.0.0
sniffio 1.3.0
tenacity 8.2.2
tensorboard 2.11.2
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
terminado 0.8.3
testpath 0.4.4
threadpoolctl 3.1.0
tomli 2.0.1
torch 1.7.1+cu101
torch-cluster 1.5.9
torch-geometric 1.6.3
torch-points-kernels 0.6.10
torch-scatter 2.0.7
torch-sparse 0.6.9
torch-spline-conv 1.2.1
torchnet 0.0.4
torchsparse 1.3.0
torchvision 0.8.2+cu101
tornado 6.2
tqdm 4.59.0
traitlets 5.0.4
typing-extensions 4.5.0
urllib3 1.25.10
visdom 0.2.4
wandb 0.13.11
wcwidth 0.2.5
webencodings 0.5.1
websocket-client 1.5.1
Werkzeug 2.2.3
wheel 0.35.1
widgetsnbextension 4.0.5
y-py 0.5.9
yacs 0.1.8
ypy-websocket 0.8.4
zipp 3.1.0

drprojects commented 1 year ago

Can you share the full traceback of the error, please? More precisely, I would need to know which part of the code is trying to save to disk. For example, I don't see where in the project the following prints could come from:

{'mm_time': 0.000126}
Saving mm_files/in_feat_0.pt and mm_files/kernel_16_0.pt

Please share the full error traceback so I can see which line produces the error.

drprojects commented 1 year ago

Also, you seem to have installed torch and torchvision for CUDA 10.1. This project has been tested with CUDA 10.2, 11.2 and 11.4, but not with this version. Not sure if this will cause other downstream issues, but you might want to upgrade your CUDA version.
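
You can check which CUDA build your wheels target directly from Python (a quick sketch):

import torch
import torchvision

# Sketch: report the CUDA toolkit the installed wheels were built against,
# and whether a GPU is visible to torch at all.
print("torch:", torch.__version__)              # e.g. 1.7.1+cu101
print("torchvision:", torchvision.__version__)  # e.g. 0.8.2+cu101
print("built for CUDA:", torch.version.cuda)    # e.g. 10.1
print("GPU available:", torch.cuda.is_available())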

drprojects commented 1 year ago

A quick googling of your error led me to this pytorch issue, which is likely the source of your problems.
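
The failure mode in that issue is easy to reproduce in isolation (a sketch, not code from this repository): torch.save fails the same way when the destination directory is missing or not writable, so checking the path first narrows things down quickly.

import os
import torch

# Sketch: a save succeeds only if the destination directory exists and is
# writable by the current user -- the two conditions this kind of
# PytorchStreamWriter "file write failed" error usually boils down to.
path = "mm_files/in_feat_0.pt"  # path taken from the error log above
parent = os.path.dirname(path)
os.makedirs(parent, exist_ok=True)
assert os.access(parent, os.W_OK), f"{parent} is not writable"
torch.save(torch.rand(4, 4), path)
print("save OK")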

okssi0291 commented 1 year ago

I tried to use nvidia/cuda:10.2-cudnn7-devel-ubuntu18.04 in order to use one of the compatible versions you mentioned. But I failed to install it because of a segmentation fault when running "from torch_points3d.trainer import Trainer". The only container that seems to work is the one I am using now, which is why I keep trying to resolve this issue in this container rather than installing another image and all the packages again. It is actually hard to build an environment for DeepViewAgg. Do you have any recommendation for building an environment to run DeepViewAgg easily, such as a podman/docker image or something? I have spent quite a lot of time working on this and am getting tired of trying.

And this is the traceback. I was not able to figure out where it is trying to save either.

....
{'mm_time': 0.000651}
{'mm_time': 0.000177}
{'mm_time': 0.001817}
Saving mm_files/in_feat_0.pt and mm_files/kernel_16_0.pt
terminate called after throwing an instance of 'c10::Error'
  what():  [enforce fail at inline_container.cc:380] . PytorchStreamWriter failed writing file version: file write failed
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x47 (0x7f405cff86a7 in /home/okssi/anaconda3/envs/deep_view_aggregation_rev2/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamWriter::valid(char const*, char const*) + 0xa2 (0x7f4098164c72 in /home/okssi/anaconda3/envs/deep_view_aggregation_rev2/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: caffe2::serialize::PyTorchStreamWriter::writeRecord(std::string const&, void const*, unsigned long, bool) + 0xbf (0x7f409816561f in /home/okssi/anaconda3/envs/deep_view_aggregation_rev2/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: caffe2::serialize::PyTorchStreamWriter::writeEndOfFile() + 0xe1 (0x7f4098166141 in /home/okssi/anaconda3/envs/deep_view_aggregation_rev2/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamWriter::~PyTorchStreamWriter() + 0x115 (0x7f4098166935 in /home/okssi/anaconda3/envs/deep_view_aggregation_rev2/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x3132245 (0x7f40995fd245 in /home/okssi/anaconda3/envs/deep_view_aggregation_rev2/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: torch::jit::ExportModule(torch::jit::Module const&, std::string const&, std::unordered_map<std::string, std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&, bool, bool) + 0x374 (0x7f40995fc114 in /home/okssi/anaconda3/envs/deep_view_aggregation_rev2/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #7: torch::serialize::OutputArchive::save_to(std::string const&) + 0x4a (0x7f4099810f9a in /home/okssi/anaconda3/envs/deep_view_aggregation_rev2/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8: void torch::save<at::Tensor, std::string&>(at::Tensor const&, std::string&) + 0x122 (0x7f40305c9522 in /home/okssi/anaconda3/envs/deep_view_aggregation_rev2/lib/python3.7/site-packages/torchsparse_backend.cpython-37m-x86_64-linux-gnu.so)
frame #9: ConvolutionForwardGPU(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, bool) + 0xbb0 (0x7f40305c5a20 in /home/okssi/anaconda3/envs/deep_view_aggregation_rev2/lib/python3.7/site-packages/torchsparse_backend.cpython-37m-x86_64-linux-gnu.so)
frame #10: <unknown function> + 0x1dfc6 (0x7f40305b9fc6 in /home/okssi/anaconda3/envs/deep_view_aggregation_rev2/lib/python3.7/site-packages/torchsparse_backend.cpython-37m-x86_64-linux-gnu.so)
frame #11: <unknown function> + 0x1acc2 (0x7f40305b6cc2 in /home/okssi/anaconda3/envs/deep_view_aggregation_rev2/lib/python3.7/site-packages/torchsparse_backend.cpython-37m-x86_64-linux-gnu.so)

frame #20: THPFunction_apply(_object*, _object*) + 0x93d (0x7f40a77dd90d in /home/okssi/anaconda3/envs/deep_view_aggregation_rev2/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
./scripts/train_kitti360.sh: line 60: 973 Aborted (core dumped) python -W ignore train.py data=${DATASET_CONFIG} models=${MODELS_CONFIG} model_name=${MODEL_NAME} task=${TASK} training=${TRAINING} lr_scheduler=${LR_SCHEDULER} eval_frequency=${EVAL_FREQUENCY} data.sample_per_epoch=${CYLINDERS_PER_EPOCH} data.dataroot=${DATA_ROOT} data.train_is_trainval=${TRAINVAL} data.mini=${MINI} training.cuda=${I_GPU} training.batch_size=${BATCH_SIZE} training.epochs=${EPOCHS} training.num_workers=${WORKERS} training.optim.base_lr=${BASE_LR} training.wandb.log=True training.wandb.name=${EXP_NAME} tracker_options.make_submission=${SUBMISSION} training.checkpoint_dir=${CHECKPOINT_DIR}
(deep_view_aggregation_rev2) okssi@5a20f4eecfeb:~/deep_learning/DeepViewAgg$
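
(Incidentally, frames #8 and #9 of this traceback suggest the torch::save call originates inside torchsparse_backend's ConvolutionForwardGPU, i.e. in the compiled torchsparse 1.3.0 extension rather than in the project's own Python code, which would explain why no reference to mm_files appears in the repository itself.)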
drprojects commented 1 year ago

A quick googling of your error led me to this pytorch issue, which is likely the source of your problems.

Have you checked this issue? :point_up_2:

okssi0291 commented 1 year ago

Yes, I did. But how does it apply here? He mentioned, "Thanks, I saved my model to a nonexistent path.", indicating that he attempted to save a model to a folder that doesn't exist. However, I'm not explicitly trying to save a model, and I'm uncertain whether I'm unintentionally doing so. I'm simply using the provided code and updating the dataset path. You mentioned that the provided code doesn't reference the mm_files folder; is there any code that saves a model? In the case of kitti360_inference.ipynb, I created an mm_files folder. That worked, and it indeed created numerous files within mm_files. However, I'm unsure where in the code those files are being saved.

okssi0291 commented 1 year ago

Dear Author,

I have discovered the reason for these issues. The problem arose because I was using a normal user account. By switching to the root account, using Anaconda, and running DeepViewAgg's install.sh, I can now execute scripts/train_kitti360.sh without any error messages. Thank you for your assistance; I truly appreciate it.

Best regards, Younghoon.

drprojects commented 1 year ago

Great! It did sound like a path permission problem, as in the mentioned issue. Closing this one, then.
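
For future readers, a quick way to check for this kind of permission problem without switching to root (a sketch, using the mm_files path from the logs above):

import os
import tempfile

# Sketch: verify the current user can create and write files in the
# directories the run touches, before launching the training script.
for d in (".", "mm_files"):
    try:
        os.makedirs(d, exist_ok=True)
        with tempfile.NamedTemporaryFile(dir=d):
            pass
        print(f"{d}: writable")
    except OSError as e:
        print(f"{d}: NOT writable ({e})")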

Best, Damien