mmaslennikov closed this issue 5 years ago
Our tutorial only supports the docker workflow, and going forward we expect that to remain the case. We chose docker to avoid the numerous issues that arise from people having different setups on their individual machines. That being said, I totally understand your reasons for not using docker. Just be aware that it will be a bit more of a challenge to get things up and running, and we aren't planning to support this now or in the future.
The nice thing about docker is that even if you don't want to run inside a docker container, the fact that there is a docker container for our code makes it very easy to track down all the dependencies: just look through the Dockerfile.
The other thing that is likely to trip you up has to do with paths. You just need to ensure that your PYTHONPATH and PATH are correctly set. Have a look at the entrypoint and environment setup file.
Good luck!
Regarding nvidia-docker not passing the nvidia-smi test: try providing the CUDA version, for example:
nvidia-docker run --rm nvidia/cuda:9.1-base nvidia-smi
For me, running the latest nvidia/cuda image returns the following error:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": unknown.
I'm guessing this is an issue with the latest nvidia/cuda image, which does not contain nvidia-smi.
Thank you @noorvir-a for your helpful note! Very helpful, this wasn’t on our radar.
Many thanks for your kind and detailed reply, even though this is outside of the supported docker-based workflow. I explored further and was able to launch training. It takes ~20 minutes on a GTX 1080 for the caterpillar object.
I overcame the following obstacles: (1) I am not clear about the "fully_conv" setting. I had to switch it off by specifying fully_conv=False during resnet initialization in https://github.com/warmspringwinds/pytorch-segmentation-detection/blob/d6e7e8242ba72b7e53cee6703348de2d5ccc81e5/pytorch_segmentation_detection/models/resnet_dilated.py#L244. Otherwise, I got the error
RuntimeError: Error(s) in loading state_dict for ResNet:
size mismatch for fc.weight: copying a param of torch.Size([1000, 512, 1, 1]) from checkpoint, where the shape is torch.Size([1000, 512]) in current model.
The reason for the error is the line https://github.com/warmspringwinds/vision/blob/5e0a760fc847d55a4c1699410a14003452fa4581/torchvision/models/resnet.py#L153
self.fc = nn.Conv2d(512 * block.expansion, num_classes, 1)
which is called in /pytorch-segmentation-detection/vision/torchvision/models/resnet.py. The call produces an object with the mismatching self.fc.weight.shape==[1000,512], which is causing the error.
It looks like "fully_conv" was added later into /pytorch-segmentation-detection/vision/torchvision/models/resnet.py, since https://github.com/pytorch/vision/blob/v0.2.1/torchvision/models/resnet.py does not have it (I tried different versions of this file, not only 0.2.1)
(How important is fully_conv? Any other ideas for fixing it?)
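One possible alternative to switching fully_conv off, sketched here under assumptions: a 1x1 convolution over 512 channels holds exactly as many weights as a 512-to-1000 linear layer, so the checkpoint tensor can be reshaped before loading. The helper load_with_reshaped_params and the two toy head classes below are hypothetical names for illustration, not part of torchvision or of this repository, and this has not been tested against the actual ResNet34 checkpoint.

```python
import torch
from torch import nn


def load_with_reshaped_params(model, state_dict):
    """Load `state_dict` into `model`, reshaping entries whose element
    count matches the model's parameter but whose shape differs
    (e.g. [1000, 512] vs. [1000, 512, 1, 1])."""
    target = model.state_dict()
    fixed = {}
    for name, tensor in state_dict.items():
        if (name in target
                and tensor.shape != target[name].shape
                and tensor.numel() == target[name].numel()):
            tensor = tensor.reshape(target[name].shape)
        fixed[name] = tensor
    model.load_state_dict(fixed)
    return model


class LinearHead(nn.Module):
    """Toy stand-in for a classifier head saved with nn.Linear."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(512, 1000)


class ConvHead(nn.Module):
    """Toy stand-in for the fully convolutional head (1x1 conv)."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Conv2d(512, 1000, 1)


checkpoint = LinearHead().state_dict()
conv_model = load_with_reshaped_params(ConvHead(), checkpoint)
print(conv_model.fc.weight.shape)
```

The helper does not care which side is the linear one, so it avoids having to interpret which shape in the error message belongs to the checkpoint and which to the model.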
(2) Switched off the tensor restructuring
https://github.com/warmspringwinds/vision/blob/5e0a760fc847d55a4c1699410a14003452fa4581/torchvision/models/resnet.py#L212 https://github.com/warmspringwinds/vision/blob/5e0a760fc847d55a4c1699410a14003452fa4581/torchvision/models/resnet.py#L213
if not self.fully_conv:
x = x.view(x.size(0), -1)
(3) Fixed /dense_correspondence/correspondence_tools/correspondence_augmentation.py:
https://github.com/RobotLocomotion/pytorch-dense-correspondence/blob/c3b068adca006b828248fdf16b00aa7603d462e2/dense_correspondence/correspondence_tools/correspondence_augmentation.py#L68 image.height -> images[0].height
https://github.com/RobotLocomotion/pytorch-dense-correspondence/blob/c3b068adca006b828248fdf16b00aa7603d462e2/dense_correspondence/correspondence_tools/correspondence_augmentation.py#L82 image.width -> images[0].width
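A likely reason this fix is needed on Python 3 (an assumption, since the original function body is not quoted here): in Python 2 the loop variable of a list comprehension leaks into the enclosing function, so a later image.height still resolves, while in Python 3 it does not. The sketch below reproduces that pattern with a hypothetical FakeImage stand-in and made-up function names; it is not the repository's code.

```python
from collections import namedtuple

# Hypothetical stand-in for a PIL image; only width/height are needed here.
FakeImage = namedtuple("FakeImage", ["width", "height"])


def flip_v_leaky(images, v_positions):
    """Relies on `image` leaking out of the comprehension (Python 2 only)."""
    _copies = [image for image in images]
    height = image.height  # NameError on Python 3: `image` is not in scope
    return [(height - 1) - v for v in v_positions]


def flip_v_fixed(images, v_positions):
    """Same computation, reading the size from the first image explicitly."""
    height = images[0].height
    return [(height - 1) - v for v in v_positions]


images = [FakeImage(width=640, height=480)]
try:
    flip_v_leaky(images, [0, 479])
except NameError as err:
    print("leaky version fails on Python 3:", err)
print(flip_v_fixed(images, [0, 479]))
```

Using images[0] assumes all images in the list share the same size, which is what the original fix implies.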
Regarding cuda (off-topic): I tried and got the following response
steve@gx501:~$ nvidia-docker run --rm nvidia/cuda:9.1-base nvidia-smi
docker: Error response from daemon: Unknown runtime specified nvidia.
Hello,
Imho, reproducing without Docker is good practice. Also, nvidia-docker does not pass the nvidia-smi test on my Ubuntu 18.04 setup with PyCharm 2018, Python 3.7 and Anaconda. So, I am trying to reproduce the system without Docker and would like to share my experience and get advice.
(0) I separated the data and project files into two directories:
DIR_DATA = '/home/steve/steve/corpus/robot_locomotion'
DIR_PROJ = '/home/steve/experiments/pytorch-dense-correspondence'
(1) Downloaded the dataset files. The files are huge, so I chose to download only the caterpillar logs.
First, I downloaded the files with wget, which took a whole night, using /config/download_pdc_data.py. However, I later discovered that wget had not downloaded the tar.gz files correctly, so I had to download them again. This time I used aria2c with 16 connections per file (-x 16), which was much faster.
I created the file caterpillar.txt:
http://data.csail.mit.edu/labelfusion/pdccompressed/logs_proto/2018-04-16-14-40-25.tar.gz
http://data.csail.mit.edu/labelfusion/pdccompressed/logs_proto/2018-04-16-14-42-26.tar.gz
http://data.csail.mit.edu/labelfusion/pdccompressed/logs_proto/2018-04-16-14-44-53.tar.gz
http://data.csail.mit.edu/labelfusion/pdccompressed/logs_proto/2018-04-16-14-49-22.tar.gz
http://data.csail.mit.edu/labelfusion/pdccompressed/logs_proto/2018-04-16-15-23-41.tar.gz
http://data.csail.mit.edu/labelfusion/pdccompressed/logs_proto/2018-04-16-15-25-38.tar.gz
http://data.csail.mit.edu/labelfusion/pdccompressed/logs_proto/2018-04-16-15-28-45.tar.gz
http://data.csail.mit.edu/labelfusion/pdccompressed/logs_proto/2018-04-16-15-30-50.tar.gz
http://data.csail.mit.edu/labelfusion/pdccompressed/logs_proto/2018-04-16-14-46-36.tar.gz
http://data.csail.mit.edu/labelfusion/pdccompressed/logs_proto/2018-04-16-17-35-21.tar.gz
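For anyone repeating this, the list file can also be generated rather than typed by hand. This is only a sketch: the function name write_aria2_input is made up for illustration, and it assumes the aria2c input-file convention of one URI per line (URIs placed on the same line are, as far as I understand, treated as mirrors of a single file).

```python
BASE_URL = "http://data.csail.mit.edu/labelfusion/pdccompressed/logs_proto"

# Log names taken from the caterpillar URLs listed above.
CATERPILLAR_LOGS = [
    "2018-04-16-14-40-25",
    "2018-04-16-14-42-26",
    "2018-04-16-14-44-53",
    "2018-04-16-14-49-22",
    "2018-04-16-15-23-41",
    "2018-04-16-15-25-38",
    "2018-04-16-15-28-45",
    "2018-04-16-15-30-50",
    "2018-04-16-14-46-36",
    "2018-04-16-17-35-21",
]


def write_aria2_input(path, log_names=CATERPILLAR_LOGS, base_url=BASE_URL):
    """Write one download URL per line to `path` and return the URLs."""
    urls = ["{}/{}.tar.gz".format(base_url, name) for name in log_names]
    with open(path, "w") as handle:
        handle.write("\n".join(urls) + "\n")
    return urls


if __name__ == "__main__":
    write_aria2_input("caterpillar.txt")
```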
Afterwards, I launched the download with
$ aria2c -x 16 -i caterpillar.txt
and manually unpacked the archives into /code/data_volume/pdc/logs_proto. In my experience, the download time per file dropped to ~3 minutes from ~30-35.
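Since the first download left me with broken tar.gz files, it may be worth checking each archive before unpacking. The following is a minimal sketch using only the standard library; is_complete_archive and unpack_logs are hypothetical helper names, and reading every member to the end is just one way to detect a truncated download.

```python
import os
import tarfile
import zlib


def is_complete_archive(path):
    """Return True if every file in the .tar.gz can be read to its end."""
    try:
        with tarfile.open(path, "r:gz") as archive:
            for member in archive:
                if member.isfile():
                    with archive.extractfile(member) as stream:
                        while stream.read(1 << 20):
                            pass
        return True
    except (tarfile.TarError, EOFError, OSError, zlib.error):
        return False


def unpack_logs(archive_dir, dest_dir):
    """Unpack every readable .tar.gz in `archive_dir` into `dest_dir`.

    Returns (unpacked, broken) lists of archive file names.
    Only use this on archives from a source you trust.
    """
    os.makedirs(dest_dir, exist_ok=True)
    unpacked, broken = [], []
    for name in sorted(os.listdir(archive_dir)):
        if not name.endswith(".tar.gz"):
            continue
        path = os.path.join(archive_dir, name)
        if is_complete_archive(path):
            with tarfile.open(path, "r:gz") as archive:
                archive.extractall(dest_dir)
            unpacked.append(name)
        else:
            broken.append(name)
    return unpacked, broken
```

For example, unpack_logs(".", "/code/data_volume/pdc/logs_proto") would unpack the downloaded logs and return the names of any archives that need to be fetched again.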
(3) Debugging is imho important, and I like to use PyCharm in my work. So, I extracted training_tutorial.ipynb into training_tutorial.py and created simplified unit tests in test_trainingTutorial.py (as attached). I set the directories /dense_correspondence, /modules and /pytorch-segmentation-detection as the source directories.
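Outside PyCharm, the same effect as marking source directories can be approximated by extending sys.path before importing the project modules. This is a sketch under the assumption that the three directories sit directly under the project root; add_source_dirs is a made-up helper name.

```python
import os
import sys

# Directories that were marked as source roots in PyCharm.
SOURCE_DIRS = ("dense_correspondence", "modules", "pytorch-segmentation-detection")


def add_source_dirs(project_root, names=SOURCE_DIRS):
    """Prepend each source directory to sys.path once; return what was added."""
    added = []
    for name in names:
        path = os.path.join(project_root, name)
        if path not in sys.path:
            sys.path.insert(0, path)
            added.append(path)
    return added


DIR_PROJ = "/home/steve/experiments/pytorch-dense-correspondence"
added = add_source_dirs(DIR_PROJ)
```

Setting PYTHONPATH in the shell to the same directories should have an equivalent effect for scripts started from the command line.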
(4) I am launching the train() function in training_tutorial.py. Currently, I am struggling with ResNet34, and it looks like the ResNet version is different (I used diffnow.com to compare). Essentially, variables fully_conv, remove_avg_pool_layer, output_stride=8, dilation do not exist in the current resnet.py.
When I debug the ResNet initialization line, I get the message "RuntimeError: Error(s) in loading state_dict for ResNet: size mismatch for fc.weight: copying a param of torch.Size([1000, 512, 1, 1]) from checkpoint, where the shape is torch.Size([1000, 512]) in current model."
I am checking whether this error appears due to some ResNet update. Also, why did you include your own version of ResNet in /pytorch-segmentation-detection/vision/torchvision/models/resnet.py? (You could probably inherit from the ResNet class.) Would you kindly hint at why you chose ResNet and not, e.g., Inception or DenseNet?
It would be great if you could share your experience or advice. Below are the files that I created.
==============/dense_correspondence/training/training_tutorial.py ==============
================/config/params.py ================
Thank you in advance