Reproducing without docker

mmaslennikov commented 5 years ago

Hello,

Imho, being able to reproduce results without Docker is good practice. Also, nvidia-docker does not pass the nvidia-smi test on my machine (Ubuntu 18.04, PyCharm 2018, Python 3.7, Anaconda). So I am trying to reproduce the system without Docker, and I would like to share my experience and get advice.

(0) I separated the data and project files into separate directories:
DIR_DATA = '/home/steve/steve/corpus/robot_locomotion'
DIR_PROJ = '/home/steve/experiments/pytorch-dense-correspondence'

(1) Downloaded the dataset files. The files are huge, so I chose to download only the Caterpillar data.

First, I downloaded the files overnight with wget using /config/download_pdc_data.py. However, I later discovered that wget had not downloaded the tar.gz files correctly, so I had to download them again. This time I used aria2c with up to 16 connections per server (-x 16), which was much faster.

Afterwards, I launched it as $ aria2c -x 16 -i caterpillar.txt

and manually unpacked the archives into /code/data_volume/pdc/logs_proto. In my experience, the download time per file dropped from ~30-35 minutes to ~3 minutes.
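Since the first wget run left broken tar.gz files behind, it may help to check each archive before unpacking it. The following is only a minimal sketch using the Python standard library; the helper name is_intact_tar_gz and the demo file names are made up for illustration and are not part of the repository.

```python
import gzip
import os
import tarfile
import tempfile
import zlib


def is_intact_tar_gz(path):
    """Return True if the gzip stream decompresses fully and the tar index is readable."""
    try:
        # Reading to the end of the gzip stream verifies its CRC and length.
        with gzip.open(path, "rb") as stream:
            while stream.read(1024 * 1024):
                pass
        # Walking the member list catches a damaged tar structure.
        with tarfile.open(path, "r:gz") as archive:
            for _ in archive:
                pass
    except (OSError, EOFError, zlib.error, tarfile.TarError):
        return False
    return True


# Demo on throwaway files: one complete archive and one truncated copy,
# which imitates an interrupted download.
with tempfile.TemporaryDirectory() as workdir:
    payload = os.path.join(workdir, "payload.txt")
    with open(payload, "w") as handle:
        handle.write("demo data\n" * 1000)

    good = os.path.join(workdir, "good.tar.gz")
    with tarfile.open(good, "w:gz") as archive:
        archive.add(payload, arcname="payload.txt")

    bad = os.path.join(workdir, "truncated.tar.gz")
    with open(good, "rb") as src, open(bad, "wb") as dst:
        data = src.read()
        dst.write(data[: len(data) // 2])

    good_ok = is_intact_tar_gz(good)
    bad_ok = is_intact_tar_gz(bad)

print(good_ok, bad_ok)
```

Running the check over every downloaded file before unpacking should flag an incomplete download early, instead of failing later during training.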

(3) Debugging is imho important and I like to use PyCharm in my work. So, I extracted training_tutorial.ipynb into training_tutorial.py and created simplified unit tests in test_trainingTutorial.py (as attached). I set the directories /dense_correspondence, /modules and /pytorch-segmentation-detection as the source directories.

(4) I am launching the train() function in training_tutorial.py. Currently, I am struggling with ResNet34, and it looks like the ResNet version is different (I used diffnow.com to compare). Essentially, variables fully_conv, remove_avg_pool_layer, output_stride=8, dilation do not exist in the current resnet.py.

When I debug the ResNet initialization line, I get the message "RuntimeError: Error(s) in loading state_dict for ResNet: size mismatch for fc.weight: copying a param of torch.Size([1000, 512, 1, 1]) from checkpoint, where the shape is torch.Size([1000, 512]) in current model."

I am checking whether this error is caused by some ResNet update. Also, why did you include your own version of ResNet in /pytorch-segmentation-detection/vision/torchvision/models/resnet.py? (You could probably inherit from the ResNet class instead.) Could you also hint at why you chose ResNet rather than, e.g., Inception or DenseNet?
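When two copies of resnet.py are around (the vendored one and the one installed with torchvision), a quick check of which file Python actually resolves can save some guessing. This is a rough sketch with the standard library only; module_origin and accepts_keyword are hypothetical helper names, and the demo uses the json module so that it runs anywhere.

```python
import importlib
import importlib.util
import inspect


def module_origin(module_name):
    """Return the file a module would be loaded from, or None if it is not importable."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        return None
    return spec.origin if spec is not None else None


def accepts_keyword(module_name, callable_name, keyword):
    """Report whether module_name.callable_name declares the given keyword argument."""
    module = importlib.import_module(module_name)
    parameters = inspect.signature(getattr(module, callable_name)).parameters
    return keyword in parameters


# Demo with a standard-library module.
json_origin = module_origin("json")
missing_origin = module_origin("no_such_module_xyz")
has_indent = accepts_keyword("json", "dumps", "indent")
has_fully_conv = accepts_keyword("json", "dumps", "fully_conv")
print(json_origin, missing_origin, has_indent, has_fully_conv)

# For this issue one would instead try, for example (assuming torchvision is importable):
#   module_origin("torchvision.models.resnet")
#   accepts_keyword("torchvision.models.resnet", "ResNet", "fully_conv")
```

If the reported path points into site-packages rather than into /pytorch-segmentation-detection/vision, the stock torchvision is shadowing the modified copy, which would explain the missing arguments.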

It would be great if you could share your experience or advice. Below are the files that I created.

============== /dense_correspondence/training/training_tutorial.py ==============

import os
import sys
from config.params import DIR_DATA, DIR_PROJ  # only these two names are used below
import logging

import modules.dense_correspondence_manipulation.utils.utils as utils
from dense_correspondence.training.training import DenseCorrespondenceTraining
from dense_correspondence.dataset.spartan_dataset_masked import SpartanDataset
from dense_correspondence.evaluation.evaluation import DenseCorrespondenceEvaluation

class TrainingTutorial:
    def __init__(self):
        os.chdir(DIR_PROJ)
        sys.path.append(os.path.join(DIR_PROJ, 'modules'))
        os.environ['DC_SOURCE_DIR'] = DIR_DATA

        utils.add_dense_correspondence_to_python_path()
        logging.basicConfig(level=logging.INFO)

        self.load_configuration()

    def load_configuration(self):
        # config_filename = os.path.join(utils.getDenseCorrespondenceSourceDir(), 'config', 'dense_correspondence',
        config_filename = os.path.join(DIR_PROJ, 'config', 'dense_correspondence',
                                       'dataset', 'composite', 'caterpillar_only_9.yaml')
        config = utils.getDictFromYamlFilename(config_filename)

        # train_config_file = os.path.join(utils.getDenseCorrespondenceSourceDir(), 'config', 'dense_correspondence',
        train_config_file = os.path.join(DIR_PROJ, 'config', 'dense_correspondence',
                                         'training', 'training.yaml')

        self.train_config = utils.getDictFromYamlFilename(train_config_file)
        self.dataset = SpartanDataset(config=config)

        logging_dir = "code/data_volume/pdc/trained_models/tutorials"
        num_iterations = 3500
        descr_dim = 3  # the descriptor dimension
        self.train_config["training"]["logging_dir_name"] = "caterpillar_%d" % (descr_dim)
        self.train_config["training"]["logging_dir"] = logging_dir
        self.train_config["dense_correspondence_network"]["descriptor_dimension"] = descr_dim
        self.train_config["training"]["num_iterations"] = num_iterations

    def train(self):
        # This should take about ~12-15 minutes with a GTX 1080 Ti

        # All of the saved data for this network will be located in the
        # code/data_volume/pdc/trained_models/tutorials/caterpillar_3 folder

        descr_dim = self.train_config["dense_correspondence_network"]["descriptor_dimension"]
        print("training descriptor of dimension %d" % (descr_dim))
        train = DenseCorrespondenceTraining(dataset=self.dataset, config=self.train_config)
        train.run()
        print("finished training descriptor of dimension %d" % (descr_dim))

    def evaluate(self):
        logging_dir = self.train_config["training"]["logging_dir"]
        logging_dir_name = self.train_config["training"]["logging_dir_name"]
        model_folder = os.path.join(logging_dir, logging_dir_name)
        model_folder = utils.convert_to_absolute_path(model_folder)

        DCE = DenseCorrespondenceEvaluation
        num_image_pairs = 100
        DCE.run_evaluation_on_network(model_folder, num_image_pairs=num_image_pairs)

================ /config/params.py ================

DIR_DATA = '/home/steve/steve/corpus/robot_locomotion'
DIR_PROJ = '/home/steve/experiments/pytorch-dense-correspondence'

Thank you in advance

manuelli commented 5 years ago

Our tutorial only supports the docker workflow, and going forwards we envision that that will remain the case. We chose docker to avoid the numerous issues that arise from people having different setups on their individual machines. That being said I totally understand your reasons for not using docker, but just understand it will be a bit more of a challenge to get things up and running, and we aren't planning to support this now or in the future.

The nice thing about docker is that even if you don't want to run inside a docker container, the fact that there is a docker container for our code makes it very easy to track down all the dependencies, just look through the dockerfile.

The other thing that is likely to trip you up has to do with paths. You just need to ensure that your PYTHONPATH and PATH are correctly set. Have a look at the entrypoint and environment setup file
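Outside the container, the path setup can also be approximated from Python itself. The sketch below is an assumption about what is needed, not a copy of the entrypoint script: the helper add_source_roots is hypothetical, and the source roots are the three directories named earlier in this thread.

```python
import os
import sys

# Source roots named in the original post; adjust them to your checkout.
SOURCE_ROOTS = ("dense_correspondence", "modules", "pytorch-segmentation-detection")


def add_source_roots(project_dir, roots=SOURCE_ROOTS):
    """Prepend the project and its source roots to sys.path and PYTHONPATH."""
    paths = [project_dir] + [os.path.join(project_dir, root) for root in roots]

    # sys.path affects imports in the current interpreter.
    for path in reversed(paths):
        if path not in sys.path:
            sys.path.insert(0, path)

    # PYTHONPATH is inherited by any subprocess started later.
    existing = [p for p in os.environ.get("PYTHONPATH", "").split(os.pathsep) if p]
    merged = paths + [p for p in existing if p not in paths]
    os.environ["PYTHONPATH"] = os.pathsep.join(merged)
    return paths


added = add_source_roots("/home/steve/experiments/pytorch-dense-correspondence")
print(added)
```

Calling this once at the top of a script, before any project imports, keeps the setup in one place rather than relying on IDE-specific source directory settings.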

Good luck!

noorvir commented 5 years ago

Regarding nvidia-docker not passing nvidia-smi test: try providing the cuda version. So for example:

nvidia-docker run --rm nvidia/cuda:9.1-base nvidia-smi

For me, running the latest nvidia/cuda image returns the following error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": unknown.

I'm guessing this is an issue with the latest nvidia/cuda image, which does not contain nvidia-smi.

peteflorence commented 5 years ago

Thank you @noorvir-a for your helpful note! Very helpful, this wasn’t on our radar.

mmaslennikov commented 5 years ago

Many thanks for your kind and detailed reply, even though this is outside the supported Docker-based workflow. I explored further and was able to launch training; it takes ~20 minutes on a GTX 1080 for the caterpillar object.

I overcame the following obstacles: (1) I am not clear about the "fully_conv" setting. I had to switch it off by specifying fully_conv=False during ResNet initialization in https://github.com/warmspringwinds/pytorch-segmentation-detection/blob/d6e7e8242ba72b7e53cee6703348de2d5ccc81e5/pytorch_segmentation_detection/models/resnet_dilated.py#L244. Otherwise, I got the error

RuntimeError: Error(s) in loading state_dict for ResNet:
size mismatch for fc.weight: copying a param of torch.Size([1000, 512, 1, 1]) from checkpoint, where the shape is torch.Size([1000, 512]) in current model.

The reason for the error is the line https://github.com/warmspringwinds/vision/blob/5e0a760fc847d55a4c1699410a14003452fa4581/torchvision/models/resnet.py#L153

self.fc = nn.Conv2d(512 * block.expansion, num_classes, 1)

which is called in /pytorch-segmentation-detection/vision/torchvision/models/resnet.py. The call produces a layer whose self.fc.weight has shape [1000, 512, 1, 1], while the checkpoint stores fc.weight with shape [1000, 512]; this mismatch causes the error.

It looks like "fully_conv" was added later into /pytorch-segmentation-detection/vision/torchvision/models/resnet.py, since https://github.com/pytorch/vision/blob/v0.2.1/torchvision/models/resnet.py does not have it (I tried different versions of this file, not only 0.2.1)

(How important is fully_conv? Any other ideas for fixing it?)
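One possible alternative to switching fully_conv off, sketched here under assumptions and not tested against the repository: a linear layer's weight of shape [out, in] holds the same numbers as a 1x1 convolution's weight of shape [out, in, 1, 1], so the checkpoint entry could be reshaped before loading. The helper linear_to_conv1x1_state is hypothetical, and the two standalone layers below only stand in for the fc layer of the checkpoint and of the fully convolutional model.

```python
import torch
import torch.nn as nn


def linear_to_conv1x1_state(linear_state):
    """Reshape a Linear weight [out, in] into a 1x1 Conv2d weight [out, in, 1, 1]."""
    converted = dict(linear_state)
    weight = linear_state["weight"]
    converted["weight"] = weight.reshape(weight.size(0), weight.size(1), 1, 1)
    return converted


torch.manual_seed(0)
linear = nn.Linear(512, 1000)               # stands in for the checkpoint's fc layer
conv = nn.Conv2d(512, 1000, kernel_size=1)  # stands in for the fully_conv fc layer

# Loading the unmodified linear weights is rejected because the shapes differ.
try:
    conv.load_state_dict(linear.state_dict())
    loaded_directly = True
except RuntimeError:
    loaded_directly = False

# After reshaping, the same values load without complaint.
conv.load_state_dict(linear_to_conv1x1_state(linear.state_dict()))

# Both layers now compute the same function on a 1x1 feature map.
features = torch.randn(2, 512)
with torch.no_grad():
    out_linear = linear(features)
    out_conv = conv(features.reshape(2, 512, 1, 1)).reshape(2, 1000)
max_diff = (out_linear - out_conv).abs().max().item()
print(loaded_directly, max_diff)
```

In the real model the same reshape would have to be applied to the fc.weight entry of the downloaded state dict before load_state_dict is called; whether that fits the vendored resnet.py is something I have not verified.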

(2) Switched off the tensor restructuring

https://github.com/warmspringwinds/vision/blob/5e0a760fc847d55a4c1699410a14003452fa4581/torchvision/models/resnet.py#L212 https://github.com/warmspringwinds/vision/blob/5e0a760fc847d55a4c1699410a14003452fa4581/torchvision/models/resnet.py#L213

if not self.fully_conv:
    x = x.view(x.size(0), -1)

(3) Fixed /dense_correspondence/correspondence_tools/correspondence_augmentation.py:

https://github.com/RobotLocomotion/pytorch-dense-correspondence/blob/c3b068adca006b828248fdf16b00aa7603d462e2/dense_correspondence/correspondence_tools/correspondence_augmentation.py#L68
image.height -> images[0].height

https://github.com/RobotLocomotion/pytorch-dense-correspondence/blob/c3b068adca006b828248fdf16b00aa7603d462e2/dense_correspondence/correspondence_tools/correspondence_augmentation.py#L82
image.width -> images[0].width

Regarding cuda (off-topic): I tried and got the following response

steve@gx501:~$ nvidia-docker run --rm nvidia/cuda:9.1-base nvidia-smi
docker: Error response from daemon: Unknown runtime specified nvidia.

RobotLocomotion / pytorch-dense-correspondence

Reproducing without docker #178