facebookarchive / caffe2

Caffe2 is a lightweight, modular, and scalable deep learning framework.
https://caffe2.ai
Apache License 2.0

The examples/resnet50_trainer.py example doesn't work #1197

Open jwatte opened 7 years ago

jwatte commented 7 years ago

The examples/resnet50_trainer.py example is supposed to show how to train a network across multiple GPUs. Unfortunately, running the example fails with a null blob:

jwatte@ripper:~/caffe2_notebooks$ python examples/resnet50_trainer.py --train_data tutorial_data/resnet_trainer/imagenet_cars_boats_train/ --test_data tutorial_data/resnet_trainer/imagenet_cars_boats_val/ --gpus 0,1 --num_epochs 10

Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
INFO:resnet50_trainer:Running on GPUs: [0, 1]
INFO:resnet50_trainer:Using epoch size: 1500000
INFO:data_parallel_model:Parallelizing model for devices: [0, 1]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Model for GPU : 1
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Add initial parameter sync
WARNING:data_parallel_model:------- DEPRECATED API, please use data_parallel_model.OptimizeGradientMemory() -----
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Memonger memory optimization took 0.0162951946259 secs
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Memonger memory optimization took 0.0155470371246 secs
INFO:resnet50_trainer:----- Create test net ----
INFO:data_parallel_model:Parallelizing model for devices: [0, 1]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Model for GPU : 1
INFO:data_parallel_model:Parameter update function not defined --> only forward
Traceback for operator 2 in network resnet50_test
/usr/local/caffe2/python/helpers/conv.py:134
/usr/local/caffe2/python/helpers/conv.py:181
/usr/local/caffe2/python/brew.py:100
/usr/local/caffe2/python/models/resnet.py:234
examples/resnet50_trainer.py:302
/usr/local/caffe2/python/data_parallel_model.py:169
examples/resnet50_trainer.py:403
examples/resnet50_trainer.py:525
examples/resnet50_trainer.py:529
Traceback (most recent call last):
  File "examples/resnet50_trainer.py", line 529, in <module>
    main()
  File "examples/resnet50_trainer.py", line 525, in main
    Train(args)
  File "examples/resnet50_trainer.py", line 406, in Train
    workspace.CreateNet(test_model.net)
  File "/usr/local/caffe2/python/workspace.py", line 147, in CreateNet
    StringifyProto(net), overwrite,
  File "/usr/local/caffe2/python/workspace.py", line 166, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [enforce fail at operator.cc:31] blob != nullptr. op Conv: Encountered a non-existing input blob: gpu_0/conv1_w
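The error comes from workspace.CreateNet, which instantiates every operator in the net and enforces that each input blob already exists in the workspace. A minimal sketch (a hypothetical toy model, not code from this issue) that triggers the same RuntimeError by calling CreateNet before the model's param_init_net has been run:

import numpy as np
from caffe2.python import workspace, model_helper, brew

# Toy model: a single fully-connected layer.
model = model_helper.ModelHelper(name="repro")
brew.fc(model, "data", "fc1", dim_in=10, dim_out=4)

workspace.FeedBlob("data", np.random.rand(2, 10).astype(np.float32))

# param_init_net is never run, so fc1_w / fc1_b do not exist yet and
# CreateNet fails with "Encountered a non-existing input blob".
workspace.CreateNet(model.net)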

Is there any example in the Caffe2 code base of a complete, working design-train-infer workflow, preferably with multi-GPU support?

aropan commented 7 years ago

Does the training process work without --test_data tutorial_data/resnet_trainer/imagenet_cars_boats_val/? If so, try running it that way.
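That is, something like the same invocation with the test-data flag dropped:

python examples/resnet50_trainer.py --train_data tutorial_data/resnet_trainer/imagenet_cars_boats_train/ --gpus 0,1 --num_epochs 10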

jwatte commented 7 years ago

Yes, your suggestion works -- the problem seems to be that the parameter blobs have not been created yet when the test network is created. Running the training model's param_init_net up front, before the test net is created, fixes it. Patch attached below.

jwatte commented 7 years ago
diff --git a/caffe2/python/examples/resnet50_trainer.py b/caffe2/python/examples/resnet50_trainer.py
index c40b217d..54d6c171 100644
--- a/caffe2/python/examples/resnet50_trainer.py
+++ b/caffe2/python/examples/resnet50_trainer.py
@@ -365,6 +365,9 @@ def Train(args):
         cpu_device=args.use_cpu,
     )

+    # Run train model once to initialize blob names
+    workspace.RunNetOnce(train_model.param_init_net)
+
     # Add test model, if specified
     test_model = None
     if (args.test_data is not None):
@@ -405,7 +408,6 @@ def Train(args):
         workspace.RunNetOnce(test_model.param_init_net)
         workspace.CreateNet(test_model.net)

-    workspace.RunNetOnce(train_model.param_init_net)
     workspace.CreateNet(train_model.net)

     epoch = 0
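For reference, here is a self-contained sketch of the ordering the patch establishes, using a hypothetical toy model rather than the actual ResNet-50 trainer. The test model is built with init_params=False so it reuses the training model's parameter blobs, mirroring how resnet50_trainer.py builds its test model:

import numpy as np
from caffe2.python import workspace, model_helper, brew

# Training model owns the parameters.
train_model = model_helper.ModelHelper(name="toy_train")
brew.fc(train_model, "data", "fc1", dim_in=10, dim_out=4)

# The test model shares the training model's parameters, so its own
# param_init_net does not create fc1_w / fc1_b.
test_model = model_helper.ModelHelper(name="toy_test", init_params=False)
brew.fc(test_model, "data", "fc1", dim_in=10, dim_out=4)

workspace.FeedBlob("data", np.random.rand(2, 10).astype(np.float32))

workspace.RunNetOnce(train_model.param_init_net)  # create parameter blobs first
workspace.RunNetOnce(test_model.param_init_net)
workspace.CreateNet(test_model.net)               # now finds fc1_w / fc1_b
workspace.CreateNet(train_model.net)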
lironmo commented 6 years ago

Where can I find a C++ implementation similar to caffe2/python/examples/resnet50_trainer.py, with approximately the same APIs?

I want to compare the footprint and performance of Caffe2 vs. tiny-dnn for training on Linux-based systems and on Android/iOS.