Open jwatte opened 7 years ago
Does training work if you leave out --test_data tutorial_data/resnet_trainer/imagenet_cars_boats_val/?
If so, try running it that way.
Yes, your suggestion works -- the problem seems to be that the blobs have not been created yet when running the training network initialization. Moving the training net initialization up front fixes it. Attaching a patch.
diff --git a/caffe2/python/examples/resnet50_trainer.py b/caffe2/python/examples/resnet50_trainer.py
index c40b217d..54d6c171 100644
--- a/caffe2/python/examples/resnet50_trainer.py
+++ b/caffe2/python/examples/resnet50_trainer.py
@@ -365,6 +365,9 @@ def Train(args):
         cpu_device=args.use_cpu,
     )
 
+    # Run train model once to initialize blob names
+    workspace.RunNetOnce(train_model.param_init_net)
+
     # Add test model, if specified
     test_model = None
     if (args.test_data is not None):
@@ -405,7 +408,6 @@ def Train(args):
         workspace.RunNetOnce(test_model.param_init_net)
         workspace.CreateNet(test_model.net)
 
-    workspace.RunNetOnce(train_model.param_init_net)
     workspace.CreateNet(train_model.net)
 
     epoch = 0
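For readers following along, here is a toy illustration of why the ordering matters. It is plain Python and does not use Caffe2 at all: ToyWorkspace, run_init and create_net are hypothetical stand-ins for illustration only. It assumes, as the error message suggests, that the test net reads parameter blobs that only the training init net creates.

```python
# Toy model of the ordering problem. ToyWorkspace is a hypothetical stand-in,
# not Caffe2's workspace API.
class ToyWorkspace:
    def __init__(self):
        self.blobs = set()

    def run_init(self, created_blobs):
        """Stand-in for running a param_init_net: registers the blobs it creates."""
        self.blobs.update(created_blobs)

    def create_net(self, name, input_blobs):
        """Stand-in for net creation: every input blob must already exist."""
        missing = [b for b in input_blobs if b not in self.blobs]
        if missing:
            raise RuntimeError(
                "net %s: non-existing input blob: %s" % (name, missing[0])
            )
        return name


# Assumption: the test net reuses the parameters owned by the training net.
TRAIN_PARAMS = ["gpu_0/conv1_w"]
TEST_INPUTS = ["gpu_0/conv1_w"]


def original_order():
    ws = ToyWorkspace()
    # The test net is created before the training parameters exist, so this raises.
    ws.create_net("resnet50_test", TEST_INPUTS)
    ws.run_init(TRAIN_PARAMS)


def patched_order():
    ws = ToyWorkspace()
    # Initialize the training parameters first, then create the test net.
    ws.run_init(TRAIN_PARAMS)
    return ws.create_net("resnet50_test", TEST_INPUTS)
```

Calling original_order() raises a RuntimeError naming gpu_0/conv1_w, while patched_order() succeeds, which mirrors the effect of moving the RunNetOnce call in the patch.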
Where can I find a C++ implementation similar to caffe2/python/examples/resnet50_trainer.py, with approximately the same APIs?
I want to compare the footprint and training performance of caffe2 and tiny-dnn on Linux-based systems and on Android/iOS.
The examples/resnet50_trainer.py example is supposed to show how to train a network across multiple GPUs. Unfortunately, running the example fails with a null blob:
jwatte@ripper:~/caffe2_notebooks$ python examples/resnet50_trainer.py --train_data tutorial_data/resnet_trainer/imagenet_cars_boats_train/ --test_data tutorial_data/resnet_trainer/imagenet_cars_boats_val/ --gpus 0,1 --num_epochs 10
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
INFO:resnet50_trainer:Running on GPUs: [0, 1]
INFO:resnet50_trainer:Using epoch size: 1500000
INFO:data_parallel_model:Parallelizing model for devices: [0, 1]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Model for GPU : 1
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Add initial parameter sync
WARNING:data_parallel_model:------- DEPRECATED API, please use data_parallel_model.OptimizeGradientMemory() -----
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Memonger memory optimization took 0.0162951946259 secs
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Memonger memory optimization took 0.0155470371246 secs
INFO:resnet50_trainer:----- Create test net ----
INFO:data_parallel_model:Parallelizing model for devices: [0, 1]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Model for GPU : 1
INFO:data_parallel_model:Parameter update function not defined --> only forward
Traceback for operator 2 in network resnet50_test
/usr/local/caffe2/python/helpers/conv.py:134
/usr/local/caffe2/python/helpers/conv.py:181
/usr/local/caffe2/python/brew.py:100
/usr/local/caffe2/python/models/resnet.py:234
examples/resnet50_trainer.py:302
/usr/local/caffe2/python/data_parallel_model.py:169
examples/resnet50_trainer.py:403
examples/resnet50_trainer.py:525
examples/resnet50_trainer.py:529
Traceback (most recent call last):
  File "examples/resnet50_trainer.py", line 529, in
    main()
  File "examples/resnet50_trainer.py", line 525, in main
    Train(args)
  File "examples/resnet50_trainer.py", line 406, in Train
    workspace.CreateNet(test_model.net)
  File "/usr/local/caffe2/python/workspace.py", line 147, in CreateNet
    StringifyProto(net), overwrite,
  File "/usr/local/caffe2/python/workspace.py", line 166, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [enforce fail at operator.cc:31] blob != nullptr. op Conv: Encountered a non-existing input blob: gpu_0/conv1_w
Is there any code example of an end-to-end working design-train-infer network in the caffe2 code base? Preferably with multi-GPU support?
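As a debugging aid for the enforce failure in the traceback above, one could list which input blobs are missing before a net is created. The sketch below is plain Python with no Caffe2 dependency: missing_input_blobs is a hypothetical helper, and the blob names in the demo (other than gpu_0/conv1_w from the error message) are made up for illustration.

```python
def missing_input_blobs(input_names, blob_exists):
    """Return the names in input_names for which blob_exists(name) is false."""
    return [name for name in input_names if not blob_exists(name)]


# Hypothetical stand-in for a workspace: the set of blob names that already exist.
existing = {"gpu_0/data", "gpu_0/label"}

# Hypothetical list of the inputs a net expects to find.
inputs = ["gpu_0/data", "gpu_0/conv1_w", "gpu_0/label"]

# Set membership serves as the existence check in this demo.
missing = missing_input_blobs(inputs, existing.__contains__)
print(missing)
```

With Caffe2 itself, the same check could presumably be driven by test_model.net.Proto().external_input and workspace.HasBlob, but that pairing is an untested assumption here.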