Open lixiangchun opened 6 years ago
@lixiangchun Below is an example. Hope it can help you.
manifest = [train:/dataset/aeon/V3D/ucf-extracted/train-index.csv, test:/dataset/aeon/V3D/ucf-extracted/test-index.csv]
manifest_root = /dataset/aeon/V3D/ucf-extracted
backend = gpu
epochs = 10
batch_size = 32
eval_freq = 1
log = video-c3d.log
output_file = video-c3d.hdf5
device_id = 0
data_dir = /dataset
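Before starting a long training run, it may help to check that the manifest files named in the config actually exist. The following is only a minimal sketch, assuming the config is plain `key = value` lines and the manifest uses the `[name:path, name:path]` form shown above. The helper names `parse_cfg`, `manifest_paths`, and `missing_manifests` are hypothetical and not part of neon.

```python
import os


def parse_cfg(text):
    """Parse 'key = value' lines into a dict, skipping blank and '#' lines."""
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        cfg[key.strip()] = value.strip()
    return cfg


def manifest_paths(value):
    """Split '[train:/a.csv, test:/b.csv]' into a {name: path} dict."""
    inner = value.strip().lstrip("[").rstrip("]")
    paths = {}
    for item in inner.split(","):
        name, path = item.split(":", 1)
        paths[name.strip()] = path.strip()
    return paths


def missing_manifests(cfg):
    """Return the manifest paths from the config that are not existing files."""
    return [p for p in manifest_paths(cfg["manifest"]).values()
            if not os.path.isfile(p)]
```

For example, `missing_manifests(parse_cfg(open("train.cfg").read()))` should return an empty list when both index files are in place.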
@baojun-nervana Thanks for your help.
Error in running python3 examples/video-c3d/train.py:
Traceback (most recent call last):
File "/media/storage1/software/github/neon/examples/video-c3d/train.py", line 31, in <module>
parser = NeonArgparser(__doc__, default_config_files=config_files)
File "/usr/local/lib/python3.5/dist-packages/neon/util/argparser.py", line 80, in __init__
super(NeonArgparser, self).__init__(*args, **kwargs)
TypeError: __init__() got multiple values for argument 'add_config_file_help'
That looks like a configargparse version issue: the error occurs with the newest version of configargparse. The requirements.txt file recommends using the following version.
configargparse==0.10.0
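To confirm which version is actually installed, a small check like the one below may help. This is a sketch, not part of neon: it assumes Python 3.8 or later for `importlib.metadata` (on the Python 3.5 shown in the traceback, `pip show configargparse` gives the same information), and the helper names `installed_version` and `matches_pin` are hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(package, pinned):
    """True only if the package is installed at exactly the pinned version."""
    return installed_version(package) == pinned
```

Calling `matches_pin("configargparse", "0.10.0")` returns False if a different release, or none at all, is installed.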
Thanks, it works now.
However, I found that this repo only supports CPU or MKL as the backend. The training process is very slow.
How can I enable the GPU backend for this repo?
@lixiangchun The example can run with the GPU backend. What error did you see with the GPU backend? You might need to install the GPU dependencies: https://github.com/NervanaSystems/neon/blob/master/gpu_requirements.txt
@baojun-nervana After installing all packages in gpu_requirements.txt, the GPU backend can be used; however, the following error occurs:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/pycuda/tools.py", line 426, in context_dependent_memoize
return ctx_dict[cur_ctx][args]
KeyError: <pycuda._driver.Context object at 0x7f3534cbe450>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/media/storage1/software/github/neon/examples/video-c3d/train.py", line 57, in <module>
model.fit(train, optimizer=opt, num_epochs=args.epochs, cost=cost, callbacks=callbacks)
File "/usr/local/lib/python3.5/dist-packages/neon/models/model.py", line 183, in fit
self._epoch_fit(dataset, callbacks)
File "/usr/local/lib/python3.5/dist-packages/neon/models/model.py", line 205, in _epoch_fit
x = self.fprop(x)
File "/usr/local/lib/python3.5/dist-packages/neon/models/model.py", line 236, in fprop
res = self.layers.fprop(x, inference)
File "/usr/local/lib/python3.5/dist-packages/neon/layers/container.py", line 395, in fprop
x = l.fprop(x, inference=inference)
File "/usr/local/lib/python3.5/dist-packages/neon/layers/layer.py", line 1061, in fprop
bias=self.weight_bias, bsum=self.batch_sum, layer_op=self)
File "/usr/local/lib/python3.5/dist-packages/neon/backends/nervanagpu.py", line 1990, in fprop_conv
return self._execute_conv("fprop", layer, layer.fprop_kernels, repeat)
File "/usr/local/lib/python3.5/dist-packages/neon/backends/nervanagpu.py", line 2072, in _execute_conv
kernels.execute(repeat)
File "/usr/local/lib/python3.5/dist-packages/neon/backends/convolution.py", line 551, in execute
kernel = kernel_specs.get_kernel(self.kernel_name, self.kernel_options)
File "<decorator-gen-35>", line 2, in get_kernel
File "/usr/local/lib/python3.5/dist-packages/pycuda/tools.py", line 430, in context_dependent_memoize
result = func(*args)
File "/usr/local/lib/python3.5/dist-packages/neon/backends/kernel_specs.py", line 842, in get_kernel
run_command([ "ptxas -v -arch", arch, "-o", cubin_file, ptx_file ])
File "/usr/local/lib/python3.5/dist-packages/neon/backends/kernel_specs.py", line 785, in run_command
raise RuntimeError("Error(%d):\n%s\n%s" % (proc.returncode, cmd, err))
RuntimeError: Error(136):
ptxas -v -arch sm_61 -o /home/lixc/.cache/neon/kernels/cubin/sconv_direct_fprop_64x32_SN_bias.cubin /home/lixc/.cache/neon/kernels/ptx/sconv_direct_fprop_64x32_SN_bias.ptx
b'Floating point exception (core dumped)\n'
My train.cfg is:
manifest = [train:/media/storage1/project/deep_learning/c3d_ucf/data/ucf-extracted/train-index.csv, test:/media/storage1/project/deep_learning/c3d_ucf/data/ucf-extracted/test-index.csv]
manifest_root = /media/storage1/project/deep_learning/c3d_ucf/data/ucf-extracted
backend = gpu
epochs = 10
batch_size = 16
eval_freq = 1
log = video-c3d.log
output_file = video-c3d.hdf5
device_id = 1
data_dir = train_output_dir
serialize = 1
Training was done via:
export LD_LIBRARY_PATH=/media/storage1/software/github/neon/mklml_lnx_2018.0.1.20171227/lib:$LD_LIBRARY_PATH
python3 /media/storage1/software/github/neon/examples/video-c3d/train.py -c train.cfg
@lixiangchun Are you using cuda9? I am using cuda8, and there was an issue reported with cuda9.
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
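The release number can also be read from that output in a script, which may be handy when several CUDA toolkits are installed. This is a rough sketch: it assumes `nvcc` is on the PATH and that the output contains a `release X.Y` string as in the output above. The function names `cuda_release` and `installed_cuda_release` are hypothetical.

```python
import re
import subprocess


def cuda_release(nvcc_output):
    """Extract (major, minor) from 'nvcc --version' text, or None if not found."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_cuda_release():
    """Run 'nvcc --version' and parse its release number (assumes nvcc on PATH)."""
    output = subprocess.check_output(["nvcc", "--version"]).decode()
    return cuda_release(output)
```

A result with major version 9 from `installed_cuda_release()` would indicate the toolkit this example was reported to have problems with.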
@baojun-nervana Thanks. Yes, I use cuda9. Will go back to cuda8 and try again.
Is there an example of the contents of train.cfg used in video-c3d?