NervanaSystems / neon

Intel® Nervana™ reference deep learning framework committed to best performance on all hardware
http://neon.nervanasys.com/docs/latest
Apache License 2.0

train.cfg in video-c3d #448

Open lixiangchun opened 6 years ago

lixiangchun commented 6 years ago

Is there an example of the contents of the train.cfg used in video-c3d?

baojun-nervana commented 6 years ago

@lixiangchun Below is an example. Hope it can help you.

manifest = [train:/dataset/aeon/V3D/ucf-extracted/train-index.csv, test:/dataset/aeon/V3D/ucf-extracted/test-index.csv]
manifest_root = /dataset/aeon/V3D/ucf-extracted
backend = gpu
epochs = 10
batch_size = 32
eval_freq = 1
log = video-c3d.log
output_file = video-c3d.hdf5
device_id = 0
data_dir = /dataset

lixiangchun commented 6 years ago

@baojun-nervana Thanks for your help.

I get an error when running python3 examples/video-c3d/train.py:

Traceback (most recent call last):
  File "/media/storage1/software/github/neon/examples/video-c3d/train.py", line 31, in <module>
    parser = NeonArgparser(__doc__, default_config_files=config_files)
  File "/usr/local/lib/python3.5/dist-packages/neon/util/argparser.py", line 80, in __init__
    super(NeonArgparser, self).__init__(*args, **kwargs)
TypeError: __init__() got multiple values for argument 'add_config_file_help'

baojun-nervana commented 6 years ago

This is probably a configargparse version issue: the error occurs with the newest configargparse release. The requirements.txt file recommends the following version:

configargparse==0.10.0
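
A minimal sketch for checking whether the installed configargparse matches the pinned version above. The helper names (`parse_version`, `pinned_ok`, `installed_version`) are hypothetical and not part of neon or configargparse; the comparison assumes purely numeric version strings such as `0.10.0`.

```python
# Hypothetical helpers, not part of neon or configargparse.

def parse_version(text):
    """Turn '0.10.0' into (0, 10, 0). Assumes purely numeric components."""
    return tuple(int(part) for part in text.strip().split("."))


def pinned_ok(installed, pinned="0.10.0"):
    """True if the installed version string equals the pinned one."""
    return parse_version(installed) == parse_version(pinned)


def installed_version(package="configargparse"):
    """Return the installed version string, or None if it cannot be found."""
    try:
        from importlib.metadata import version  # Python 3.8+
        return version(package)
    except Exception:
        pass
    try:
        import pkg_resources  # fallback for older Pythons (needs setuptools)
        return pkg_resources.get_distribution(package).version
    except Exception:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("configargparse is not installed")
    else:
        print(found, "matches pin" if pinned_ok(found) else "differs from pin")
```

If the versions differ, reinstalling with `pip install configargparse==0.10.0` should restore the pinned release.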

lixiangchun commented 6 years ago

Thanks, it works now.

However, I found that this repo only supports CPU or MKL as the backend. The training process is very slow.

How can I enable the GPU backend for this repo?

baojun-nervana commented 6 years ago

@lixiangchun The example can run with the GPU backend. What error did you see with the GPU backend? You might need to install the GPU dependencies: https://github.com/NervanaSystems/neon/blob/master/gpu_requirements.txt
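
A small sketch for checking whether a GPU dependency can be imported before switching the backend. `missing_modules` is a hypothetical helper; `pycuda` is used as the example because it is the GPU package that neon's GPU backend calls into, and the full contents of gpu_requirements.txt are not reproduced here.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # An empty list means every listed module is importable.
    print(missing_modules(["pycuda"]))
```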

lixiangchun commented 6 years ago

@baojun-nervana After installing all packages in gpu_requirements.txt, the GPU backend can be used; however, the following error occurs:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/pycuda/tools.py", line 426, in context_dependent_memoize
    return ctx_dict[cur_ctx][args]
KeyError: <pycuda._driver.Context object at 0x7f3534cbe450>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/media/storage1/software/github/neon/examples/video-c3d/train.py", line 57, in <module>
    model.fit(train, optimizer=opt, num_epochs=args.epochs, cost=cost, callbacks=callbacks)
  File "/usr/local/lib/python3.5/dist-packages/neon/models/model.py", line 183, in fit
    self._epoch_fit(dataset, callbacks)
  File "/usr/local/lib/python3.5/dist-packages/neon/models/model.py", line 205, in _epoch_fit
    x = self.fprop(x)
  File "/usr/local/lib/python3.5/dist-packages/neon/models/model.py", line 236, in fprop
    res = self.layers.fprop(x, inference)
  File "/usr/local/lib/python3.5/dist-packages/neon/layers/container.py", line 395, in fprop
    x = l.fprop(x, inference=inference)
  File "/usr/local/lib/python3.5/dist-packages/neon/layers/layer.py", line 1061, in fprop
    bias=self.weight_bias, bsum=self.batch_sum, layer_op=self)
  File "/usr/local/lib/python3.5/dist-packages/neon/backends/nervanagpu.py", line 1990, in fprop_conv
    return self._execute_conv("fprop", layer, layer.fprop_kernels, repeat)
  File "/usr/local/lib/python3.5/dist-packages/neon/backends/nervanagpu.py", line 2072, in _execute_conv
    kernels.execute(repeat)
  File "/usr/local/lib/python3.5/dist-packages/neon/backends/convolution.py", line 551, in execute
    kernel = kernel_specs.get_kernel(self.kernel_name, self.kernel_options)
  File "<decorator-gen-35>", line 2, in get_kernel
  File "/usr/local/lib/python3.5/dist-packages/pycuda/tools.py", line 430, in context_dependent_memoize
    result = func(*args)
  File "/usr/local/lib/python3.5/dist-packages/neon/backends/kernel_specs.py", line 842, in get_kernel
    run_command([ "ptxas -v -arch", arch, "-o", cubin_file, ptx_file ])
  File "/usr/local/lib/python3.5/dist-packages/neon/backends/kernel_specs.py", line 785, in run_command
    raise RuntimeError("Error(%d):\n%s\n%s" % (proc.returncode, cmd, err))
RuntimeError: Error(136):
ptxas -v -arch sm_61 -o /home/lixc/.cache/neon/kernels/cubin/sconv_direct_fprop_64x32_SN_bias.cubin /home/lixc/.cache/neon/kernels/ptx/sconv_direct_fprop_64x32_SN_bias.ptx
b'Floating point exception (core dumped)\n'
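
The status in `Error(136)` is consistent with the message above it, assuming the command was run through a shell: shells report 128 plus the signal number when a child is killed by a signal, and 136 - 128 = 8 is SIGFPE on Linux. A minimal sketch of that decoding, with `signal_from_exit_status` as a hypothetical helper:

```python
import signal


def signal_from_exit_status(status):
    """Map a shell exit status above 128 to a signal name, else None."""
    if status <= 128:
        return None
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        # The number does not correspond to a signal on this platform.
        return None


if __name__ == "__main__":
    print(signal_from_exit_status(136))
```

This suggests that ptxas itself crashed, rather than rejecting the PTX file with an ordinary compile error.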

My train.cfg is:

manifest = [train:/media/storage1/project/deep_learning/c3d_ucf/data/ucf-extracted/train-index.csv, test:/media/storage1/project/deep_learning/c3d_ucf/data/ucf-extracted/test-index.csv]
manifest_root = /media/storage1/project/deep_learning/c3d_ucf/data/ucf-extracted
backend = gpu
epochs = 10
batch_size = 16
eval_freq = 1
log = video-c3d.log
output_file = video-c3d.hdf5
device_id = 1
data_dir = train_output_dir
serialize = 1

Training was done via:

export LD_LIBRARY_PATH=/media/storage1/software/github/neon/mklml_lnx_2018.0.1.20171227/lib:$LD_LIBRARY_PATH
python3 /media/storage1/software/github/neon/examples/video-c3d/train.py -c train.cfg

baojun-nervana commented 6 years ago

@lixiangchun Are you using CUDA 9? I am using CUDA 8, and there was an issue reported with CUDA 9.

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
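
A minimal sketch for reading the CUDA release out of `nvcc --version` programmatically. The function names are hypothetical, and the regular expression assumes the `release X.Y` wording shown in the output above.

```python
import re
import subprocess


def cuda_release(nvcc_output):
    """Extract (major, minor) from `nvcc --version` text, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_cuda_release():
    """Run nvcc and parse its output; None if nvcc is missing or fails."""
    try:
        raw = subprocess.check_output(["nvcc", "--version"])
    except (OSError, subprocess.CalledProcessError):
        return None
    return cuda_release(raw.decode("utf-8", "replace"))


if __name__ == "__main__":
    print(installed_cuda_release())
```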

lixiangchun commented 6 years ago

@baojun-nervana Thanks. Yes, I use cuda9. Will go back to cuda8 and try again.