Open harshalcse opened 5 years ago
@Bartzi Right now I get this error. I tried reinstalling NCCL, chainer, and cupy, and even upgrading setuptools, but it still does not work.
Traceback (most recent call last):
File "chainer/train_svhn.py", line 146, in
Please help me with this.
Did you try to install cupy with verbose logging on? Did you check that cupy is able to find NCCL? How do you install cupy?
@Bartzi
Now I tried the following command:
python3 chainer/train_svhn.py datasets/train_dataextract_train/curriculum.json ./log --char-map datasets/svhn/svhn_char_map.json --blank-label 0 -li 10 -b=8 -g=0 -lr 0.000001 --epochs 10 --lr-step 0
but the following error comes up:
ure, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py:151: UserWarning: optimizer.eps is changed to 1e-08 by MultiprocessParallelUpdater for new batch size.
format(optimizer.eps))
Exception in main training loop:
Invalid operation is performed in: Concat (Forward)
Expect: in_types.size > 0
Actual: 0 <= 0
Traceback (most recent call last):
File "/usr/lib/python3.5/site-packages/chainer/training/trainer.py", line 315, in run
update()
File "/usr/lib/python3.5/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
self.update_core()
File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 235, in update_core
loss = _calc_loss(self._master, batch)
File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 269, in _calc_loss
return model(*in_arrays)
File "/root/see-master/chainer/utils/multi_accuracy_classifier.py", line 44, in __call__
self.y = self.predictor(*x)
File "/root/see-master/chainer/models/svhn.py", line 214, in __call__
return self.recognition_net(images, h)
File "/root/see-master/chainer/models/svhn.py", line 138, in __call__
final_lstm_predictions = F.concat(final_lstm_predictions, axis=1)
File "/usr/lib/python3.5/site-packages/chainer/functions/array/concat.py", line 104, in concat
y, = Concat(axis).apply(xs)
File "/usr/lib/python3.5/site-packages/chainer/function_node.py", line 245, in apply
self._check_data_type_forward(in_data)
File "/usr/lib/python3.5/site-packages/chainer/function_node.py", line 330, in _check_data_type_forward
self.check_type_forward(in_type)
File "/usr/lib/python3.5/site-packages/chainer/functions/array/concat.py", line 23, in check_type_forward
type_check.expect(in_types.size() > 0)
File "/usr/lib/python3.5/site-packages/chainer/utils/type_check.py", line 546, in expect
expr.expect()
File "/usr/lib/python3.5/site-packages/chainer/utils/type_check.py", line 483, in expect
'{0} {1} {2}'.format(left, self.inv, right))
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "chainer/train_svhn.py", line 258, in <module>
trainer.run()
File "/usr/lib/python3.5/site-packages/chainer/training/trainer.py", line 329, in run
six.reraise(*sys.exc_info())
File "/usr/lib/python3.5/site-packages/six.py", line 693, in reraise
raise value
File "/usr/lib/python3.5/site-packages/chainer/training/trainer.py", line 315, in run
update()
File "/usr/lib/python3.5/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
self.update_core()
File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 235, in update_core
loss = _calc_loss(self._master, batch)
File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 269, in _calc_loss
return model(*in_arrays)
File "/root/see-master/chainer/utils/multi_accuracy_classifier.py", line 44, in __call__
self.y = self.predictor(*x)
File "/root/see-master/chainer/models/svhn.py", line 214, in __call__
return self.recognition_net(images, h)
File "/root/see-master/chainer/models/svhn.py", line 138, in __call__
final_lstm_predictions = F.concat(final_lstm_predictions, axis=1)
File "/usr/lib/python3.5/site-packages/chainer/functions/array/concat.py", line 104, in concat
y, = Concat(axis).apply(xs)
File "/usr/lib/python3.5/site-packages/chainer/function_node.py", line 245, in apply
self._check_data_type_forward(in_data)
File "/usr/lib/python3.5/site-packages/chainer/function_node.py", line 330, in _check_data_type_forward
self.check_type_forward(in_type)
File "/usr/lib/python3.5/site-packages/chainer/functions/array/concat.py", line 23, in check_type_forward
type_check.expect(in_types.size() > 0)
File "/usr/lib/python3.5/site-packages/chainer/utils/type_check.py", line 546, in expect
expr.expect()
File "/usr/lib/python3.5/site-packages/chainer/utils/type_check.py", line 483, in expect
'{0} {1} {2}'.format(left, self.inv, right))
chainer.utils.type_check.InvalidType:
Invalid operation is performed in: Concat (Forward)
Expect: in_types.size > 0
Actual: 0 <= 0
I am using CUDA 9.0. Please help me, I have been stuck on this since last week.
@Bartzi
Now I get the following error:
Exception in main training loop: cudaErrorNoDevice: no CUDA-capable device is detected
Traceback (most recent call last):
File "/root/.see-master/lib/python3.5/site-packages/chainer/training/trainer.py", line 302, in run
entry.extension(self)
File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
self.gen.throw(type, value, traceback)
File "/root/.see-master/lib/python3.5/site-packages/chainer/reporter.py", line 98, in scope
yield
File "/root/.see-master/lib/python3.5/site-packages/chainer/training/trainer.py", line 299, in run
update()
File "/root/.see-master/lib/python3.5/site-packages/chainer/training/updater.py", line 223, in update
self.update_core()
File "/root/.see-master/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 195, in update_core
self.setup_workers()
File "/root/.see-master/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 186, in setup_workers
with cuda.Device(self._devices[0]):
File "cupy/cuda/device.pyx", line 106, in cupy.cuda.device.Device.__enter__
File "cupy/cuda/runtime.pyx", line 164, in cupy.cuda.runtime.getDevice
File "cupy/cuda/runtime.pyx", line 136, in cupy.cuda.runtime.check_status
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "chainer/train_svhn.py", line 258, in <module>
trainer.run()
File "/root/.see-master/lib/python3.5/site-packages/chainer/training/trainer.py", line 313, in run
six.reraise(*sys.exc_info())
File "/usr/lib/python3.5/site-packages/six.py", line 693, in reraise
raise value
File "/root/.see-master/lib/python3.5/site-packages/chainer/training/trainer.py", line 302, in run
entry.extension(self)
File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
self.gen.throw(type, value, traceback)
File "/root/.see-master/lib/python3.5/site-packages/chainer/reporter.py", line 98, in scope
yield
File "/root/.see-master/lib/python3.5/site-packages/chainer/training/trainer.py", line 299, in run
update()
File "/root/.see-master/lib/python3.5/site-packages/chainer/training/updater.py", line 223, in update
self.update_core()
File "/root/.see-master/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 195, in update_core
self.setup_workers()
File "/root/.see-master/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 186, in setup_workers
with cuda.Device(self._devices[0]):
File "cupy/cuda/device.pyx", line 106, in cupy.cuda.device.Device.__enter__
File "cupy/cuda/runtime.pyx", line 164, in cupy.cuda.runtime.getDevice
File "cupy/cuda/runtime.pyx", line 136, in cupy.cuda.runtime.check_status
cupy.cuda.runtime.CUDARuntimeError: cudaErrorNoDevice: no CUDA-capable device is detected
Please help me.
I don't know what to say except that chainer can not find your GPU. Have you tried turning it off and on again?
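One way to narrow this down is to ask the CUDA runtime how many devices it can see, independently of the training script. The sketch below assumes CuPy is importable. `count_cuda_devices` is a hypothetical helper name, and `cupy.cuda.runtime` is the same module that raises the error in the traceback above.

```python
def count_cuda_devices():
    """Return how many CUDA devices CuPy can see, or None if CuPy is not importable."""
    try:
        import cupy
    except ImportError:
        # CuPy is not installed (or its shared libraries cannot be loaded).
        return None
    try:
        return cupy.cuda.runtime.getDeviceCount()
    except cupy.cuda.runtime.CUDARuntimeError:
        # The runtime reports an error such as cudaErrorNoDevice.
        return 0


if __name__ == "__main__":
    n = count_cuda_devices()
    if n is None:
        print("CuPy could not be imported")
    else:
        print("CUDA devices visible to CuPy:", n)
```

If this reports zero devices while `nvidia-smi` lists a GPU, the problem is more likely in the driver or library setup than in the training script.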
@Bartzi
I tried that. I am using an AWS p3.2xlarge instance with the following GPU details:
root@awsml04:~# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
Environment variables for the default CUDA 9.0 installation:
export CUDA_PATH=/usr/local/cuda/bin${PATH:+:${CUDA_PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64{LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
Chainer versions:
chainer==3.2.0
chainerui==0.3.0
CuPy version:
cupy==5.3.0
I also rebooted the server, but the issue persists. Please help.
Hmm, I'm not sure, but chainer==3.2.0 and cupy==5.3.0 does not sound like a good combination. You should try to use a chainer version that is similar to the cupy version (as it is newer).
Apart from that I don't know. How did you install cupy?
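The version concern can be checked mechanically. This is a minimal sketch that assumes "similar" can be approximated by matching major version numbers. That is a heuristic, not an official compatibility table, and `major` and `majors_match` are hypothetical helper names.

```python
def major(version):
    """Return the leading integer of a version string such as '5.3.0'."""
    return int(version.split(".")[0])


def majors_match(chainer_version, cupy_version):
    """Heuristic: treat the two packages as compatible if their major versions agree."""
    return major(chainer_version) == major(cupy_version)


# The pair reported in this thread.
print(majors_match("3.2.0", "5.3.0"))  # False
# A pair with aligned versions.
print(majors_match("5.3.0", "5.3.0"))  # True
```

On a live system the two arguments could come from `chainer.__version__` and `cupy.__version__`.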
@Bartzi I tried with cupy-cuda90, still the same error.
Did you try installing cupy with pip install cupy and letting the system compile it for you?
@Bartzi Yes, I already tried that. The issue still persists, please help.
Tbh, I don't know how to solve this issue. Did you try it with some chainer examples? If they do not work either, it might be a good idea to ask the developers of the framework...
@Bartzi I already posted the complete scenario on the chainer and cupy GitHub, Stack Overflow, and the Google developers group, but there is still no resolution.
I applied the following commands:
$ export CUDA_PATH=/usr/local/cuda-9.0
$ export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64
$ pip3 uninstall -y chainer cupy cupy-cuda80 cupy-cuda90 cupy-cuda92
$ pip3 install cupy-cuda90 --no-cache-dir && pip3 install chainer --no-cache-dir
$ git clone https://github.com/chainer/chainer.git && cd chainer && git checkout v5.3.0
$ python3 examples/mnist/train_mnist.py
I also tested with the last two commands and they work properly, but our script train_svhn.py does not:
python3 chainer/train_svhn.py datasets/train_dataextract_train/curriculum.json ./log --char-map datasets/svhn/svhn_char_map.json --blank-label 0 -li 10 -b=8 -g=0 -lr 0.000001 --epochs 10 --lr-step 0
It gives the following error right now:
/usr/lib/python3.5/site-packages/chainer/backends/cuda.py:98: UserWarning: cuDNN is not enabled. Please reinstall CuPy after you install cudnn (see https://docs-cupy.chainer.org/en/stable/install.html#install-cudnn).
'cuDNN is not enabled.\n'
/usr/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py:151: UserWarning: optimizer.eps is changed to 1e-08 by MultiprocessParallelUpdater for new batch size.
format(optimizer.eps))
Exception in main training loop:
Invalid operation is performed in: Reshape (Forward)
Expect: prod(x.shape) % known_size(=32) == 0
Actual: 16 != 0
Traceback (most recent call last):
File "/usr/lib/python3.5/site-packages/chainer/training/trainer.py", line 315, in run
update()
File "/usr/lib/python3.5/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
self.update_core()
File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 235, in update_core
loss = _calc_loss(self._master, batch)
File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 269, in _calc_loss
return model(*in_arrays)
File "/root/see-master/chainer/utils/multi_accuracy_classifier.py", line 45, in __call__
self.loss = self.lossfun(self.y, t)
File "/root/see-master/chainer/metrics/svhn_softmax_metrics.py", line 19, in calc_loss
t = F.reshape(t, (batch_size, self.num_timesteps, -1))
File "/usr/lib/python3.5/site-packages/chainer/functions/array/reshape.py", line 94, in reshape
y, = Reshape(shape).apply((x,))
File "/usr/lib/python3.5/site-packages/chainer/function_node.py", line 245, in apply
self._check_data_type_forward(in_data)
File "/usr/lib/python3.5/site-packages/chainer/function_node.py", line 330, in _check_data_type_forward
self.check_type_forward(in_type)
File "/usr/lib/python3.5/site-packages/chainer/functions/array/reshape.py", line 37, in check_type_forward
type_check.prod(x_type.shape) % size_var == 0)
File "/usr/lib/python3.5/site-packages/chainer/utils/type_check.py", line 546, in expect
expr.expect()
File "/usr/lib/python3.5/site-packages/chainer/utils/type_check.py", line 483, in expect
'{0} {1} {2}'.format(left, self.inv, right))
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "chainer/train_svhn.py", line 258, in
Expect: prod(x.shape) % known_size(=32) == 0
Actual: 16 != 0
Hmm, I'm not sure. First, you need to set the blank label to 10. Further, I think that your definition of the timesteps could be wrong, as the tensor cannot be reshaped correctly.
I set the blank label to 10 but still get the same issue. I also do not understand what the definition of timesteps means.
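The numbers in the Reshape error can be reproduced with plain arithmetic. The sketch below assumes every column after the image path in train.csv is read as a label, and that the first number of the header line is the number of timesteps. Both are readings of this thread rather than confirmed behaviour, and `reshape_remainder` is a hypothetical helper name.

```python
def reshape_remainder(batch_size, num_timesteps, labels_per_row):
    """Remainder left over when a (batch_size, labels_per_row) label array
    is reshaped to (batch_size, num_timesteps, -1)."""
    total = batch_size * labels_per_row   # prod(x.shape) in the error message
    known = batch_size * num_timesteps    # known_size in the error message
    return total % known


# Values from this thread: -b=8, header "4 4", six numbers after each image path.
print(reshape_remainder(8, 4, 6))  # 16, matching "Actual: 16 != 0"
# With a label count that is a multiple of the timesteps, the reshape fits.
print(reshape_remainder(8, 4, 8))  # 0
```

Under these assumptions, known_size(=32) is 8 * 4, and the reshape only succeeds when the number of labels per row is a multiple of the number of timesteps given in the header.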
Yeah, the blank label thing was just something else that I saw... Could you tell me what kind of data you are using? Could you provide your curriculum.json file?
@Bartzi I am only trying to train on our SVHN dataset.
[
{
"train": "/root/see-master/datasets/train_dataextract_train/train.csv",
"validation": "/root/see-master/datasets/train_dataextract_train/train.csv"
}
]
Did you follow step 3 of this subsection in the README?
@Bartzi Yes, I followed that. My train.csv is as follows:
4 4
/root/see-master/datasets/train_dataextract_train/train/0.png 2 7 0 49 46 50
/root/see-master/datasets/train_dataextract_train/train/1.png 6 10 88 59 87 60
/root/see-master/datasets/train_dataextract_train/train/2.png 8 10 32 46 130 48
/root/see-master/datasets/train_dataextract_train/train/3.png 1 10 140 60 0 60
/root/see-master/datasets/train_dataextract_train/train/4.png 3 7 0 44 151 49
/root/see-master/datasets/train_dataextract_train/train/5.png 7 10 102 55 89 60
/root/see-master/datasets/train_dataextract_train/train/6.png 1 2 0 60 97 60
/root/see-master/datasets/train_dataextract_train/train/7.png 1 9 146 54 150 50
/root/see-master/datasets/train_dataextract_train/train/8.png 6 7 0 60 0 60
/root/see-master/datasets/train_dataextract_train/train/9.png 5 10 140 60 140 60
/root/see-master/datasets/train_dataextract_train/train/10.png 3 3 148 52 0 49
/root/see-master/datasets/train_dataextract_train/train/11.png 2 1 144 56 0 54
/root/see-master/datasets/train_dataextract_train/train/12.png 2 1 123 60 140 60
/root/see-master/datasets/train_dataextract_train/train/13.png 4 10 44 48 9 60
/root/see-master/datasets/train_dataextract_train/train/14.png 1 10 34 47 110 60
/root/see-master/datasets/train_dataextract_train/train/15.png 5 10 0 55 16 60
/root/see-master/datasets/train_dataextract_train/train/16.png 8 4 94 60 140 60
/root/see-master/datasets/train_dataextract_train/train/17.png 1 5 144 56 140 60
/root/see-master/datasets/train_dataextract_train/train/18.png 5 9 42 54 150 50
@Bartzi I tried running examples/mnist/train_mnist.py from the chainer GitHub repository and it works. Why does the problem occur only with this code?
Hmm, could be because of the MultiprocessParallelUpdater. You could try to use a different updater.
4 4
/root/see-master/datasets/train_dataextract_train/train/0.png 2 7 0 49 46 50
I'd say your train file is wrong. Could you provide the image that belongs to this label? And also your char_map?
@Bartzi I tried with SingleProcessUpdater as well, but it still gives the same issue. My char_map.json is:
{
"0": 9250,
"1": 49,
"2": 50,
"3": 51,
"4": 52,
"5": 53,
"6": 54,
"7": 55,
"8": 56,
"9": 57,
"10": 48
}
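To see what this map encodes, the values can be read as Unicode code points. That reading is an assumption, although 48 to 57 are the code points of the digits '0' to '9'. Below is a small sketch, with `decode` as a hypothetical helper name.

```python
import json

# The char_map.json content posted above.
CHAR_MAP_JSON = """
{"0": 9250, "1": 49, "2": 50, "3": 51, "4": 52, "5": 53,
 "6": 54, "7": 55, "8": 56, "9": 57, "10": 48}
"""


def decode(labels, char_map):
    """Map a sequence of class indices to characters via the char map."""
    return "".join(chr(char_map[str(label)]) for label in labels)


char_map = json.loads(CHAR_MAP_JSON)
print(decode([2, 7], char_map))   # 27
print(decode([6, 10], char_map))  # 60
```

Read this way, class 10 stands for the digit '0', and class 0 maps to code point 9250 (U+2422, the "blank symbol" character).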
Okay... your GT file looks completely wrong...
char_map
that should not be the case. Could you please provide a sample image? Preferably /root/see-master/datasets/train_dataextract_train/train/0.png, otherwise you'll have to figure out the correct way to label it by yourself...
@Bartzi actually, these are SVHN dataset images.
Hello @Bartzi,
I'm trying to train on the SVHN dataset using the train.csv whose path is defined in curriculum.json. The command for training on the SVHN dataset is:
sudo python3 chainer/train_svhn.py "/root/see-master/datasets/train_dataextract_train/curriculum.json" 0 --char-map /root/see-master/datasets/svhn/svhn_char_map.json --test-image /root/see-master/datasets/test_dataextract/train/0.png -b 60
but I get the following error:
Traceback (most recent call last):
File "chainer/train_svhn.py", line 75, in
train_dataset, validation_dataset = curriculum.load_dataset(0)
File "/root/see-master/chainer/utils/baby_step_curriculum.py", line 38, in load_dataset
train_dataset = self.dataset_class(self.train_curriculum[level], **self.dataset_args)
File "/root/see-master/chainer/datasets/file_dataset.py", line 31, in __init__
self.num_timesteps, self.num_labels = (int(i) for i in next(reader))
File "/root/see-master/chainer/datasets/file_dataset.py", line 31, in
self.num_timesteps, self.num_labels = (int(i) for i in next(reader))
ValueError: invalid literal for int() with base 10: '/root/see-master/datasets/train_dataextract_train/train/0.png'
Please help me resolve the issue.
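The ValueError comes from the line that parses the first row of train.csv as two integers (num_timesteps and num_labels). Here the first row was an image path instead, so the header line appears to be missing. A small pre-flight check along the following lines could catch this before training. It is a sketch under assumptions: `check_train_csv` is a hypothetical helper, the tab delimiter is a guess, and the rule that the label count per row must be a multiple of num_timesteps is inferred from the reshape call in the earlier traceback.

```python
import csv


def check_train_csv(lines, delimiter="\t"):
    """Return a list of problems found in a ground-truth file given as text lines."""
    reader = csv.reader(lines, delimiter=delimiter)
    try:
        header = next(reader)
    except StopIteration:
        return ["file is empty"]
    try:
        # Same parsing as the failing line in file_dataset.py.
        num_timesteps, num_labels = (int(i) for i in header)
    except ValueError:
        return ["first line must be two integers, got: %r" % (header,)]

    problems = []
    for row_number, row in enumerate(reader, start=2):
        labels = row[1:]  # assumption: everything after the path is a label
        if num_timesteps <= 0 or len(labels) % num_timesteps != 0:
            problems.append(
                "line %d: %d labels is not a multiple of num_timesteps=%d"
                % (row_number, len(labels), num_timesteps)
            )
    return problems


# A file without the header line, as in the error above.
print(check_train_csv(["/tmp/0.png\t2\t7\t0\t49\t46\t50"]))
# A header of "4 4" followed by a row carrying six numbers.
print(check_train_csv(["4\t4", "/tmp/0.png\t2\t7\t0\t49\t46\t50"]))
```

For a real file, the open file handle can be passed in place of the list of lines.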