Bartzi / see

Code for the AAAI 2018 publication "SEE: Towards Semi-Supervised End-to-End Scene Text Recognition"
GNU General Public License v3.0

Unable to train SVHN Dataset #65

Open harshalcse opened 5 years ago

harshalcse commented 5 years ago

Hello @Bartzi,

I'm trying to train on the SVHN dataset using the following train.csv, whose path is defined in curriculum.json. The command I use for training on the SVHN dataset is:

sudo python3 chainer/train_svhn.py "/root/see-master/datasets/train_dataextract_train/curriculum.json" 0 --char-map /root/see-master/datasets/svhn/svhn_char_map.json --test-image /root/see-master/datasets/test_dataextract/train/0.png -b 60

but I get the following error:

Traceback (most recent call last):
  File "chainer/train_svhn.py", line 75, in <module>
    train_dataset, validation_dataset = curriculum.load_dataset(0)
  File "/root/see-master/chainer/utils/baby_step_curriculum.py", line 38, in load_dataset
    train_dataset = self.dataset_class(self.train_curriculum[level], **self.dataset_args)
  File "/root/see-master/chainer/datasets/file_dataset.py", line 31, in __init__
    self.num_timesteps, self.num_labels = (int(i) for i in next(reader))
  File "/root/see-master/chainer/datasets/file_dataset.py", line 31, in <genexpr>
    self.num_timesteps, self.num_labels = (int(i) for i in next(reader))
ValueError: invalid literal for int() with base 10: '/root/see-master/datasets/train_dataextract_train/train/0.png'

Could you please help me resolve this issue?

Bartzi commented 5 years ago

See my answer to one of your questions here...

harshalcse commented 5 years ago

@Bartzi Right now I get the error below. I tried to reinstall NCCL, chainer, and cupy, and even tried upgrading setuptools, but it still does not work:

Traceback (most recent call last):
  File "chainer/train_svhn.py", line 146, in <module>
    updater = MultiprocessParallelUpdater(train_iterators, optimizer, devices=args.gpus)
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 124, in __init__
    'NCCL is not enabled. MultiprocessParallelUpdater '
Exception: NCCL is not enabled. MultiprocessParallelUpdater requires NCCL. Please reinstall CuPy after you install NCCL. (see https://docs-cupy.chainer.org/en/latest/install.html)

Please help me with this.

Bartzi commented 5 years ago

Did you try to install cupy with verbose logging on? Did you check that cupy is able to find NCCL? How do you install cupy?
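One quick way to check is a minimal sketch like the following (an assumption on my side: it mirrors the NCCL availability check that MultiprocessParallelUpdater itself performs, so it should fail in the same way if your CuPy was built without NCCL):

# nccl_check.py -- rough check whether the installed CuPy was built with NCCL support.
# Assumption: like chainer's MultiprocessParallelUpdater, CuPy of this era only
# exposes cupy.cuda.nccl when it was compiled against NCCL.
try:
    from cupy.cuda import nccl  # noqa: F401
    print("CuPy was built with NCCL support.")
except ImportError:
    print("CuPy was built WITHOUT NCCL support; install NCCL, then reinstall CuPy from source.")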

harshalcse commented 5 years ago

@Bartzi
Now I tried the following command:

python3 chainer/train_svhn.py datasets/train_dataextract_train/curriculum.json ./log --char-map datasets/svhn/svhn_char_map.json --blank-label 0 -li 10 -b=8 -g=0 -lr 0.000001 --epochs 10 --lr-step 0

but the following error comes up:

/usr/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py:151: UserWarning: optimizer.eps is changed to 1e-08 by MultiprocessParallelUpdater for new batch size.
  format(optimizer.eps))
Exception in main training loop:
Invalid operation is performed in: Concat (Forward)

Expect: in_types.size > 0
Actual: 0 <= 0
Traceback (most recent call last):
  File "/usr/lib/python3.5/site-packages/chainer/training/trainer.py", line 315, in run
    update()
  File "/usr/lib/python3.5/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
    self.update_core()
  File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 235, in update_core
    loss = _calc_loss(self._master, batch)
  File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 269, in _calc_loss
    return model(*in_arrays)
  File "/root/see-master/chainer/utils/multi_accuracy_classifier.py", line 44, in __call__
    self.y = self.predictor(*x)
  File "/root/see-master/chainer/models/svhn.py", line 214, in __call__
    return self.recognition_net(images, h)
  File "/root/see-master/chainer/models/svhn.py", line 138, in __call__
    final_lstm_predictions = F.concat(final_lstm_predictions, axis=1)
  File "/usr/lib/python3.5/site-packages/chainer/functions/array/concat.py", line 104, in concat
    y, = Concat(axis).apply(xs)
  File "/usr/lib/python3.5/site-packages/chainer/function_node.py", line 245, in apply
    self._check_data_type_forward(in_data)
  File "/usr/lib/python3.5/site-packages/chainer/function_node.py", line 330, in _check_data_type_forward
    self.check_type_forward(in_type)
  File "/usr/lib/python3.5/site-packages/chainer/functions/array/concat.py", line 23, in check_type_forward
    type_check.expect(in_types.size() > 0)
  File "/usr/lib/python3.5/site-packages/chainer/utils/type_check.py", line 546, in expect
    expr.expect()
  File "/usr/lib/python3.5/site-packages/chainer/utils/type_check.py", line 483, in expect
    '{0} {1} {2}'.format(left, self.inv, right))
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "chainer/train_svhn.py", line 258, in <module>
    trainer.run()
  File "/usr/lib/python3.5/site-packages/chainer/training/trainer.py", line 329, in run
    six.reraise(*sys.exc_info())
  File "/usr/lib/python3.5/site-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3.5/site-packages/chainer/training/trainer.py", line 315, in run
    update()
  File "/usr/lib/python3.5/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
    self.update_core()
  File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 235, in update_core
    loss = _calc_loss(self._master, batch)
  File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 269, in _calc_loss
    return model(*in_arrays)
  File "/root/see-master/chainer/utils/multi_accuracy_classifier.py", line 44, in __call__
    self.y = self.predictor(*x)
  File "/root/see-master/chainer/models/svhn.py", line 214, in __call__
    return self.recognition_net(images, h)
  File "/root/see-master/chainer/models/svhn.py", line 138, in __call__
    final_lstm_predictions = F.concat(final_lstm_predictions, axis=1)
  File "/usr/lib/python3.5/site-packages/chainer/functions/array/concat.py", line 104, in concat
    y, = Concat(axis).apply(xs)
  File "/usr/lib/python3.5/site-packages/chainer/function_node.py", line 245, in apply
    self._check_data_type_forward(in_data)
  File "/usr/lib/python3.5/site-packages/chainer/function_node.py", line 330, in _check_data_type_forward
    self.check_type_forward(in_type)
  File "/usr/lib/python3.5/site-packages/chainer/functions/array/concat.py", line 23, in check_type_forward
    type_check.expect(in_types.size() > 0)
  File "/usr/lib/python3.5/site-packages/chainer/utils/type_check.py", line 546, in expect
    expr.expect()
  File "/usr/lib/python3.5/site-packages/chainer/utils/type_check.py", line 483, in expect
    '{0} {1} {2}'.format(left, self.inv, right))
chainer.utils.type_check.InvalidType:
Invalid operation is performed in: Concat (Forward)

Expect: in_types.size > 0
Actual: 0 <= 0

I am using CUDA 9.0. Please help me; I have been stuck on this since last week.

harshalcse commented 5 years ago

@Bartzi

Now I get the following error:

Exception in main training loop: cudaErrorNoDevice: no CUDA-capable device is detected
Traceback (most recent call last):
  File "/root/.see-master/lib/python3.5/site-packages/chainer/training/trainer.py", line 302, in run
    entry.extension(self)
  File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/root/.see-master/lib/python3.5/site-packages/chainer/reporter.py", line 98, in scope
    yield
  File "/root/.see-master/lib/python3.5/site-packages/chainer/training/trainer.py", line 299, in run
    update()
  File "/root/.see-master/lib/python3.5/site-packages/chainer/training/updater.py", line 223, in update
    self.update_core()
  File "/root/.see-master/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 195, in update_core
    self.setup_workers()
  File "/root/.see-master/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 186, in setup_workers
    with cuda.Device(self._devices[0]):
  File "cupy/cuda/device.pyx", line 106, in cupy.cuda.device.Device.__enter__
  File "cupy/cuda/runtime.pyx", line 164, in cupy.cuda.runtime.getDevice
  File "cupy/cuda/runtime.pyx", line 136, in cupy.cuda.runtime.check_status
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "chainer/train_svhn.py", line 258, in <module>
    trainer.run()
  File "/root/.see-master/lib/python3.5/site-packages/chainer/training/trainer.py", line 313, in run
    six.reraise(*sys.exc_info())
  File "/usr/lib/python3.5/site-packages/six.py", line 693, in reraise
    raise value
  File "/root/.see-master/lib/python3.5/site-packages/chainer/training/trainer.py", line 302, in run
    entry.extension(self)
  File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/root/.see-master/lib/python3.5/site-packages/chainer/reporter.py", line 98, in scope
    yield
  File "/root/.see-master/lib/python3.5/site-packages/chainer/training/trainer.py", line 299, in run
    update()
  File "/root/.see-master/lib/python3.5/site-packages/chainer/training/updater.py", line 223, in update
    self.update_core()
  File "/root/.see-master/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 195, in update_core
    self.setup_workers()
  File "/root/.see-master/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 186, in setup_workers
    with cuda.Device(self._devices[0]):
  File "cupy/cuda/device.pyx", line 106, in cupy.cuda.device.Device.__enter__
  File "cupy/cuda/runtime.pyx", line 164, in cupy.cuda.runtime.getDevice
  File "cupy/cuda/runtime.pyx", line 136, in cupy.cuda.runtime.check_status
cupy.cuda.runtime.CUDARuntimeError: cudaErrorNoDevice: no CUDA-capable device is detected

Please help me.

Bartzi commented 5 years ago

I don't know what to say except that chainer can not find your GPU. Have you tried turning it off and on again?
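As a first check, a minimal sketch like this should tell you whether CuPy can see a GPU at all (assumption: it uses the same CUDA runtime call that fails in your traceback, so it should reproduce the cudaErrorNoDevice independently of this repository's code):

# gpu_check.py -- check whether CuPy can see any CUDA device.
import cupy

try:
    # Ask the CUDA runtime how many devices are visible to this process.
    device_count = cupy.cuda.runtime.getDeviceCount()
    print("CUDA devices visible to CuPy:", device_count)
except cupy.cuda.runtime.CUDARuntimeError as error:
    # This is the same cudaErrorNoDevice failure the trainer runs into.
    print("CuPy cannot see a GPU:", error)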

harshalcse commented 5 years ago

@Bartzi

I tried that. I am using an AWS p3.2xlarge instance with the following GPU/CUDA details:

root@awsml04:~# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

nvidia-smi

The default CUDA version is 9.0, set up with:

export CUDA_PATH=/usr/local/cuda/bin${PATH:+:${CUDA_PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

Installed versions:

chainer==3.2.0
chainerui==0.3.0
cupy==5.3.0

I also tried rebooting the server, but the issue still persists. Please help.

Bartzi commented 5 years ago

Hmm, I'm not sure, but chainer==3.2.0 and cupy==5.3.0 does not sound like a good combination. You should use a chainer version that matches the cupy version (i.e. the newer one). Apart from that, I don't know. How did you install cupy?
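To see which versions your python3 actually imports, a minimal sketch like this can help (assumption on my side: since Chainer v4, the Chainer and CuPy major versions are expected to match, e.g. chainer 5.x with cupy 5.x or cupy-cuda90 5.x):

# version_check.py -- print the chainer / cupy versions that python3 actually imports.
import chainer
import cupy

print("chainer:", chainer.__version__)
print("cupy:", cupy.__version__)
# True only if CuPy is installed and a CUDA device is usable.
print("cuda available:", chainer.cuda.available)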

harshalcse commented 5 years ago

@Bartzi I tried with cupy-cuda90, but I still get the same error.

Bartzi commented 5 years ago

Did you try installing cupy with pip install cupy, letting the system compile it for you?

harshalcse commented 5 years ago

@Bartzi Yes, I already tried that, and the issue still persists. Please help.

Bartzi commented 5 years ago

To be honest, I don't know how to solve this issue. Did you try it with some of the chainer examples? If they do not work either, it might be a good idea to ask the developers of the framework...

harshalcse commented 5 years ago

@Bartzi I have already posted the complete scenario on the chainer/cupy GitHub, Stack Overflow, and the Google developers group, but there is still no resolution:

  1. https://github.com/cupy/cupy/issues/2106
  2. https://stackoverflow.com/questions/55218824/cuda-runtime-error-cudaerrornodevice-no-cuda-capable-device-is-detected
  3. https://groups.google.com/forum/#!msg/cupy/0jJ-Gc39gTQ/lQmoLrC6CwAJ
harshalcse commented 5 years ago

I applied the following commands:

$ export CUDA_PATH=/usr/local/cuda-9.0
$ export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64
$ pip3 uninstall -y chainer cupy cupy-cuda80 cupy-cuda90 cupy-cuda92
$ pip3 install cupy-cuda90 --no-cache-dir && pip3 install chainer --no-cache-dir
$ git clone https://github.com/chainer/chainer.git && cd chainer && git checkout v5.3.0
$ python3 examples/mnist/train_mnist.py

The last two commands work properly as a test, but our script train_svhn.py fails. I run it as:

python3 chainer/train_svhn.py datasets/train_dataextract_train/curriculum.json ./log --char-map datasets/svhn/svhn_char_map.json --blank-label 0 -li 10 -b=8 -g=0 -lr 0.000001 --epochs 10 --lr-step 0

It gives the following error right now:

/usr/lib/python3.5/site-packages/chainer/backends/cuda.py:98: UserWarning: cuDNN is not enabled.
Please reinstall CuPy after you install cudnn
(see https://docs-cupy.chainer.org/en/stable/install.html#install-cudnn).
  'cuDNN is not enabled.\n'
/usr/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py:151: UserWarning: optimizer.eps is changed to 1e-08 by MultiprocessParallelUpdater for new batch size.
  format(optimizer.eps))
Exception in main training loop:
Invalid operation is performed in: Reshape (Forward)

Expect: prod(x.shape) % known_size(=32) == 0
Actual: 16 != 0
Traceback (most recent call last):
  File "/usr/lib/python3.5/site-packages/chainer/training/trainer.py", line 315, in run
    update()
  File "/usr/lib/python3.5/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
    self.update_core()
  File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 235, in update_core
    loss = _calc_loss(self._master, batch)
  File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 269, in _calc_loss
    return model(*in_arrays)
  File "/root/see-master/chainer/utils/multi_accuracy_classifier.py", line 45, in __call__
    self.loss = self.lossfun(self.y, t)
  File "/root/see-master/chainer/metrics/svhn_softmax_metrics.py", line 19, in calc_loss
    t = F.reshape(t, (batch_size, self.num_timesteps, -1))
  File "/usr/lib/python3.5/site-packages/chainer/functions/array/reshape.py", line 94, in reshape
    y, = Reshape(shape).apply((x,))
  File "/usr/lib/python3.5/site-packages/chainer/function_node.py", line 245, in apply
    self._check_data_type_forward(in_data)
  File "/usr/lib/python3.5/site-packages/chainer/function_node.py", line 330, in _check_data_type_forward
    self.check_type_forward(in_type)
  File "/usr/lib/python3.5/site-packages/chainer/functions/array/reshape.py", line 37, in check_type_forward
    type_check.prod(x_type.shape) % size_var == 0)
  File "/usr/lib/python3.5/site-packages/chainer/utils/type_check.py", line 546, in expect
    expr.expect()
  File "/usr/lib/python3.5/site-packages/chainer/utils/type_check.py", line 483, in expect
    '{0} {1} {2}'.format(left, self.inv, right))
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "chainer/train_svhn.py", line 258, in <module>
    trainer.run()
  File "/usr/lib/python3.5/site-packages/chainer/training/trainer.py", line 329, in run
    six.reraise(*sys.exc_info())
  File "/usr/lib/python3.5/site-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3.5/site-packages/chainer/training/trainer.py", line 315, in run
    update()
  File "/usr/lib/python3.5/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
    self.update_core()
  File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 235, in update_core
    loss = _calc_loss(self._master, batch)
  File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 269, in _calc_loss
    return model(*in_arrays)
  File "/root/see-master/chainer/utils/multi_accuracy_classifier.py", line 45, in __call__
    self.loss = self.lossfun(self.y, t)
  File "/root/see-master/chainer/metrics/svhn_softmax_metrics.py", line 19, in calc_loss
    t = F.reshape(t, (batch_size, self.num_timesteps, -1))
  File "/usr/lib/python3.5/site-packages/chainer/functions/array/reshape.py", line 94, in reshape
    y, = Reshape(shape).apply((x,))
  File "/usr/lib/python3.5/site-packages/chainer/function_node.py", line 245, in apply
    self._check_data_type_forward(in_data)
  File "/usr/lib/python3.5/site-packages/chainer/function_node.py", line 330, in _check_data_type_forward
    self.check_type_forward(in_type)
  File "/usr/lib/python3.5/site-packages/chainer/functions/array/reshape.py", line 37, in check_type_forward
    type_check.prod(x_type.shape) % size_var == 0)
  File "/usr/lib/python3.5/site-packages/chainer/utils/type_check.py", line 546, in expect
    expr.expect()
  File "/usr/lib/python3.5/site-packages/chainer/utils/type_check.py", line 483, in expect
    '{0} {1} {2}'.format(left, self.inv, right))
chainer.utils.type_check.InvalidType:
Invalid operation is performed in: Reshape (Forward)

Expect: prod(x.shape) % known_size(=32) == 0
Actual: 16 != 0

Bartzi commented 5 years ago

Hmm, I'm not sure. First, you need to set the blank label to 10. Furthermore, I think that your definition of the timesteps could be wrong, as the tensor cannot be reshaped correctly.

harshalcse commented 5 years ago

Hmm, I'm not sure. First, you need to set the blank label to 10. Furthermore, I think that your definition of the timesteps could be wrong, as the tensor cannot be reshaped correctly.

I set the blank label to 10, but I still get the same issue. However, I don't understand what you mean by the definition of the timesteps.

Bartzi commented 5 years ago

yeah, the blank label thing was just something else that I saw... Could you tell me what kind of data you are using? Could you provide your curriculum.json file?

harshalcse commented 5 years ago

@Bartzi I am only trying to train on our SVHN dataset. My curriculum.json is:

[
        {
                "train": "/root/see-master/datasets/train_dataextract_train/train.csv",
                "validation": "/root/see-master/datasets/train_dataextract_train/train.csv"
        }
]
Bartzi commented 5 years ago

Did you follow step 3 of this subsection in the README?

harshalcse commented 5 years ago

Did you follow step 3 of this subsection in the README?

@Bartzi Yes, I followed that step. My train.csv is as follows:

4    4
/root/see-master/datasets/train_dataextract_train/train/0.png   2       7       0       49      46      50
/root/see-master/datasets/train_dataextract_train/train/1.png   6       10      88      59      87      60
/root/see-master/datasets/train_dataextract_train/train/2.png   8       10      32      46      130     48
/root/see-master/datasets/train_dataextract_train/train/3.png   1       10      140     60      0       60
/root/see-master/datasets/train_dataextract_train/train/4.png   3       7       0       44      151     49
/root/see-master/datasets/train_dataextract_train/train/5.png   7       10      102     55      89      60
/root/see-master/datasets/train_dataextract_train/train/6.png   1       2       0       60      97      60
/root/see-master/datasets/train_dataextract_train/train/7.png   1       9       146     54      150     50
/root/see-master/datasets/train_dataextract_train/train/8.png   6       7       0       60      0       60
/root/see-master/datasets/train_dataextract_train/train/9.png   5       10      140     60      140     60
/root/see-master/datasets/train_dataextract_train/train/10.png  3       3       148     52      0       49
/root/see-master/datasets/train_dataextract_train/train/11.png  2       1       144     56      0       54
/root/see-master/datasets/train_dataextract_train/train/12.png  2       1       123     60      140     60
/root/see-master/datasets/train_dataextract_train/train/13.png  4       10      44      48      9       60
/root/see-master/datasets/train_dataextract_train/train/14.png  1       10      34      47      110     60
/root/see-master/datasets/train_dataextract_train/train/15.png  5       10      0       55      16      60
/root/see-master/datasets/train_dataextract_train/train/16.png  8       4       94      60      140     60
/root/see-master/datasets/train_dataextract_train/train/17.png  1       5       144     56      140     60
/root/see-master/datasets/train_dataextract_train/train/18.png  5       9       42      54      150     50
harshalcse commented 5 years ago

@Bartzi Sir, I tried to run examples/mnist/train_mnist.py from the chainer GitHub repository and it works there, so why is there a problem only with this code?

Bartzi commented 5 years ago

Hmm, it could be because of the MultiprocessParallelUpdater; you could try to use a different updater.
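Something like this minimal sketch is what I mean (an assumption on my side: train_iterators, optimizer, and args are the objects that train_svhn.py already builds right before the updater is constructed; this just swaps in a single-process, single-GPU updater to rule the multi-process one out):

# Sketch: replace MultiprocessParallelUpdater with chainer's StandardUpdater.
# Assumption: train_iterators, optimizer and args already exist as in train_svhn.py.
from chainer.training import StandardUpdater

# Use only the first iterator and the first GPU instead of spawning worker processes.
updater = StandardUpdater(train_iterators[0], optimizer, device=args.gpus[0])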

4    4
/root/see-master/datasets/train_dataextract_train/train/0.png   2       7       0       49      46      50

I'd say your train file is wrong. Could you provide the image that belongs to this label? And also your char_map?

harshalcse commented 5 years ago

@Bartzi I tried with a SingleProcessUpdater as well, but it still gives the same issue. My char_map.json is:

{
    "0": 9250,
    "1": 49,
    "2": 50,
    "3": 51,
    "4": 52,
    "5": 53,
    "6": 54,
    "7": 55,
    "8": 56,
    "9": 57,
    "10": 48
}
Bartzi commented 5 years ago

Okay... your GT file looks completely wrong...

  1. I think either the first line is not correct, or your label padding is not correct, and that leads to your latest problem.
  2. The labels you are using for your images are not contained in the char_map; that should not be the case.

Could you please provide a sample image? Preferably /root/see-master/datasets/train_dataextract_train/train/0.png; otherwise you'll have to figure out the correct way to label it by yourself...
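To make those two points concrete, here is a rough consistency check you can run on your GT file (assumptions on my side, not taken verbatim from the README: the first row holds num_timesteps and num_labels, every following row holds an image path plus exactly num_timesteps * num_labels class indices, every class index must be a key of the char map, and the file is tab-separated; the two paths are the ones from this thread):

# validate_gt.py -- rough consistency check for a SEE-style ground-truth file.
# Assumptions (not verbatim from the repo): first row = num_timesteps and num_labels,
# each following row = image path + num_timesteps * num_labels class indices,
# every class index must be a key of the char map, and the file is tab-separated.
import csv
import json

gt_file = "/root/see-master/datasets/train_dataextract_train/train.csv"
char_map_file = "/root/see-master/datasets/svhn/svhn_char_map.json"

with open(char_map_file) as f:
    valid_classes = set(json.load(f).keys())

with open(gt_file) as f:
    reader = csv.reader(f, delimiter='\t')
    num_timesteps, num_labels = (int(i) for i in next(reader))
    expected_labels = num_timesteps * num_labels

    for line_number, row in enumerate(reader, start=2):
        labels = row[1:]
        if len(labels) != expected_labels:
            print("line {}: found {} labels, expected {}".format(line_number, len(labels), expected_labels))
        unknown = [label for label in labels if label not in valid_classes]
        if unknown:
            print("line {}: labels not in char map: {}".format(line_number, unknown))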

harshalcse commented 5 years ago

@Bartzi Actually, these are SVHN dataset images.