Error training FSNS - Githubissues

anavc94 commented 6 years ago

Hello,

I'm trying to train the network but I'm running into problems. I'm using only some of the pictures from the FSNS dataset, not all of them, because I just want to learn as quickly as possible how to train the network so that I can then apply it to my own data.

When I execute:

python train_fsns.py --char-map ..\datasets\fsns\fsns_char_map.json ..\datasets\fsns\curriculum.json .\logs\ --blank-label 0 -b 32 --gpu 0

I got an output like this:

...
...
could not load image: .\data_image\test\test\00000\0.png
could not load image: .\data_image\test\test\00000\25.png
could not load image: .\data_image\test\test\00000\38.png
could not load image: .\data_image\test\test\00000\14.png
could not load image: .\data_image\test\test\00000\35.png
could not load image: .\data_image\test\test\00000\47.png
could not load image: .\data_image\test\test\00000\38.png
could not load image: .\data_image\test\test\00000\17.png
could not load image: .\data_image\test\test\00000\2.png
...
...

I've placed the folders correctly:

chainer
    data_image
        train
        validation
        test
            test
                00000
                ...
    train_fsns.py

Moreover, I've made some changes to train_fsns.py. I've replaced MultiprocessParallelUpdater with StandardUpdater because NCCL was giving me errors (I'm on Windows) and I only have one GPU.

I've changed:

line 11 to: from chainer.training import StandardUpdater
line 172 to: updater = StandardUpdater(train_iterators, optimizer)
and commented out lines 222 and 236.

Is it correct?

Thank you so much and congrats on the project!

Bartzi commented 6 years ago

Hi,

I think your changes are correct, but your file with all the image paths isn't. This file should contain the absolute path to each image, not relative paths, which is what your file seems to contain.

It should work once you change that.
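A minimal sketch of the path fix described above, assuming (this is not confirmed in the thread) that each line of the ground-truth file starts with the image path, followed by tab-separated labels. The helper name `absolutize_line` and the example line are hypothetical.

```python
import os


def absolutize_line(line, base_dir):
    """Return the line with its first tab-separated field made absolute.

    Assumption: the first field is the image path and the remaining
    fields are labels, which are passed through unchanged.
    """
    path, sep, rest = line.rstrip("\n").partition("\t")
    if not os.path.isabs(path):
        # Resolve the relative path against the directory it was written for.
        path = os.path.abspath(os.path.join(base_dir, path))
    return path + sep + rest


# Hypothetical example line: a relative image path and three label ids.
fixed = absolutize_line("test/00000/0.png\t12\t7\t0", os.getcwd())
print(fixed)
```

Applying such a helper to every line of the file and writing the result back would leave the labels untouched and only change the path column.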

anavc94 commented 6 years ago

Okay, I've just noticed that error! I was doing it with relative paths...

Thank you so much for your help!

anavc94 commented 6 years ago

Hello,

After those changes, when I execute python train_fsns.py as above, it gives me two errors:

C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\h5py\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
  from ._conv import register_converters as _register_converters
Exception in main training loop: list indices must be integers or slices, not str
Traceback (most recent call last):
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\training\trainer.py", line 296, in run
    while not stop_trigger(self):
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\training\triggers\interval_trigger.py", line 51, in __call__
    epoch_detail = updater.epoch_detail
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\training\updater.py", line 159, in epoch_detail
    return self._iterators['main'].epoch_detail
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\training\trainer.py", line 313, in run
    six.reraise(*sys.exc_info())
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\six.py", line 693, in reraise
    raise value
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\training\trainer.py", line 296, in run
    while not stop_trigger(self):
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\training\triggers\interval_trigger.py", line 51, in __call__
    epoch_detail = updater.epoch_detail
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\training\updater.py", line 159, in epoch_detail
    return self._iterators['main'].epoch_detail
TypeError: list indices must be integers or slices, not str

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_fsns.py", line 296, in <module>
    trainer.run()
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\training\trainer.py", line 319, in run
    self.updater.finalize()
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\training\updater.py", line 177, in finalize
    for iterator in six.itervalues(self._iterators):
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\six.py", line 584, in itervalues
    return iter(d.values(**kw))
AttributeError: 'list' object has no attribute 'values'

I don't know exactly how to fix it. These look like errors inside the chainer package, but I have no idea. Do you know what could be happening?

Thnx

Bartzi commented 6 years ago

Yes, the problem is that you changed from the MultiprocessParallelUpdater to StandardUpdater and did not correctly initialize the StandardUpdater. You should check that...
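Both exceptions above point at the same cause: the updater ended up holding a list where it expects a mapping with a 'main' entry. Chainer's StandardUpdater accepts either a single iterator or a dict of iterators, while the changed line 172 passes train_iterators, which (judging by the error message) is a list. The pure-Python sketch below imitates that behaviour with hypothetical stand-in classes FakeIterator and TinyUpdater; it is not Chainer's code, only an illustration of why a list fails and a single iterator works.

```python
class FakeIterator:
    """Hypothetical stand-in for a Chainer iterator."""
    epoch_detail = 0.0


class TinyUpdater:
    """Simplified imitation of how an updater stores its iterators."""

    def __init__(self, iterator):
        # A single iterator is wrapped so that it is reachable as 'main';
        # anything else is stored as given and assumed to be a dict.
        if isinstance(iterator, FakeIterator):
            iterator = {'main': iterator}
        self._iterators = iterator

    @property
    def epoch_detail(self):
        return self._iterators['main'].epoch_detail


train_iterators = [FakeIterator()]  # a list, as in the multi-GPU script

try:
    TinyUpdater(train_iterators).epoch_detail
    error = None
except TypeError as exc:
    error = str(exc)
print(error)

# Passing one iterator (or {'main': iterator}) avoids the problem.
ok = TinyUpdater(train_iterators[0]).epoch_detail
print(ok)
```

With the real library, the analogous change would be to pass a single iterator, for example train_iterators[0], together with the GPU id via the device argument of StandardUpdater. Whether that matches the rest of train_fsns.py is not confirmed in this thread.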

anavc94 commented 6 years ago

Hello again, Christian

I think I've been able to correct the last error. However, I am getting this output from the command line:

Exception in main training loop: out of memory to allocate 747110400 bytes (total 5722942464 bytes)
Traceback (most recent call last):
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\training\trainer.py", line 299, in run
    update()
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\training\updater.py", line 223, in update
    self.update_core()
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\training\updater.py", line 234, in update_core
    optimizer.update(loss_func, *in_arrays)
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\optimizer.py", line 541, in update
    loss = lossfun(*args, **kwds)
  File "E:\ANA\OCR_ID_DORSAL\see-master\chainer\utils\multi_accuracy_classifier.py", line 44, in __call__
    self.y = self.predictor(*x)
  File "E:\ANA\OCR_ID_DORSAL\see-master\chainer\models\fsns.py", line 525, in __call__
    return self.recognition_net(images, h)
  File "E:\ANA\OCR_ID_DORSAL\see-master\chainer\models\fsns_resnet.py", line 44, in __call__
    h = self.resnet(rois, layers=['res5', 'pool5'])
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\links\model\vision\resnet.py", line 195, in __call__
    h = func(h)
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\links\normalization\batch_normalization.py", line 144, in __call__
    running_var=self.avg_var, decay=decay)
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\functions\normalization\batch_normalization.py", line 545, in batch_normalization
    (x, gamma, beta))[0]
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\function_node.py", line 245, in apply
    outputs = self.forward(in_data)
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\functions\normalization\batch_normalization.py", line 95, in forward
    y = cuda.cupy.empty_like(x)
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\cupy\creation\basic.py", line 41, in empty_like
    return cupy.ndarray(a.shape, dtype=dtype)
  File "cupy/core/core.pyx", line 93, in cupy.core.core.ndarray.__init__
  File "cupy/cuda/memory.pyx", line 392, in cupy.cuda.memory.alloc
  File "cupy/cuda/memory.pyx", line 807, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 828, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 591, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 639, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "cupy/cuda/memory.pyx", line 621, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
  File "cupy/cuda/memory.pyx", line 561, in cupy.cuda.memory.SingleDeviceMemoryPool._alloc
  File "cupy/cuda/memory.pyx", line 347, in cupy.cuda.memory._malloc
  File "cupy/cuda/memory.pyx", line 348, in cupy.cuda.memory._malloc
  File "cupy/cuda/memory.pyx", line 45, in cupy.cuda.memory.Memory.__init__
  File "cupy/cuda/runtime.pyx", line 213, in cupy.cuda.runtime.malloc
  File "cupy/cuda/runtime.pyx", line 136, in cupy.cuda.runtime.check_status
cupy.cuda.runtime.CUDARuntimeError: cudaErrorMemoryAllocation: out of memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cupy/cuda/memory.pyx", line 627, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
  File "cupy/cuda/memory.pyx", line 561, in cupy.cuda.memory.SingleDeviceMemoryPool._alloc
  File "cupy/cuda/memory.pyx", line 347, in cupy.cuda.memory._malloc
  File "cupy/cuda/memory.pyx", line 348, in cupy.cuda.memory._malloc
  File "cupy/cuda/memory.pyx", line 45, in cupy.cuda.memory.Memory.__init__
  File "cupy/cuda/runtime.pyx", line 213, in cupy.cuda.runtime.malloc
  File "cupy/cuda/runtime.pyx", line 136, in cupy.cuda.runtime.check_status
cupy.cuda.runtime.CUDARuntimeError: cudaErrorMemoryAllocation: out of memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cupy/cuda/memory.pyx", line 633, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
  File "cupy/cuda/memory.pyx", line 561, in cupy.cuda.memory.SingleDeviceMemoryPool._alloc
  File "cupy/cuda/memory.pyx", line 347, in cupy.cuda.memory._malloc
  File "cupy/cuda/memory.pyx", line 348, in cupy.cuda.memory._malloc
  File "cupy/cuda/memory.pyx", line 45, in cupy.cuda.memory.Memory.__init__
  File "cupy/cuda/runtime.pyx", line 213, in cupy.cuda.runtime.malloc
  File "cupy/cuda/runtime.pyx", line 136, in cupy.cuda.runtime.check_status
cupy.cuda.runtime.CUDARuntimeError: cudaErrorMemoryAllocation: out of memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_fsns.py", line 302, in <module>
    trainer.run()
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\training\trainer.py", line 313, in run
    six.reraise(*sys.exc_info())
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\six.py", line 693, in reraise
    raise value
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\training\trainer.py", line 299, in run
    update()
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\training\updater.py", line 223, in update
    self.update_core()
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\training\updater.py", line 234, in update_core
    optimizer.update(loss_func, *in_arrays)
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\optimizer.py", line 541, in update
    loss = lossfun(*args, **kwds)
  File "E:\ANA\OCR_ID_DORSAL\see-master\chainer\utils\multi_accuracy_classifier.py", line 44, in __call__
    self.y = self.predictor(*x)
  File "E:\ANA\OCR_ID_DORSAL\see-master\chainer\models\fsns.py", line 525, in __call__
    return self.recognition_net(images, h)
  File "E:\ANA\OCR_ID_DORSAL\see-master\chainer\models\fsns_resnet.py", line 44, in __call__
    h = self.resnet(rois, layers=['res5', 'pool5'])
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\links\model\vision\resnet.py", line 195, in __call__
    h = func(h)
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\links\normalization\batch_normalization.py", line 144, in __call__
    running_var=self.avg_var, decay=decay)
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\functions\normalization\batch_normalization.py", line 545, in batch_normalization
    (x, gamma, beta))[0]
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\function_node.py", line 245, in apply
    outputs = self.forward(in_data)
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\functions\normalization\batch_normalization.py", line 95, in forward
    y = cuda.cupy.empty_like(x)
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\cupy\creation\basic.py", line 41, in empty_like
    return cupy.ndarray(a.shape, dtype=dtype)
  File "cupy/core/core.pyx", line 93, in cupy.core.core.ndarray.__init__
  File "cupy/cuda/memory.pyx", line 392, in cupy.cuda.memory.alloc
  File "cupy/cuda/memory.pyx", line 807, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 828, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 591, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 639, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
cupy.cuda.memory.OutOfMemoryError: out of memory to allocate 747110400 bytes (total 5722942464 bytes)

Is it because of my GPU? It seems to start training, but after three or four seconds it gives this error. I am using a GeForce GTX 1060 6GB with CUDA v9 and cuDNN v7.

Thanks a lot

Bartzi commented 6 years ago

If you look closely enough, you will see that your GPU does not have enough memory for what you are trying to do. You could try to decrease the batch size.
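The numbers in the error message can be converted to more readable units. The short Python sketch below only does arithmetic on the two values from the log. The linear scaling with batch size at the end is an assumption: it is reasonable for activation buffers, but weights and optimizer state do not shrink with the batch.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

requested = 747110400   # bytes the failing allocation asked for (from the log)
total = 5722942464      # "total" bytes reported in the same message

requested_mib = requested / MIB
total_gib = total / GIB
print(f"requested: {requested_mib:.1f} MiB")
print(f"total:     {total_gib:.2f} GiB")

# Assumption: this one buffer scales linearly with the batch size.
# Going from -b 32 to a smaller value such as 6 would shrink it accordingly.
scaled = requested * 6 // 32
print(f"same buffer at batch size 6: {scaled / MIB:.1f} MiB")
```

A single buffer of roughly 700 MiB on top of a pool that is already above 5 GiB leaves very little headroom on a 6GB card, which is why a smaller batch is the first thing to try.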

anavc94 commented 6 years ago

I've tried decreasing the batch size down to 6, a low value, but I'm still getting the out-of-memory error. I've monitored GPU usage and yes, it seems to need more than my 6GB of memory... Should I try decreasing the input image size? Maybe I should consider trying another GPU... it's not the first time I've had this trouble. Thanks a lot again!

Bartzi commented 6 years ago

Yes, you can also try decreasing the size of your input images, and also the number of timesteps the localization network runs. This could help too. Whether you can or want to do this depends on your data.

anavc94 commented 6 years ago

I've tried both, and when decreasing the size of the input images to 200x50 the training seems to start, but I get this:

  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\functions\connection\convolution_2d.py", line 109, in forward_cpu
    .format(type(W), type(x), type(b)))
ValueError: numpy and cupy must not be used together
type(W): <class 'cupy.core.core.ndarray'>, type(x): <class 'numpy.ndarray'>, type(b): <class 'cupy.core.core.ndarray'>

My versions of numpy and cupy are the ones from requirements.txt. I have read another issue about this problem and I think I have to copy the model to the GPU. In train_fsns.py, would that be something like writing "model.to_gpu(0)" before calling trainer.run()?
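The error message itself says which arrays live where. In a convolution call, W and b are the layer's parameters and x is the input batch. The small sketch below parses the message quoted above and labels each array by the library that owns it; the helper name locate_arrays is hypothetical.

```python
import re

# Error text copied from the output above.
MESSAGE = (
    "numpy and cupy must not be used together\n"
    "type(W): <class 'cupy.core.core.ndarray'>, "
    "type(x): <class 'numpy.ndarray'>, "
    "type(b): <class 'cupy.core.core.ndarray'>"
)


def locate_arrays(message):
    """Map each array name in the message to 'gpu' (cupy) or 'cpu' (numpy)."""
    pairs = re.findall(r"type\((\w+)\): <class '([\w.]+)'>", message)
    return {name: "gpu" if cls.startswith("cupy") else "cpu"
            for name, cls in pairs}


placement = locate_arrays(MESSAGE)
print(placement)
```

Here the parameters are already cupy arrays, so the model is on the GPU; it is the input x that is still a numpy array. That suggests looking at how the batches are moved to the device rather than at the model, although the thread does not confirm the exact cause.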

As I've read that you cannot reproduce this error on your machine, the complete output is:


E:\ANA\OCR_ID_DORSAL\see-master\chainer>python train_fsns.py --char-map ..\datasets\dorsales_sizereduced\fsns_char_map.json ..\datasets\dorsales_sizereduced\curriculum.json .\logs_dorsales_reduced\ --blank-label 0 -b 6 --gpu 0 --timesteps 1
epoch       iteration   main/loss   main/accuracy  lr          fast_validation/main/loss  fast_validation/main/accuracy  validation/main/loss  validation/main/accuracy
Exception in main training loop: numpy and cupy must not be used together
type(W): <class 'cupy.core.core.ndarray'>, type(x): <class 'numpy.ndarray'>, type(b): <class 'cupy.core.core.ndarray'>
Traceback (most recent call last):
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\training\trainer.py", line 302, in run
    entry.extension(self)
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\training\extensions\evaluator.py", line 137, in __call__
    result = self.evaluate()
  File "C:\Users\EtiqmediaServer\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\training\extensions\evaluator.py", line 184, in evaluate
    eval_func(*in_arrays)
  File "E:\ANA\OCR_ID_DORSAL\see-master\chainer\utils\multi_accuracy_classifier.py", line 44, in __call__
    self.y = self.predictor(*x)
  File "E:\ANA\OCR_ID_DORSAL\see-master\chainer\models\fsns.py", line 519, in __call__
    h = self.localization_net(images)
  File "E:\ANA\OCR_ID_DORSAL\see-master\chainer\models\fsns.py", line 182, in __call__
    h = self.bn0(self.conv0(images))
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\links\connection\convolution_2d.py", line 156, in __call__
    x, self.W, self.b, self.stride, self.pad)
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\functions\connection\convolution_2d.py", line 467, in convolution_2d
    y, = fnode.apply(args)
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\function_node.py", line 245, in apply
    outputs = self.forward(in_data)
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\function_node.py", line 338, in forward
    return self.forward_cpu(inputs)
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\functions\connection\convolution_2d.py", line 109, in forward_cpu
    .format(type(W), type(x), type(b)))
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "train_fsns.py", line 305, in <module>
    trainer.run()
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\training\trainer.py", line 313, in run
    six.reraise(*sys.exc_info())
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\six.py", line 693, in reraise
    raise value
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\training\trainer.py", line 302, in run
    entry.extension(self)
  File "C:\Users\EtiqmediaServer\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\training\extensions\evaluator.py", line 137, in __call__
    result = self.evaluate()
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\training\extensions\evaluator.py", line 184, in evaluate
    eval_func(*in_arrays)
  File "E:\ANA\OCR_ID_DORSAL\see-master\chainer\utils\multi_accuracy_classifier.py", line 44, in __call__
    self.y = self.predictor(*x)
  File "E:\ANA\OCR_ID_DORSAL\see-master\chainer\models\fsns.py", line 519, in __call__
    h = self.localization_net(images)
  File "E:\ANA\OCR_ID_DORSAL\see-master\chainer\models\fsns.py", line 182, in __call__
    h = self.bn0(self.conv0(images))
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\links\connection\convolution_2d.py", line 156, in __call__
    x, self.W, self.b, self.stride, self.pad)
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\functions\connection\convolution_2d.py", line 467, in convolution_2d
    y, = fnode.apply(args)
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\function_node.py", line 245, in apply
    outputs = self.forward(in_data)
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\function_node.py", line 338, in forward
    return self.forward_cpu(inputs)
  File "C:\Users\server\AppData\Local\Programs\Python\Python36\lib\site-packages\chainer\functions\connection\convolution_2d.py", line 109, in forward_cpu
    .format(type(W), type(x), type(b)))
ValueError: numpy and cupy must not be used together
type(W): <class 'cupy.core.core.ndarray'>, type(x): <class 'numpy.ndarray'>, type(b): <class 'cupy.core.core.ndarray'>

This is interesting. I am sorry for asking you so many questions!

Bartzi commented 6 years ago

Yes, this is really interesting. I'm not 100% sure why this happens. It should not happen as long as you specify something like --gpu 0 when running the train command.

anavc94 commented 6 years ago

Hello @Bartzi, I was able to solve my problem. When I changed train_fsns.py to use one GPU instead of multi-GPU, I made a mistake. I'm going to copy the code here to help anyone who's trying to do the same as me, and then close this issue. Thank you so much for your help and your answers.

train_fsns.txt

lucifer2859 commented 6 years ago

Hello @anavc94, how do I install cupy on Windows?

Bartzi / see

Error training FSNS #23