NervanaSystems / neon

Intel® Nervana™ reference deep learning framework committed to best performance on all hardware
http://neon.nervanasys.com/docs/latest
Apache License 2.0

Dtype issues with gpu backend #449

Open · zhiltsov-max opened this issue 6 years ago

zhiltsov-max commented 6 years ago

Hello, I was experimenting with Neon and ran into an issue with the convolutional and pooling layers. The task was image classification, so the input data shape was (3, H, W). If an ArrayIterator or HDF5Iterator is used as the dataset, the input shape values may carry numpy datatypes such as numpy.int64 (for ArrayIterator they come from the lshape parameter; for HDF5Iterator they are read from file['input'].attrs['lshape']). When these values are passed to the model's configure method as in_obj, they are assigned to layer.in_shape, and in_shape is then used to initialize the layer parameters. During the forward pass, the following errors arise:

Layer parameters:

In "<>/neon/backends/convolution.py", line 75, in `__init__`, the parameters

`(N, C, K, D, H, W, T, R, S, M, P, Q, pad_d, pad_h, pad_w, str_d, str_h, str_w, dil_d, dil_h, dil_w)`

have the following values:

| idx | name | type | value |
| --- | --- | --- | --- |
| 0 | N | int | 128 |
| 1 | C | numpy.int64 | 3 |
| 2 | K | int | 32 |
| 3 | D | int | 1 |
| 4 | H | numpy.int64 | 128 |
| 5 | W | numpy.int64 | 128 |
| 6 | T | int | 1 |
| 7 | R | int | 3 |
| 8 | S | int | 3 |
| 9 | M | int | 1 |
| 10 | P | numpy.int64 | 128 |
| 11 | Q | numpy.int64 | 128 |
| 12 | pad_d | int | 0 |
| 13 | pad_h | int | 2 |
| 14 | pad_w | int | 2 |
| 15 | str_d | int | 1 |
| 16 | str_h | int | 1 |
| 17 | str_w | int | 1 |
| 18 | dil_d | int | 1 |
| 19 | dil_h | int | 2 |
| 20 | dil_w | int | 2 |
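For context, here is a minimal sketch (not part of the original report) of how numpy integer scalars end up in lshape; the values mirror the ones above:

```python
import numpy as np

# Shape values read back from a numpy array, or from an HDF5 attribute such
# as file['input'].attrs['lshape'], are numpy scalars rather than built-in
# ints, and they keep that type when stored into layer.in_shape.
lshape = tuple(np.array([3, 128, 128], dtype=np.int64))

print(type(lshape[0]))              # <class 'numpy.int64'>
print(isinstance(lshape[0], int))   # False under Python 3
```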

Casting all parameters to int in the layer initialization fixes the issue for me, but that does not seem like a proper solution. Casting the elements of lshape to int also helps (see the sketch below). I think it would be great if the input values were checked, or converted to the expected types, on the library side. The other layer types (linear, batchnorm, recurrent, etc.) and backends (cpu, mkl) that I used did not suffer from this issue.
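A minimal sketch of the user-side workaround, assuming the stock gen_backend and ArrayIterator APIs from neon (the data and class count here are illustrative):

```python
import numpy as np
from neon.backends import gen_backend
from neon.data import ArrayIterator

be = gen_backend(backend='cpu', batch_size=128)

# Illustrative data: 256 flattened 3x128x128 images with 10 classes.
X = np.random.rand(256, 3 * 128 * 128).astype(np.float32)
y = np.random.randint(0, 10, size=256)

# lshape as it might come back from an HDF5 attribute: numpy.int64 values.
lshape = tuple(np.array([3, 128, 128], dtype=np.int64))

# Workaround: coerce every dimension to a built-in int before constructing
# the iterator, so downstream layers never see numpy scalar types.
lshape = tuple(int(d) for d in lshape)
train_set = ArrayIterator(X, y, nclass=10, lshape=lshape)
```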

Environment: Python 3.5.2, neon 2.6.0 (f9d771bbb5f5fa3ae129748596d0ced5389c7f88), CUDA 8.0, GPU K40s, Ubuntu 16.04, Boost 1.58.0, PyCUDA 2017.1.1, NumPy 1.13.1.

baojun-nervana commented 6 years ago

@zhiltsov-max Agreed. A type check is needed here.
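One possible shape for such a check, as a sketch rather than actual neon code; `as_int` is a hypothetical helper that the backend constructors could run over their dimension, padding, stride, and dilation arguments:

```python
import numpy as np

def as_int(*values):
    """Coerce shape/stride parameters to built-in int.

    numpy integer scalars convert cleanly through int(); anything
    non-integral raises, surfacing bad input at construction time
    instead of during the forward pass.
    """
    coerced = []
    for v in values:
        i = int(v)
        if i != v:
            raise TypeError("expected an integral value, got %r" % (v,))
        coerced.append(i)
    return coerced

# Mixed Python ints and numpy.int64 all come back as plain int.
N, C, H, W = as_int(128, np.int64(3), np.int64(128), np.int64(128))
assert all(type(v) is int for v in (N, C, H, W))
```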