apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

index out of bound error when update eval metric #7664

Open wenhe-jia opened 7 years ago

wenhe-jia commented 7 years ago

Hi, I am training a binary classification model on my own dataset. I use the mx.image.ImageIter API to load raw images according to a .lst file that I generated myself (without using im2rec.py). I set up the data iterator as below:

train_iter = mx.image.ImageIter(
        batch_size   = batch_size,
        data_shape   = data_shape,
        path_imglist = '/database/liveness/data_prepare/liveness_train.lst',
        path_root    = '/',
        data_name    = 'data',
        label_name   = 'softmax_label',
        mean         = np.array([123.68, 116.78, 103.94]),
        resize       = 224,
        rand_mirror  = True,
        shuffle      = False,
        inter_method = 1)

And my .lst file looks like this:

16      1.0     /database/liveness/lf_face/real/5905c125337f3131b4f0856a_image_0.jpg
17      1.0     /database/liveness/lf_face/real/58dd0a71337f311c56026c9e_image_0.jpg
18      1.0     /database/liveness/lf_face/real/58fb11d6de6c741b3501fd52_image_0.jpg
19      1.0     /database/liveness/lf_face/real/59060ef0337f317a0f74a42e_image_0.jpg

Then I started training. The first epoch went well, but an error was reported in the second epoch (epoch 1), as follows:

epoch 0 / batch 3766 ======>> ('cross-entropy', 0.00092357182168862099)
2017-08-30 14:39:53.567102
Time cost(ms) on one batch: 797260
DataBatch: data shapes: [(128L, 3L, 224L, 224L)] label shapes: [(128L,)]
2017-08-30 14:39:54.532427
Traceback (most recent call last):
  File "train.py", line 111, in <module>
    mod.update_metric(metric, batch.label)
  File "/dlproject/incubator-mxnet/python/mxnet/module/module.py", line 735, in update_metric
    self._exec_group.update_metric(eval_metric, labels)
  File "/dlproject/incubator-mxnet/python/mxnet/module/executor_group.py", line 582, in update_metric
    eval_metric.update_dict(labels_, preds)
  File "/dlproject/incubator-mxnet/python/mxnet/metric.py", line 108, in update_dict
    self.update(label, pred)
  File "/dlproject/incubator-mxnet/python/mxnet/metric.py", line 916, in update
    prob = pred[numpy.arange(label.shape[0]), numpy.int64(label)]
IndexError: index 8285818191872 is out of bounds for axis 1 with size 2

Batch 3766 is the second-to-last batch of an epoch, and batch 3767 is the last batch of an epoch. I set up the eval metric in my training script with two components:

eval_metric = mx.metric.CompositeEvalMetric()
eval_metric.add(mx.metric.CrossEntropy())
eval_metric.add(mx.metric.Accuracy())

So what is wrong with my usage? Thanks for your answer!
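For reference, the indexing expression on the failing line of the traceback can be exercised with NumPy alone. The sketch below is not MXNet code: the array values are made up, the helper name `pick_probabilities` is hypothetical, and the idea that the label array holds a garbage value in one slot is an assumption that this thread does not confirm.

```python
import numpy

# Hypothetical softmax outputs for a batch of 4 samples and 2 classes.
pred = numpy.array([[0.9, 0.1],
                    [0.2, 0.8],
                    [0.7, 0.3],
                    [0.5, 0.5]])

# Well-formed labels: every value is a valid class index (0 or 1).
good_label = numpy.array([0.0, 1.0, 0.0, 1.0])

# Assumed failure case: the last slot holds a garbage value
# (the number is copied from the error message above).
bad_label = numpy.array([0.0, 1.0, 0.0, 8285818191872.0])


def pick_probabilities(pred, label):
    """Apply the same indexing expression that appears in the traceback."""
    return pred[numpy.arange(label.shape[0]), numpy.int64(label)]


# With valid labels, this selects one predicted probability per sample.
probs = pick_probabilities(pred, good_label)

# With the garbage label, NumPy raises IndexError on axis 1.
try:
    pick_probabilities(pred, bad_label)
    error_message = None
except IndexError as exc:
    error_message = str(exc)
```

If this assumption holds, any label value outside the range 0 to 1 is enough to trigger the error, whatever the source of that value is.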

techzhou commented 7 years ago

I have the same problem.

changss commented 7 years ago

Same problem, too.

tobechao commented 6 years ago

I have the same problem, too.

wlbksy commented 6 years ago

same problem here

tobechao commented 6 years ago

I add a while loop in image.py:

def next(self):
    ...
    try:
        while i < batch_size:
            label, s = self.next_sample()
            data = self.imdecode(s)
            try:
                self.check_valid_image(data)
            except RuntimeError as e:
                logging.debug('Invalid image, skipping: %s', str(e))
                continue
            data = self.augmentation_transform(data)
            assert i < batch_size, 'Batch size must be multiples of augmenter output length'
            batch_data[i] = self.postprocess_data(data)
            batch_label[i] = label
            i += 1
    except StopIteration:
        if not i:
            raise StopIteration
        while i < batch_size:
            import copy
            batch_data[i] = copy.deepcopy(batch_data[0])
            batch_label[i] = copy.deepcopy(batch_label[0])
            i += 1
    ...

I copy the first sample batch_size - i times to fill the rest of the batch, and it works.
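The padding idea in the comment above can be sketched on plain NumPy arrays, independent of MXNet. The function name `pad_short_batch`, the array shapes, and the sample values are all hypothetical; `filled` plays the role of `i` in the patched loop.

```python
import numpy


def pad_short_batch(batch_data, batch_label, filled):
    """Fill slots filled..batch_size-1 with copies of the first sample.

    `filled` is the number of slots that already hold real samples.
    Returns the number of padded slots.
    """
    batch_size = batch_data.shape[0]
    if filled == 0:
        raise ValueError("cannot pad a batch that has no real samples")
    # Broadcasting copies the first sample into every remaining slot.
    batch_data[filled:] = batch_data[0]
    batch_label[filled:] = batch_label[0]
    return batch_size - filled


# Hypothetical batch of 4 slots where only the first 2 were filled;
# the remaining slots hold stand-in garbage values.
data = numpy.full((4, 3), -1.0)
label = numpy.full((4,), 8285818191872.0)
data[:2] = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
label[:2] = [1.0, 0.0]

pad = pad_short_batch(data, label, filled=2)
```

One trade-off of this approach is that the duplicated sample is counted more than once by the metric for that final batch, so the reported values for the last batch are slightly skewed.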

wewan commented 6 years ago

I was training a binary classification model using a .rec file and met the same problem.

wenhe-jia commented 6 years ago

Maybe we should build our .rec files ourselves to make sure they have no problems.
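One way to rule out bad input is to check the labels before training. The sketch below applies that idea to .lst lines rather than to a .rec file. It assumes one label per line in the layout shown in the excerpt earlier in this thread (index, label, path); the function name `find_bad_labels` and the sample paths are hypothetical.

```python
def find_bad_labels(lst_lines, num_classes):
    """Return (line_number, raw_label) pairs whose label is not a valid class index.

    Assumes each line is "index <whitespace> label <whitespace> path"
    with exactly one label.
    """
    bad = []
    for line_number, line in enumerate(lst_lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(None, 2)
        if len(fields) < 3:
            bad.append((line_number, None))  # malformed line
            continue
        raw_label = fields[1]
        try:
            value = float(raw_label)
        except ValueError:
            bad.append((line_number, raw_label))  # label is not a number
            continue
        # is_integer() is False for nan and inf, so those are rejected too.
        if not value.is_integer() or not 0 <= value < num_classes:
            bad.append((line_number, raw_label))
    return bad


# Hypothetical sample lines; the third and fourth labels are invalid
# for a 2-class problem.
sample_lines = [
    "16\t1.0\t/data/real/a.jpg",
    "17\t0.0\t/data/fake/b.jpg",
    "18\t2.0\t/data/real/c.jpg",
    "19\tabc\t/data/real/d.jpg",
]
bad = find_bad_labels(sample_lines, num_classes=2)
```

An empty result only shows that the listed labels are valid class indices; it does not rule out other causes of the error.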

vandanavk commented 5 years ago

@mxnet-label-bot add [Metric]

anirudhacharya commented 5 years ago

@LeonJWH @techzhou @changss @tobechao @wlbksy can one of you please share a minimal reproducible example for this bug?