apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

ImageRecordIOParser std::bad_alloc error during or after decoding #5525

Closed ysh329 closed 7 years ago

ysh329 commented 7 years ago

Environment info

Jetson TX1, 4 GB RAM, Ubuntu 16.04 64-bit, MXNet 0.94

  1. RAM should be sufficient; I ruled out memory as the cause first. I also built a .rec file smaller than 1 MB with im2rec.py (roughly as in the sketch below), but the same error message still appears.
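
For reference, this is a minimal sketch of how a tiny test .rec can be built with tools/im2rec.py; the prefix, subset directory, resize value, and quality below are placeholders rather than the exact commands:

# build an image list from a small subset of class sub-directories (placeholder paths)
python ~/mxnet/tools/im2rec.py --list --recursive caltech256-small ./small_subset/
# pack the listed images into caltech256-small.rec
python ~/mxnet/tools/im2rec.py --resize 96 --quality 90 caltech256-small ./small_subset/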

I use the Caltech-256 dataset from mxnet/example/image-classification/data/caltech256.sh and train a DeepID model based on mxnet/example/image-classification/train_cifar10.py.

Error Message:

(py-mxnet-SSD) yuanshuai@tegra-ubuntu:~/sdcard/code/mxnet_inference/deepid$ python train_cifar10_resize224.py
INFO:root:start with arguments Namespace(batch_size=128, benchmark=0, data_nthreads=4, data_train='/home/yuanshuai/sdcard/code/mxnet/example/image-classification/data/caltech256-train.rec', data_val='/home/yuanshuai/sdcard/code/mxnet/example/image-classification/data/caltech256-val.rec', disp_batches=20, gpus='0', image_shape='3,224,224', kv_store='device', load_epoch=None, lr=0.0005, lr_factor=0.1, lr_step_epochs='200,250', max_random_aspect_ratio=0, max_random_h=36, max_random_l=50, max_random_rotate_angle=0, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0, min_random_scale=1, model_prefix='./deepid-caltech-256', mom=0.9, monitor=0, network=None, num_classes=256, num_epochs=1, num_examples=25574, num_layers=None, optimizer='sgd', pad_size=4, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, wd=0.0001)
[09:58:25] src/io/iter_image_recordio.cc:209: ImageRecordIOParser: /home/yuanshuai/sdcard/code/mxnet/example/image-classification/data/caltech256-train.rec, use 1 threads for decoding..
[09:58:30] src/io/iter_image_recordio.cc:209: ImageRecordIOParser: /home/yuanshuai/sdcard/code/mxnet/example/image-classification/data/caltech256-val.rec, use 1 threads for decoding..
Killed
(py-mxnet-SSD) yuanshuai@tegra-ubuntu:~/sdcard/code/mxnet_inference/deepid$ python train_cifar10_resize224.py 
INFO:root:start with arguments Namespace(batch_size=128, benchmark=0, data_nthreads=4, data_train='/home/yuanshuai/sdcard/code/mxnet/example/image-classification/data/caltech256-train.rec', data_val='/home/yuanshuai/sdcard/code/mxnet/example/image-classification/data/caltech256-val.rec', disp_batches=20, gpus='0', image_shape='3,224,224', kv_store='device', load_epoch=None, lr=0.0005, lr_factor=0.1, lr_step_epochs='200,250', max_random_aspect_ratio=0, max_random_h=36, max_random_l=50, max_random_rotate_angle=0, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0, min_random_scale=1, model_prefix='./deepid-caltech-256', mom=0.9, monitor=0, network=None, num_classes=256, num_epochs=1, num_examples=25574, num_layers=None, optimizer='sgd', pad_size=4, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, wd=0.0001)
[09:58:41] src/io/iter_image_recordio.cc:209: ImageRecordIOParser: /home/yuanshuai/sdcard/code/mxnet/example/image-classification/data/caltech256-train.rec, use 1 threads for decoding..
[09:58:47] src/io/iter_image_recordio.cc:209: ImageRecordIOParser: /home/yuanshuai/sdcard/code/mxnet/example/image-classification/data/caltech256-val.rec, use 1 threads for decoding..
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted

Key Code

import os
import argparse
# helpers from the example's common/ package (argument parsing, data iterators, training loop)
from common import data, fit
from common.util import download_file  # only needed if the download lines below are re-enabled

def download_cifar10():
    data_dir = "data"
    fnames = (os.path.join(data_dir, "cifar10_train.rec"),
              os.path.join(data_dir, "cifar10_val.rec"))
    #download_file('http://data.mxnet.io/data/cifar10/cifar10_val.rec', fnames[1])
    #download_file('http://data.mxnet.io/data/cifar10/cifar10_train.rec', fnames[0])
    # point at the pre-generated Caltech-256 .rec files instead of downloading CIFAR-10
    fnames = list(fnames)
    fnames[0] = '/home/yuanshuai/sdcard/code/mxnet/example/image-classification/data/caltech256-train.rec'
    fnames[1] = '/home/yuanshuai/sdcard/code/mxnet/example/image-classification/data/caltech256-val.rec'
    return fnames

if __name__ == '__main__':
    # download data
    (train_fname, val_fname) = download_cifar10()

    # parse args
    parser = argparse.ArgumentParser(description="train cifar10",
                                     formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    fit.add_fit_args(parser)
    data.add_data_args(parser)
    data.add_data_aug_args(parser)
    data.set_data_aug_level(parser, 2)
    parser.set_defaults(
        # network
        #network        = 'resnet',
        #num_layers     = 110,
        # data
        data_train     = train_fname,
        data_val       = val_fname,
        num_classes    = 256,
        num_examples   = 25574,
        image_shape    = '3,224,224',#90
        pad_size       = 4,
        # train
        gpus           = '0',
        batch_size     = 128,
        num_epochs     = 1,#300
        lr             = .0005,#.05
        lr_step_epochs = '200,250',
        model_prefix   = './deepid-caltech-256'
    )
    args = parser.parse_args()

    # load network
    from importlib import import_module
    #net = import_module('symbols.'+args.network)
    #sym = net.get_symbol(**vars(args))
    sym = get_symbol(256)  # DeepID symbol for 256 classes, defined in the complete code

    # train
    fit.fit(args, sym, data.get_rec_iter)

Complete Code

piiswrong commented 7 years ago

@ptrendx

ysh329 commented 7 years ago

Similar to these still-open issues below :cry:

#4299

#2099

#2113

ptrendx commented 7 years ago

Could you try setting the environment variable MXNET_ENGINE_TYPE to NaiveEngine, setting DEBUG to 1 in config.mk, rebuilding, and running through gdb? Then, once the error occurs, type bt and paste the output here. That will give a better call stack, so we will be able to see where the issue actually occurs.
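
Roughly like this (a minimal sketch; the script name is taken from the log above, and the rebuild step assumes the Makefile-based build with config.mk):

# set DEBUG = 1 in config.mk, then rebuild, e.g.:
make -j4
# run the training script under gdb with the naive engine:
export MXNET_ENGINE_TYPE=NaiveEngine
gdb --args python train_cifar10_resize224.py
(gdb) run
# ... after the std::bad_alloc / abort ...
(gdb) bt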

jdhao commented 7 years ago

@ysh329, I met a similar problem when I tried to use MNISTIter; see https://github.com/dmlc/mxnet/issues/2270 for reference.

KeyKy commented 7 years ago

I met the same memory problem when training SSD or ImageNet models using a .rec file.

szha commented 7 years ago

This issue is being closed due to lack of activity in the last 90 days. Feel free to ping me to reopen if this is still an active issue. Thanks! Also, please do check out our forum (and its Chinese version) for general "how-to" questions.