apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

How to deal with huge image record file #1411

Closed LiamZhuuu closed 8 years ago

LiamZhuuu commented 8 years ago

I am training on an 11 GB image record file and the process always gets killed because of an out-of-memory error. Does mxnet load the entire image record file into memory? If so, what can I do about it? Thanks!

piiswrong commented 8 years ago

What does your training script look like, and what error are you seeing?

LiamZhuuu commented 8 years ago

A pretty standard one: data is loaded with mx.io.ImageRecordIter and Inception_BN is fine-tuned with FeedForward.fit. The log shows:

[3108734.905849] Out of memory: Kill process 10617 (python) score 195 or sacrifice child
[3108734.907546] Killed process 10617 (python) total-vm:147094076kB, anon-rss:15868116kB, file-rss:344944kB

My desktop has 3 Titan X cards and 32 GB of memory (but only 16 GB free for training).
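For context, a minimal sketch of what such a fine-tuning setup roughly looked like with the old mx.model.FeedForward API; the checkpoint prefix/epoch, record path, image shape, and hyperparameters below are placeholders rather than the poster's actual values:

import mxnet as mx

# Placeholder: load a pretrained Inception-BN checkpoint (prefix and epoch are assumptions)
sym, arg_params, aux_params = mx.model.load_checkpoint('Inception_BN', 39)

batch_size = 384
train_iter = mx.io.ImageRecordIter(
    path_imgrec='train.rec',       # placeholder path to the 11 GB record file
    data_shape=(3, 224, 224),
    batch_size=batch_size,
    rand_crop=True,
    rand_mirror=True)

model = mx.model.FeedForward(
    symbol=sym,
    ctx=[mx.gpu(i) for i in range(3)],   # three GPUs
    num_epoch=10,
    learning_rate=0.01,
    arg_params=arg_params,
    aux_params=aux_params)

model.fit(X=train_iter,
          batch_end_callback=mx.callback.Speedometer(batch_size, 50))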

piiswrong commented 8 years ago

When does this happen? At the start of training or after a while?

piiswrong commented 8 years ago

ImageRecordIter definitely won't load the entire record file into memory. This must be caused by something else.

LiamZhuuu commented 8 years ago

At the beginning of training. But I find that training takes more than 10 GB of memory besides GPU memory. What is that part of memory used for? (My batch size is 384; when I switch to 256, it gets killed after one epoch.)

LiamZhuuu commented 8 years ago

Update: I upgraded the memory and now training doesn't get killed. I'm still wondering why training takes so much memory (17 GB).

piiswrong commented 8 years ago

@tqchen Any thoughts?

tqchen commented 8 years ago

There is a prefetcher buffer in mxnet. MXNet of course won't load the entire dataset into memory, but this could be due to prefetch buffers that are too large. Try changing the prefetch_buffer parameter; see http://mxnet.readthedocs.org/en/latest/python/io.html
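For example (a sketch with a placeholder record path; my understanding is that prefetch_buffer counts prefetched batches), the buffer size can be set when constructing the iterator:

train_iter = mx.io.ImageRecordIter(
    path_imgrec='train.rec',   # placeholder path to the record file
    data_shape=(3, 224, 224),
    batch_size=384,
    prefetch_buffer=1)         # keep only one prefetched batch in memory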

There are multiple places where prefetching and multi-threaded decoding happen, and some of them are not yet configurable, e.g.

https://github.com/dmlc/mxnet/blob/master/src/io/iter_image_recordio.cc#L324

@piiswrong can you confirm whether the memory consumption is normal on your side as well?

fengjunlv commented 8 years ago

I am having the same problem. Today I updated and recompiled the code to the latest version, but now training uses almost 31 GB (inception-bn model, batch_size=128), which slows down my Ubuntu OS significantly, and the speed is less than 20 samples/sec. It used to be 56 samples/sec a couple of days ago.

piiswrong commented 8 years ago

@fengjunlv Thanks for reporting the issue. Could you do a bisect (https://git-scm.com/docs/git-bisect) and pinpoint the change that caused this?

tqchen commented 8 years ago

This could be due to my recent change to dmlc-core that increased the ingestion buffer size. I have now reverted dmlc-core back to the same setting as in this commit: https://github.com/dmlc/dmlc-core/commit/2d1db230e5fb0bf32bfed4e425b6fe20d05aa1c3

@piiswrong please send a PR to update dmlc-core and see if this issue is fixed

fengjunlv commented 8 years ago

@piiswrong Unfortunately, I am not sure which version was the one that used to work; it was pulled and compiled quite a few days ago. Hopefully what @tqchen has found will fix the issue.

piiswrong commented 8 years ago

@tqchen That doesn't seem to fix it on my machine. Training still takes 16 GB of RAM.

tqchen commented 8 years ago

@piiswrong I see. Can you look a bit into what happened?

piiswrong commented 8 years ago

I tried, but couldn't find a working version to start bisecting from. Memory consumption starts when the ImageRecordIter is created, before training starts, so you shouldn't need a GPU to debug this.
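A minimal sketch of how this could be reproduced on CPU only, watching resident memory with the standard resource module (Unix-only; the record path is a placeholder):

import resource
import mxnet as mx

def rss_mb():
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

print('before iterator:  %.0f MB' % rss_mb())
rec_iter = mx.io.ImageRecordIter(
    path_imgrec='train.rec',   # placeholder path
    data_shape=(3, 224, 224),
    batch_size=384)
print('after iterator:   %.0f MB' % rss_mb())

for _ in range(10):
    batch = rec_iter.next()
print('after 10 batches: %.0f MB' % rss_mb())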

LiamZhuuu commented 8 years ago

@tqchen Hi, thanks for the suggestion, but the approach you propose doesn't significantly reduce the memory usage.

erogol commented 8 years ago

Same problem here. I use prefetch_buffer = 1, but my data still fills all 30 GB of memory and then raises an out-of-memory error.

I also have an old version of mxnet that works with no problem. It seems something in a recent update broke the code.

ghost commented 8 years ago

The newest version doesn't seem to require so much memory. But after updating to the newest version, I got a problem:

src/io/iter_image_recordio.cc:56: Check failed: p != end Bad ImageList format.

My ImageRecordIter objects are defined as follows:

batch_size = 32

train_dataiter = mx.io.ImageRecordIter(
        mean_r = 123.68,
        mean_g = 116.779,
        mean_b = 103.939,
        shuffle=True,
        path_imgrec="/home/marchhare/data/train.rec",
        path_imglist="/home/marchhare/Desktop/train.lst",
        rand_crop=True,
        rand_mirror=True,
        data_shape=(3,256,256),
        batch_size=batch_size,
        label_width=7)

test_dataiter = mx.io.ImageRecordIter(
        mean_r = 123.68,
        mean_g = 116.779,
        mean_b = 103.939,
        path_imgrec="/home/marchhare/data/test.rec",
        path_imglist="/home/marchhare/Desktop/test.lst",
        rand_crop=False,
        rand_mirror=False,
        data_shape=(3,256,256),
        batch_size=batch_size,
        round_batch=False,
        label_width=7) 

Am I missing something for multi-label with a single image? Thanks.
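In case it helps, a sketch of the .lst layout I would expect for label_width=7: tab-separated lines with an integer index, then seven numeric labels, then the image path (the file names and label values below are placeholders, not the poster's data):

# Writing a multi-label .lst file for label_width=7:
# each line is: index <tab> label_1 ... label_7 <tab> relative_image_path
samples = [
    (0, [1, 0, 0, 1, 0, 0, 1], 'images/img_0001.jpg'),
    (1, [0, 1, 1, 0, 0, 1, 0], 'images/img_0002.jpg'),
]
with open('train.lst', 'w') as f:
    for idx, labels, path in samples:
        fields = [str(idx)] + ['%g' % v for v in labels] + [path]
        f.write('\t'.join(fields) + '\n')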

fengjunlv commented 8 years ago

@piiswrong and @tqchen Now it only takes ~9 GB of CPU RAM in my case. Problem solved. Awesome. Thanks.