apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Too large max depth value in _recursive_fork_recordio #12619

Open caiqi opened 6 years ago

caiqi commented 6 years ago

It seems that 1000 is too large a max depth for _recursive_fork_recordio in https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/data/dataloader.py#L178. Even with just two entries in obj.__dict__ at each level, this function can be called on the order of 2 ** 1000 times.

The following code in https://github.com/dmlc/gluon-cv/blob/master/scripts/detection/ssd/train_ssd.py#L96 in gluon-cv causes a RecursionError: maximum recursion depth exceeded in comparison on Windows 10 with the latest build. I found that the reason is that the dataset object holds a HybridSequential, and that HybridSequential contains many children. This function was introduced in #12554. Would it be OK to return early from this function when obj is not an instance of mx.gluon.data.dataset.Dataset?
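For reference, a rough sketch of the guard being proposed (simplified, not the actual MXNet source; as its name suggests, the helper re-opens MXRecordIO handles after a DataLoader worker forks, and the max_depth of 16 here is an arbitrary illustrative value):

import mxnet as mx

def _recursive_fork_recordio(obj, depth, max_depth=16):
    # Simplified sketch: re-open record-file handles in a forked worker.
    if depth >= max_depth:
        return
    if isinstance(obj, mx.recordio.MXRecordIO):
        # Re-open the handle so the worker does not share the parent's file state.
        obj.close()
        obj.open()
    # Proposed guard: only descend into Dataset objects, so that large Gluon
    # block hierarchies (e.g. a HybridSequential stored on the dataset) are
    # never walked and cannot blow up the recursion.
    if not isinstance(obj, mx.gluon.data.dataset.Dataset):
        return
    for value in vars(obj).values():
        _recursive_fork_recordio(value, depth + 1, max_depth)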

stu1130 commented 6 years ago

Thanks for submitting the issue @caiqi @mxnet-label-bot [data-loading]

eric-haibin-lin commented 6 years ago

@zhreshold

zhreshold commented 6 years ago

see https://github.com/apache/incubator-mxnet/pull/12622

Angzz commented 6 years ago

@zhreshold I changed the code as in your commit, but the error still exists.

zhreshold commented 6 years ago

@Angzz What OS are you on? Can you print this for me to debug:

import sys
print(sys.getrecursionlimit())

Angzz commented 6 years ago

@zhreshold Ubuntu 16.04. I printed the info you mentioned above with Python 2, and the output is 1000.
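As a stopgap while the depth value is being fixed, the interpreter's recursion limit can be raised before the DataLoader is built. A minimal sketch, with 10000 chosen arbitrarily; this only hides the symptom of the deep attribute walk and is not a real fix:

import sys

sys.setrecursionlimit(10000)   # default is 1000
print(sys.getrecursionlimit())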

zhreshold commented 6 years ago

Okay, I modified the search depth to be less aggressive.

Angzz commented 6 years ago

@zhreshold OK, I will update to the MXNet pre-release version and run an experiment, thanks.

Angzz commented 6 years ago

After updating to 1.3.1b20180925, an error occurs when training SSD with COCO, but VOC works fine:

---------------- train log and error log ------------------

INFO:root:Start training from [Epoch 0] [19:54:19] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:109: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable) [19:54:28] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:109: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable) python: malloc.c:3722: _int_malloc: Assertion (unsigned long) (size) >= (unsigned long) (nb)' failed. *** Error inpython': malloc(): memory corruption: 0x00007fe3d29b3690 *** ======= Backtrace: ========= /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fe5b37c87e5] /lib/x86_64-linux-gnu/libc.so.6(+0x8213e)[0x7fe5b37d313e] /lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7fe5b37d5184] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_Znwm+0x18)[0x7fe5af411e78] /home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x407eb0)[0x7fe52630beb0] /home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x40d7c9)[0x7fe5263117c9] /home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2b88458)[0x7fe528a8c458] /home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2adcb29)[0x7fe5289e0b29] /home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2ae6544)[0x7fe5289ea544] /home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2aea6c2)[0x7fe5289ee6c2] /home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2ae6c64)[0x7fe5289eac64] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7fe5af43cc80] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7fe5b3b226ba] /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fe5b385841d] ======= Memory map: ======== 00400000-006de000 r-xp 00000000 103:02 16254177 /usr/bin/python2.7 008dd000-008de000 r--p 002dd000 103:02 16254177 /usr/bin/python2.7 008de000-00955000 rw-p 002de000 103:02 16254177 /usr/bin/python2.7 00955000-00978000 rw-p 00000000 00:00 0 00c8d000-a94d5000 rw-p 00000000 00:00 0 [heap] a94d5000-a9806000 rw-p 00000000 00:00 0 [heap] 200000000-200200000 rw-s 00000000 00:06 456 /dev/nvidiactl 200200000-200400000 ---p 00000000 00:00 0 200400000-200404000 rw-s 00000000 00:06 456 /dev/nvidiactl 200404000-200600000 ---p 00000000 00:00 0 200600000-200a00000 rw-s 00000000 00:06 456 /dev/nvidiactl 200a00000-201800000 ---p 00000000 00:00 0 201800000-201804000 rw-s 00000000 00:06 456 /dev/nvidiactl 201804000-201a00000 ---p 00000000 00:00 0 201a00000-201e00000 rw-s 00000000 00:06 456 /dev/nvidiactl 201e00000-201e04000 rw-s 00000000 00:06 456 /dev/nvidiactl 201e04000-202000000 ---p 00000000 00:00 0 202000000-202400000 rw-s 00000000 00:06 456 /dev/nvidiactl 202400000-202404000 rw-s 00000000 00:06 456 /dev/nvidiactl 202404000-202600000 ---p 00000000 00:00 0 202600000-202a00000 rw-s 00000000 00:06 456 /dev/nvidiactl 202a00000-202a04000 rw-s 00000000 00:06 456 /dev/nvidiactl 202a04000-202c00000 ---p 00000000 00:00 0 202c00000-203000000 rw-s 00000000 00:06 456 /dev/nvidiactl 203000000-203004000 rw-s 00000000 00:06 456 /dev/nvidiactl 203004000-203200000 ---p 00000000 00:00 0 203200000-203600000 rw-s 00000000 00:06 456 /dev/nvidiactl 203600000-203604000 rw-s 00000000 00:06 456 /dev/nvidiactl 203604000-203800000 ---p 00000000 00:00 0 203800000-203c00000 rw-s 00000000 00:06 456 /dev/nvidiactl 203c00000-203c04000 rw-s 00000000 00:06 456 /dev/nvidiactl 203c04000-203e00000 ---p 00000000 00:00 0 203e00000-204200000 rw-s 00000000 00:06 456 
/dev/nvidiactl 204200000-204204000 rw-s 00000000 00:06 456 /dev/nvidiactl 204204000-204400000 ---p 00000000 00:00 0 204400000-204800000 rw-s 00000000 00:06 456 /dev/nvidiactl 204800000-204804000 rw-s 00000000 00:06 456 /dev/nvidiactl 204804000-204a00000 ---p 00000000 00:00 0 204a00000-204e00000 rw-s 00000000 00:06 456 /dev/nvidiactl 204e00000-204e04000 rw-s 00000000 00:06 456 /dev/nvidiactl 204e04000-205000000 ---p 00000000 00:00 0 205000000-205400000 rw-s 00000000 00:06 456 /dev/nvidiactl 205400000-205404000 rw-s 00000000 00:06 456 /dev/nvidiactl 205404000-205600000 ---p 00000000 00:00 0 205600000-205a00000 rw-s 00000000 00:06 456 /dev/nvidiactl 205a00000-205a04000 rw-s 00000000 00:06 456 /dev/nvidiactl 205a04000-205c00000 ---p 00000000 00:00 0 205c00000-206000000 rw-s 00000000 00:06 456 /dev/nvidiactl 206000000-206004000 rw-s 00000000 00:06 456 /dev/nvidiactl 206004000-206200000 ---p 00000000 00:00 0 206200000-206600000 rw-s 00000000 00:06 456 /dev/nvidiactl 206600000-206604000 rw-s 00000000 00:06 456 /dev/nvidiactl 206604000-206800000 ---p 00000000 00:00 0 206800000-206c00000 rw-s 00000000 00:06 456 /dev/nvidiactl 206c00000-206c04000 rw-s 00000000 00:06 456 /dev/nvidiactl 206c04000-206e00000 ---p 00000000 00:00 0 206e00000-207200000 rw-s 00000000 00:06 456 /dev/nvidiactl 207200000-207400000 ---p 00000000 00:00 0 207400000-207600000 rw-s 00000000 00:06 456 /dev/nvidiactl 207600000-207800000 rw-s 00000000 00:06 456 /dev/nvidiactl 207800000-207a00000 ---p 00000000 00:00 0 207a00000-207a04000 rw-s 00000000 00:06 456 /dev/nvidiactl 207a04000-207c00000 ---p 00000000 00:00 0 207c00000-208000000 rw-s 00000000 00:06 456 /dev/nvidiactl 208000000-208e00000 ---p 00000000 00:00 0 208e00000-208e04000 rw-s 00000000 00:06 456 /dev/nvidiactl 208e04000-209000000 ---p 00000000 00:00 0 209000000-209400000 rw-s 00000000 00:06 456 /dev/nvidiactl 209400000-209404000 rw-s 00000000 00:06 456 /dev/nvidiactl 209404000-209600000 ---p 00000000 00:00 0 209600000-209a00000 rw-s 00000000 00:06 456 /dev/nvidiactl 209a00000-209a04000 rw-s 00000000 00:06 456 /dev/nvidiactl 209a04000-209c00000 ---p 00000000 00:00 0 209c00000-20a000000 rw-s 00000000 00:06 456 /dev/nvidiactl 20a000000-20a004000 rw-s 00000000 00:06 456 /dev/nvidiactl

zhreshold commented 6 years ago

@Angzz Would disabling these lines help? https://github.com/apache/incubator-mxnet/blob/29ac19124555ca838f5f3a01da638eda221b07b2/python/mxnet/gluon/data/dataloader.py#L181-L183

Are you using RecordFiles? If not, these lines have nothing to do with your JPEG images.
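For anyone unsure whether their dataset actually holds record-file handles (which is what the recursive walk is looking for), here is a small diagnostic sketch; find_recordio and its depth limit are illustrative, not part of MXNet:

import mxnet as mx

def find_recordio(obj, depth=0, max_depth=16, seen=None):
    # Report any MXRecordIO handles reachable from `obj`. A bounded depth and
    # a `seen` set avoid the unbounded walk this issue is about.
    seen = set() if seen is None else seen
    if depth >= max_depth or id(obj) in seen:
        return
    seen.add(id(obj))
    if isinstance(obj, mx.recordio.MXRecordIO):
        print('record file handle found at depth %d: %r' % (depth, obj))
    if hasattr(obj, '__dict__'):
        for value in vars(obj).values():
            find_recordio(value, depth + 1, max_depth, seen)

# find_recordio(train_dataset)  # pass whatever object you give to the DataLoader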

Angzz commented 6 years ago

@zhreshold Sorry, I don't understand why deleting these lines would help; if they are deleted, won't the recursive mechanism stop working? I do not use RecordFiles, just the images downloaded by the script gluoncv/datasets/mscoco.py. By the way, I find the trouble always occurs with COCO but not with VOC. I suspect that once the number of image files reaches a certain amount (as with COCO), the multiprocessing in the DataLoader stops working well (just like in PyTorch) and the problem becomes more severe. Finally, thanks for your reply and the awesome work ^_^.

Angzz commented 6 years ago

When training reaches epoch 13 on COCO, another error occurs:

[13:44:22] src/resource.cc:262: Ignore CUDA Error [13:44:22] src/storage/storage.cc:65: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: initialization error

Stack trace returned 10 entries: [bt] (0) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x379e1a) [0x7fadcc375e1a] [bt] (1) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x37a451) [0x7fadcc376451] [bt] (2) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x3024ddd) [0x7fadcf020ddd] [bt] (3) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x302cab8) [0x7fadcf028ab8] [bt] (4) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x302ff2c) [0x7fadcf02bf2c] [bt] (5) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x297291d) [0x7fadce96e91d] [bt] (6) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x297c414) [0x7fadce978414] [bt] (7) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2987585) [0x7fadce983585] [bt] (8) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x29731f8) [0x7fadce96f1f8] [bt] (9) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2973d24) [0x7fadce96fd24]

[13:44:22] src/engine/threaded_engine_perdevice.cc:99: Ignore CUDA Error [13:44:22] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: initialization error

Stack trace returned 10 entries: [bt] (0) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x379e1a) [0x7fadcc375e1a] [bt] (1) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x37a451) [0x7fadcc376451] [bt] (2) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x297aea8) [0x7fadce976ea8] [bt] (3) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2987572) [0x7fadce983572] [bt] (4) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x29731f8) [0x7fadce96f1f8] [bt] (5) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2973d24) [0x7fadce96fd24] [bt] (6) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x3030221) [0x7fadcf02c221] [bt] (7) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x30302e2) [0x7fadcf02c2e2] [bt] (8) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x37d45a) [0x7fadcc37945a] [bt] (9) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x30345c9) [0x7fadcf0305c9]

[13:44:22] src/resource.cc:262: Ignore CUDA Error [13:44:22] src/storage/storage.cc:65: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: initialization error

Stack trace returned 10 entries: [bt] (0) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x379e1a) [0x7fadcc375e1a] [bt] (1) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x37a451) [0x7fadcc376451] [bt] (2) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x3024ddd) [0x7fadcf020ddd] [bt] (3) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x302cab8) [0x7fadcf028ab8] [bt] (4) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x302ff2c) [0x7fadcf02bf2c] [bt] (5) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x297291d) [0x7fadce96e91d] [bt] (6) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x297c414) [0x7fadce978414] [bt] (7) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2987585) [0x7fadce983585] [bt] (8) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x29731f8) [0x7fadce96f1f8] [bt] (9) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2973d24) [0x7fadce96fd24]

terminate called after throwing an instance of 'std::system_error' what(): Invalid argument Segmentation fault (core dumped)

Angzz commented 6 years ago

Finally I solved this problem by following this link: https://github.com/r9y9/gantts/issues/14, but I don't know why it works.

zhreshold commented 6 years ago

@Angzz Not sure why; it may be Python related. However, it is not relevant to this thread. I am going to close this issue. Let me know if anyone is still getting the original recursion error.

RuRo commented 5 years ago

Hi, I am getting a very similar error:

Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/data/dataloader.py", line 178, in worker_loop
    _recursive_fork_recordio(dataset, 0, 1000)
  File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/data/dataloader.py", line 173, in _recursive_fork_recordio
    _recursive_fork_recordio(v, depth + 1, max_depth)
  File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/data/dataloader.py", line 173, in _recursive_fork_recordio
    _recursive_fork_recordio(v, depth + 1, max_depth)
  File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/data/dataloader.py", line 173, in _recursive_fork_recordio
    _recursive_fork_recordio(v, depth + 1, max_depth)
  [Previous line repeated 970 more times]
  File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/data/dataloader.py", line 166, in _recursive_fork_recordio
    if depth >= max_depth:
RecursionError: maximum recursion depth exceeded in comparison

I am using the latest cu90mkl docker image (MXNet version 1.3.1). Unfortunately, I can't provide you with the exact code, for legal reasons.
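A quick way to check whether an installed build still ships the recursive helper (it is a private module attribute, so this is only a rough diagnostic):

import mxnet as mx
from mxnet.gluon.data import dataloader

print(mx.__version__)
# True on the affected builds that still contain the recursive attribute walk
print(hasattr(dataloader, '_recursive_fork_recordio'))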

I have a custom class that inherits from mxnet.gluon.data.Dataset. During the call to __getitem__, a bunch of transforms are applied. To speed this up, I tried wrapping the transforms in a mxnet.gluon.data.vision.transforms.Compose, which broke the DataLoader.

Just applying the transforms sequentially works fine, but Composing them results in a RecursionError.
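A minimal sketch of the shape of setup being described (MyDataset and the particular transforms are hypothetical stand-ins, not the actual code):

import mxnet as mx
from mxnet.gluon.data import Dataset, DataLoader
from mxnet.gluon.data.vision import transforms

class MyDataset(Dataset):
    # Hypothetical dataset that stores a Compose of transforms on itself.
    def __init__(self):
        self._transform = transforms.Compose([
            transforms.Resize(224),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        img = mx.nd.random.uniform(0, 255, shape=(300, 300, 3)).astype('uint8')
        return self._transform(img)

# With num_workers > 0 on the affected 1.3.x builds, each worker walks the
# dataset's attributes after forking; a Compose stored on the dataset gives
# that walk a deep block hierarchy to descend into, which is where the
# RecursionError was reported.
for batch in DataLoader(MyDataset(), batch_size=4, num_workers=2):
    pass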

aaronmarkham commented 5 years ago

Reopening this issue since it looks like we have a public example now in the lipnet code that can be used to figure out what's going on...

Demohai commented 5 years ago

@RuRo has your problem been solved?