Open caiqi opened 6 years ago
Thanks for submitting the issue @caiqi @mxnet-label-bot [data-loading]
@zhreshold
@zhreshold I changed the code you committed, but the error still exists
@Angzz What OS? Can you print this for me to debug?
import sys
print(sys.getrecursionlimit())
@zhreshold Ubuntu 16.04. I printed the info you mentioned above with Python 2, and the output is 1000
Okay, I modified the search depth to be less aggressive.
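In the meantime, if anyone needs a stop-gap on the user side, raising the Python recursion limit in the training script before the DataLoader workers are created should paper over the error; a minimal sketch (a workaround, not a fix, and the exact value is a guess):

```python
import sys

# The default limit is 1000 (as printed above), which the recursive walk in the
# DataLoader workers can exceed; forked workers inherit the higher limit from the parent.
sys.setrecursionlimit(2000)
```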
@zhreshold OK, I will update to the MXNet pre-release version and run an experiment, thanks
After updating to 1.3.1b20180925, an error occurs when training SSD on COCO, but VOC is normal:
---------------- train log and error log ------------------
INFO:root:Start training from [Epoch 0]
[19:54:19] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:109: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[19:54:28] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:109: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
python: malloc.c:3722: _int_malloc: Assertion `(unsigned long) (size) >= (unsigned long) (nb)' failed.
*** Error in `python': malloc(): memory corruption: 0x00007fe3d29b3690 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fe5b37c87e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8213e)[0x7fe5b37d313e]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7fe5b37d5184]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_Znwm+0x18)[0x7fe5af411e78]
/home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x407eb0)[0x7fe52630beb0]
/home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x40d7c9)[0x7fe5263117c9]
/home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2b88458)[0x7fe528a8c458]
/home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2adcb29)[0x7fe5289e0b29]
/home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2ae6544)[0x7fe5289ea544]
/home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2aea6c2)[0x7fe5289ee6c2]
/home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2ae6c64)[0x7fe5289eac64]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7fe5af43cc80]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7fe5b3b226ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fe5b385841d]
======= Memory map: ========
00400000-006de000 r-xp 00000000 103:02 16254177 /usr/bin/python2.7
008dd000-008de000 r--p 002dd000 103:02 16254177 /usr/bin/python2.7
008de000-00955000 rw-p 002de000 103:02 16254177 /usr/bin/python2.7
00955000-00978000 rw-p 00000000 00:00 0
00c8d000-a94d5000 rw-p 00000000 00:00 0 [heap]
a94d5000-a9806000 rw-p 00000000 00:00 0 [heap]
200000000-200200000 rw-s 00000000 00:06 456 /dev/nvidiactl
200200000-200400000 ---p 00000000 00:00 0
200400000-200404000 rw-s 00000000 00:06 456 /dev/nvidiactl
200404000-200600000 ---p 00000000 00:00 0
200600000-200a00000 rw-s 00000000 00:06 456 /dev/nvidiactl
200a00000-201800000 ---p 00000000 00:00 0
201800000-201804000 rw-s 00000000 00:06 456 /dev/nvidiactl
201804000-201a00000 ---p 00000000 00:00 0
201a00000-201e00000 rw-s 00000000 00:06 456 /dev/nvidiactl
201e00000-201e04000 rw-s 00000000 00:06 456 /dev/nvidiactl
201e04000-202000000 ---p 00000000 00:00 0
202000000-202400000 rw-s 00000000 00:06 456 /dev/nvidiactl
202400000-202404000 rw-s 00000000 00:06 456 /dev/nvidiactl
202404000-202600000 ---p 00000000 00:00 0
202600000-202a00000 rw-s 00000000 00:06 456 /dev/nvidiactl
202a00000-202a04000 rw-s 00000000 00:06 456 /dev/nvidiactl
202a04000-202c00000 ---p 00000000 00:00 0
202c00000-203000000 rw-s 00000000 00:06 456 /dev/nvidiactl
203000000-203004000 rw-s 00000000 00:06 456 /dev/nvidiactl
203004000-203200000 ---p 00000000 00:00 0
203200000-203600000 rw-s 00000000 00:06 456 /dev/nvidiactl
203600000-203604000 rw-s 00000000 00:06 456 /dev/nvidiactl
203604000-203800000 ---p 00000000 00:00 0
203800000-203c00000 rw-s 00000000 00:06 456 /dev/nvidiactl
203c00000-203c04000 rw-s 00000000 00:06 456 /dev/nvidiactl
203c04000-203e00000 ---p 00000000 00:00 0
203e00000-204200000 rw-s 00000000 00:06 456 /dev/nvidiactl
204200000-204204000 rw-s 00000000 00:06 456 /dev/nvidiactl
204204000-204400000 ---p 00000000 00:00 0
204400000-204800000 rw-s 00000000 00:06 456 /dev/nvidiactl
204800000-204804000 rw-s 00000000 00:06 456 /dev/nvidiactl
204804000-204a00000 ---p 00000000 00:00 0
204a00000-204e00000 rw-s 00000000 00:06 456 /dev/nvidiactl
204e00000-204e04000 rw-s 00000000 00:06 456 /dev/nvidiactl
204e04000-205000000 ---p 00000000 00:00 0
205000000-205400000 rw-s 00000000 00:06 456 /dev/nvidiactl
205400000-205404000 rw-s 00000000 00:06 456 /dev/nvidiactl
205404000-205600000 ---p 00000000 00:00 0
205600000-205a00000 rw-s 00000000 00:06 456 /dev/nvidiactl
205a00000-205a04000 rw-s 00000000 00:06 456 /dev/nvidiactl
205a04000-205c00000 ---p 00000000 00:00 0
205c00000-206000000 rw-s 00000000 00:06 456 /dev/nvidiactl
206000000-206004000 rw-s 00000000 00:06 456 /dev/nvidiactl
206004000-206200000 ---p 00000000 00:00 0
206200000-206600000 rw-s 00000000 00:06 456 /dev/nvidiactl
206600000-206604000 rw-s 00000000 00:06 456 /dev/nvidiactl
206604000-206800000 ---p 00000000 00:00 0
206800000-206c00000 rw-s 00000000 00:06 456 /dev/nvidiactl
206c00000-206c04000 rw-s 00000000 00:06 456 /dev/nvidiactl
206c04000-206e00000 ---p 00000000 00:00 0
206e00000-207200000 rw-s 00000000 00:06 456 /dev/nvidiactl
207200000-207400000 ---p 00000000 00:00 0
207400000-207600000 rw-s 00000000 00:06 456 /dev/nvidiactl
207600000-207800000 rw-s 00000000 00:06 456 /dev/nvidiactl
207800000-207a00000 ---p 00000000 00:00 0
207a00000-207a04000 rw-s 00000000 00:06 456 /dev/nvidiactl
207a04000-207c00000 ---p 00000000 00:00 0
207c00000-208000000 rw-s 00000000 00:06 456 /dev/nvidiactl
208000000-208e00000 ---p 00000000 00:00 0
208e00000-208e04000 rw-s 00000000 00:06 456 /dev/nvidiactl
208e04000-209000000 ---p 00000000 00:00 0
209000000-209400000 rw-s 00000000 00:06 456 /dev/nvidiactl
209400000-209404000 rw-s 00000000 00:06 456 /dev/nvidiactl
209404000-209600000 ---p 00000000 00:00 0
209600000-209a00000 rw-s 00000000 00:06 456 /dev/nvidiactl
209a00000-209a04000 rw-s 00000000 00:06 456 /dev/nvidiactl
209a04000-209c00000 ---p 00000000 00:00 0
209c00000-20a000000 rw-s 00000000 00:06 456 /dev/nvidiactl
20a000000-20a004000 rw-s 00000000 00:06 456 /dev/nvidiactl
@Angzz Would disabling these lines help? https://github.com/apache/incubator-mxnet/blob/29ac19124555ca838f5f3a01da638eda221b07b2/python/mxnet/gluon/data/dataloader.py#L181-L183
Are you using RecordFiles? If not, it has nothing to do with JPEG images.
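If editing the installed file is inconvenient, an equivalent experiment is to monkey-patch the helper to a no-op before constructing the DataLoader (a sketch, assuming only that the helper is the module-level _recursive_fork_recordio function shown in the tracebacks in this thread):

```python
import mxnet.gluon.data.dataloader as dataloader

# Disable the recursive RecordIO re-open walk in the worker processes.
# Only safe when the dataset holds no RecordFile/MXRecordIO objects, since
# those rely on this hook to re-open their file handles after the fork.
dataloader._recursive_fork_recordio = lambda obj, depth, max_depth=1000: None
```

On Linux the workers are forked from the parent process, so a patch made before the DataLoader is created carries over into them.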
@zhreshold Sorry, I don't understand why deleting these lines would help; if I delete them, won't the recursive mechanism stop working? I do not use RecordFiles, just the images downloaded by the script gluoncv/datasets/mscoco.py. By the way, I find the trouble always occurs with COCO but not VOC; I suspect that once the number of image files reaches a certain amount (as with COCO), the multiprocessing in the DataLoader stops working well (similar to PyTorch) and becomes more aggressive. Thanks for your reply and the awesome work ^_^.
When training reaches epoch 13 on COCO, another error occurs:
[13:44:22] src/resource.cc:262: Ignore CUDA Error
[13:44:22] src/storage/storage.cc:65: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: initialization error
Stack trace returned 10 entries:
[bt] (0) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x379e1a) [0x7fadcc375e1a]
[bt] (1) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x37a451) [0x7fadcc376451]
[bt] (2) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x3024ddd) [0x7fadcf020ddd]
[bt] (3) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x302cab8) [0x7fadcf028ab8]
[bt] (4) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x302ff2c) [0x7fadcf02bf2c]
[bt] (5) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x297291d) [0x7fadce96e91d]
[bt] (6) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x297c414) [0x7fadce978414]
[bt] (7) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2987585) [0x7fadce983585]
[bt] (8) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x29731f8) [0x7fadce96f1f8]
[bt] (9) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2973d24) [0x7fadce96fd24]
[13:44:22] src/engine/threaded_engine_perdevice.cc:99: Ignore CUDA Error
[13:44:22] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: initialization error
Stack trace returned 10 entries:
[bt] (0) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x379e1a) [0x7fadcc375e1a]
[bt] (1) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x37a451) [0x7fadcc376451]
[bt] (2) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x297aea8) [0x7fadce976ea8]
[bt] (3) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2987572) [0x7fadce983572]
[bt] (4) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x29731f8) [0x7fadce96f1f8]
[bt] (5) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2973d24) [0x7fadce96fd24]
[bt] (6) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x3030221) [0x7fadcf02c221]
[bt] (7) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x30302e2) [0x7fadcf02c2e2]
[bt] (8) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x37d45a) [0x7fadcc37945a]
[bt] (9) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x30345c9) [0x7fadcf0305c9]
[13:44:22] src/resource.cc:262: Ignore CUDA Error
[13:44:22] src/storage/storage.cc:65: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: initialization error
Stack trace returned 10 entries:
[bt] (0) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x379e1a) [0x7fadcc375e1a]
[bt] (1) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x37a451) [0x7fadcc376451]
[bt] (2) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x3024ddd) [0x7fadcf020ddd]
[bt] (3) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x302cab8) [0x7fadcf028ab8]
[bt] (4) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x302ff2c) [0x7fadcf02bf2c]
[bt] (5) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x297291d) [0x7fadce96e91d]
[bt] (6) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x297c414) [0x7fadce978414]
[bt] (7) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2987585) [0x7fadce983585]
[bt] (8) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x29731f8) [0x7fadce96f1f8]
[bt] (9) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2973d24) [0x7fadce96fd24]
terminate called after throwing an instance of 'std::system_error'
  what():  Invalid argument
Segmentation fault (core dumped)
Finally, I solved this problem via this link: https://github.com/r9y9/gantts/issues/14, but I don't know why.
@Angzz Not sure why; maybe it is Python related. However, it is not relevant to this thread. I am going to close this issue. Let me know if anyone is still getting the original recursion error.
Hi, I am getting a very similar error:
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/data/dataloader.py", line 178, in worker_loop
_recursive_fork_recordio(dataset, 0, 1000)
File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/data/dataloader.py", line 173, in _recursive_fork_recordio
_recursive_fork_recordio(v, depth + 1, max_depth)
File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/data/dataloader.py", line 173, in _recursive_fork_recordio
_recursive_fork_recordio(v, depth + 1, max_depth)
File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/data/dataloader.py", line 173, in _recursive_fork_recordio
_recursive_fork_recordio(v, depth + 1, max_depth)
[Previous line repeated 970 more times]
File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/data/dataloader.py", line 166, in _recursive_fork_recordio
if depth >= max_depth:
RecursionError: maximum recursion depth exceeded in comparison
I am using the latest cu90mkl docker (mxnet version 1.3.1). Unfortunately, I can't provide you with the exact code for legal reasons.
I have a custom class that inherits from mxnet.gluon.data.Dataset. During the call to __getitem__, a bunch of transforms are called. To speed this up, I tried wrapping the transforms in a mxnet.gluon.data.vision.transforms.Compose, which broke the DataLoader.
Just applying the transforms sequentially works fine, but Composing them results in a RecursionError.
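For illustration, a stripped-down sketch of that pattern (the class name, shapes, and specific transforms here are invented, not the actual code) would be:

```python
import mxnet as mx
from mxnet import gluon
from mxnet.gluon.data.vision import transforms

class ToyDataset(gluon.data.Dataset):
    """Hypothetical stand-in for a custom dataset whose transforms are wrapped in Compose."""
    def __init__(self):
        # Compose stores the transforms as a (Hybrid)Sequential block on the
        # dataset instance, which the DataLoader worker processes then walk recursively.
        self.transform = transforms.Compose([
            transforms.Resize(64),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        img = mx.nd.zeros((128, 128, 3), dtype='uint8')
        return self.transform(img)

# num_workers > 0 is what triggers worker_loop -> _recursive_fork_recordio on the dataset
loader = gluon.data.DataLoader(ToyDataset(), batch_size=4, num_workers=2)
for batch in loader:
    pass
```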
Reopening this issue since it looks like we have a public example now in the lipnet code that can be used to figure out what's going on...
@RuRo has your problem been solved?
It seems that 1000 is too large a max depth for _recursive_fork_recordio in https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/data/dataloader.py#L178. If len(obj.__dict__.items()) > 2 for each object, this function can end up being called more than 2 ** 1000 times.
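For reference, the shape of the helper, reconstructed from the traceback above (the MXRecordIO branch is inferred from the function's name and purpose, so details may differ from the exact source), is roughly:

```python
from mxnet.recordio import MXRecordIO

def _recursive_fork_recordio(obj, depth, max_depth):
    """Walk an object's attributes looking for MXRecordIO handles to re-open after fork."""
    if depth >= max_depth:              # line 166 in the traceback
        return
    if isinstance(obj, MXRecordIO):
        obj.close()
        obj.open()                      # re-obtain the file handle in the worker process
    elif hasattr(obj, '__dict__'):
        for _, v in obj.__dict__.items():
            # every attribute is visited, so a dataset holding a block with many
            # children keeps the walk going until the depth limit is reached
            _recursive_fork_recordio(v, depth + 1, max_depth)   # line 173

# worker_loop calls it as _recursive_fork_recordio(dataset, 0, 1000)   (line 178)
```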
The following code in https://github.com/dmlc/gluon-cv/blob/master/scripts/detection/ssd/train_ssd.py#L96 in gluon-cv causes a
RecursionError: maximum recursion depth exceeded in comparison
error on Windows 10 with the latest build. I found that the reason is that there is a HybridSequential object inside the dataset object, and the HybridSequential contains many children. This function was introduced in commit #12554. Would it be OK to return early from this function when obj is not an instance of mx.gluon.data.dataset.Dataset?