apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Bug in bipartite_matching #13074

Closed by wshuail 5 years ago

wshuail commented 5 years ago

Hi, while using SSD on my own dataset, I found a bug in mx.sym.contrib.bipartite_matching in mxnet 1.3.x.

It is easy to reproduce with the code below, although you may have to run it several times before it fails.

import mxnet as mx
from mxnet import nd

for _ in range(10000):
    x = nd.random.uniform(0, 1, (10, 100))
    output = nd.contrib.bipartite_matching(data=x, threshold=1e-12, is_ascend=False)

The error message varies between runs, but it is always memory-related. Sometimes it reports memory corruption; sometimes it looks like the message below.

Error in `python': double free or corruption (out): 0x00007f246400ce40
======= Backtrace: =========
/lib64/libc.so.6(+0x7d053)[0x7f25523dc053]
/home/xxx/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet2op9SortByKeyIifEEvN7mshadow6TensorINS2_3cpuELi1ET_EENS3_IS4_Li1ET0_EEbPNS3_IS4_Li1EcEEii+0x362)[0x7f24aaf5f2d2]
/home/xxx/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet2op24BipartiteMatchingForwardIN7mshadow3cpuEEEvRKN4nnvm9NodeAttrsERKNS_9OpContextERKSt6vectorINS_5TBlobESaISC_EERKSB_INS_9OpReqTypeESaISHEESG+0x129f)[0x7f24aaf63cff]
/home/xxx/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZZN5mxnet10imperative12PushFComputeERKSt8functionIFvRKN4nnvm9NodeAttrsERKNS_9OpContextERKSt6vectorINS_5TBlobESaISA_EERKS9_INS_9OpReqTypeESaISF_EESE_EEPKNS2_2OpES5_RKNS_7ContextERKS9_IPNS_6engine3VarESaISW_EES10_RKS9_INS_8ResourceESaIS11_EERKS9_IPNS_7NDArrayESaIS17_EES1B_RKS9_IjSaIjEESJ_ENKUlNS_10RunContextEEclES1G+0x2e8)[0x7f24ab33ac18]
/home/xxx/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(+0x3b47dc9)[0x7f24ab754dc9]
/home/xxx/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x599)[0x7f24ab7503d9]
/home/xxx/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN4dmlc11ManualEventEEEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS6_8OprBlockEbENKUlvE_clEvEUlS3_E_E9_M_invokeERKSt9_AnydataS3+0xd2)[0x7f24ab760f32]
/home/xxx/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN4dmlc11ManualEventEEEES6_EEE6_M_runEv+0x44)[0x7f24ab74fd14]
/lib64/libstdc++.so.6(+0xb5220)[0x7f253a198220]
/lib64/libpthread.so.0(+0x7dc5)[0x7f2552e31dc5]
/lib64/libc.so.6(clone+0x6d)[0x7f2552455ced]

This doesn't happen in mxnet 1.1.0, but gluoncv needs mxnet 1.3.0.
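
Since the failure is intermittent and aborts the whole interpreter, one way to observe it without losing the calling session is to run the reproduction in a child process and inspect the exit status. The following is a minimal sketch under stated assumptions: run_isolated and crashed are hypothetical helper names (not part of MXNet), the REPRO string simply wraps the loop from this report, and the nd.waitall() call is added so that queued asynchronous work finishes before the child exits.

```python
import subprocess
import sys

# The reproduction from this report, wrapped as a snippet for a child process.
# Running it requires an MXNet installation; nd.waitall() blocks until all
# queued operations have executed.
REPRO = """
from mxnet import nd
for _ in range(10000):
    x = nd.random.uniform(0, 1, (10, 100))
    output = nd.contrib.bipartite_matching(data=x, threshold=1e-12, is_ascend=False)
nd.waitall()
"""


def run_isolated(snippet, timeout=600):
    """Run a Python snippet in a fresh interpreter.

    Returns (returncode, stderr_text). Hypothetical helper for illustration.
    """
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stderr


def crashed(returncode):
    """On POSIX, a negative return code means the child was killed by a signal,
    for example SIGABRT raised by glibc's heap consistency checks."""
    return returncode < 0
```

Calling run_isolated(REPRO) and passing the return code to crashed should then distinguish a heap-corruption abort from a clean exit or an ordinary Python exception (which exits with a positive code); the captured stderr would contain the glibc message when one is printed.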

frankfliu commented 5 years ago

@mxnet-label-bot [Bug, Operator]

vuvko commented 5 years ago

Can confirm this problem on CPU. On GPU everything is fine.

import mxnet as mx
from mxnet import nd

for _ in range(10):
    x = nd.random.uniform(0, 1, (10, 10))
    output = nd.contrib.bipartite_matching(data=x, threshold=1e-12, is_ascend=False)

results in free(): invalid pointer. While

import mxnet as mx
from mxnet import nd

for _ in range(10):
    x = nd.random.uniform(0, 1, (10, 10), ctx=mx.gpu(0))
    output = nd.contrib.bipartite_matching(data=x, threshold=1e-12, is_ascend=False)

seems to work just fine.
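
The GPU run not crashing does not by itself show that the returned matches are correct, so a small pure-Python reference can help as a cross-check. This is only a sketch based on my reading of the operator documentation, which describes a greedy matching over sorted scores (the SortByKey frame in the backtrace is consistent with that); the function name greedy_bipartite_matching is hypothetical, and the behavior at scores exactly equal to the threshold, as well as tie-breaking between equal scores, are assumptions that may differ from the real operator.

```python
def greedy_bipartite_matching(scores, threshold, is_ascend=False):
    """Greedy matching on a 2-D list of scores (rows x columns).

    Returns (row_match, col_match): row_match[r] is the column matched to
    row r, col_match[c] is the row matched to column c, and -1 means
    unmatched. Assumed semantics, not taken from the MXNet source.
    """
    n_rows = len(scores)
    n_cols = len(scores[0]) if n_rows else 0

    # Flatten to (score, row, col) and sort best-first: descending when the
    # scores are similarities (is_ascend=False), ascending for distances.
    pairs = [(scores[r][c], r, c) for r in range(n_rows) for c in range(n_cols)]
    pairs.sort(key=lambda t: t[0], reverse=not is_ascend)

    row_match = [-1] * n_rows
    col_match = [-1] * n_cols
    for score, r, c in pairs:
        # Stop once the scores pass the threshold (assumption: equality still matches).
        if (score > threshold) if is_ascend else (score < threshold):
            break
        # Accept the pair only if both its row and its column are still free.
        if row_match[r] == -1 and col_match[c] == -1:
            row_match[r] = c
            col_match[c] = r
    return row_match, col_match
```

For scores [[0.5, 0.6], [0.1, 0.2], [0.3, 0.4]] with threshold=1e-12 this returns ([1, -1, 0], [2, 0]). The two arrays returned by nd.contrib.bipartite_matching could be converted with asnumpy() and compared against these lists on inputs without tied scores.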

wshuail commented 5 years ago

https://github.com/apache/incubator-mxnet/pull/13727

https://github.com/dmlc/gluon-cv/issues/529

This was fixed already. Thx.