Crash in monocular SLAM only use case

lukes3315 commented 3 years ago

Hi,

Great work on this GPU based pose graph optimization.

I have been encountering issues with a pose graph constructed with monocular observations only and an error has been occurring:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: an illegal memory access was encountered

I reproduced consistently when building my pose graph without stereo edges.

To reproduce it with the given examples I just commented out the stereo edge loading in the readGraph method, which causes the aforementioned error.

Interestingly the crash isn't consistent, the memory error will only occur every n runs:

HSC RESIZE: 247 26127
HSC RESIZE: 247 247
cols: 26127 247
nnz: 20228 3477
---- OPTIMIZATION SUCCESS
HSC RESIZE: 247 26127
HSC RESIZE: 247 247
cols: 26127 247
nnz: 20228 3477
---- OPTIMIZATION SUCCESS
HSC RESIZE: 247 26127
HSC RESIZE: 247 247
cols: 26127 247
nnz: 20228 3477
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: an illegal memory access was encountered

Any idea if I am doing anything wrong or if the cuda bundle adjustment was not designed for having a non stereo edge based pose graph?

Thanks, Cheers.

lukes3315 commented 3 years ago

Nevermind, I think it is more of system level problem and conflict with other things using the GPU than a problem on the cuda-bundle-adjustment side.

I will post a fix when I find one.

Thanks, Cheers.

ghost commented 3 years ago

Nevermind, I think it is more of system level problem and conflict with other things using the GPU than a problem on the cuda-bundle-adjustment side.

I will post a fix when I find one.

Thanks, Cheers.

Hi @lukes3315 , I have met the exactly the same problem, is there any solutions updated? thanks.

CUDA 9.2 
Ubuntu 16.04

sorry, I have figured out. It is a OOM problem on my machine(GTX 1070 Ti), the sequence 07 demo is OK, while the sequence 00 is not.

sapir1779 commented 3 years ago

I get the same errors. Is the library not suited for such cases where there aren't any stereo edges? Too bad you've closed this

atakagi-fixstars commented 3 years ago

Hi,

I could reproduce that error in the following environment.

GTX 1050 Ti
commenting out the stereo edge loading
input ba_kitti_00.json

The error occurred in thrust::sort in findHschureMulBlockIndices.

void findHschureMulBlockIndices(const GpuHplBlockMat& Hpl, const GpuHscBlockMat& Hsc,
    GpuVec3i& mulBlockIds)
{
    const int block = 1024;
    const int grid = divUp(Hpl.cols(), block);

    DeviceBuffer<int> nindices(1);
    nindices.fillZero();

    findHschureMulBlockIndicesKernel<<<grid, block>>>(Hpl.cols(), Hpl.outerIndices(), Hpl.innerIndices(),
        Hsc.outerIndices(), Hsc.innerIndices(), mulBlockIds, nindices);
    CUDA_CHECK(cudaGetLastError());

    std::cout << "sort size: " << mulBlockIds.size() << std::endl;

    auto ptrSrc = thrust::device_pointer_cast(mulBlockIds.data());
    try
    {
        thrust::sort(ptrSrc, ptrSrc + mulBlockIds.size(), LessRowId());
    }
    catch (const thrust::system_error& e)
    {
        std::cout << e.what() << std::endl;
    }
}

When I run above code, then I get the error.

sort size: 294114
merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

I wonder why this happens in monocular only case. In "with stereo" case, the sort size is 2308093 and much larger than that in monocular only case, but the error didn't occur.

I will continue to investigate.

atakagi-fixstars commented 3 years ago

Hi,

I finally found the cause. The problem is not thrust::sort but the preceding DeviceBuffer initialization.

DeviceBuffer<int> nindices(1);
nindices.fillZero();

There was a bug of uninitialized value. In the above code, nindices is allocated memory but size was not set and still zero, so nindices.fillZero() had no effect on it.

The bug was fixed in https://github.com/fixstars/cuda-bundle-adjustment/commit/29d4b87e5120afa9b429a4e8b119113f1f4e555a, and the error don't occur. Thank you for your cooperation.

Regards,

ghost commented 3 years ago

@atakagi-fixstars

I have encountered another bug. In KITTI demo, when I set all landmark position fixed, I got a memory error:

i.e. change this line of code here:

// auto v = obj.create<cuba::LandmarkVertex>(id, Xw, fixed);
auto v = obj.create<cuba::LandmarkVertex>(id, Xw, 1);

Reading Graph... terminate called after throwing an instance of 'thrust::system::system_error'
  what():  scan failed to get memory buffer: invalid configuration argument

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

atakagi-fixstars commented 3 years ago

Hi, @carpdm

Thank you for notifying. We need to manage the issue in #7 .

Regards,

fixstars / cuda-bundle-adjustment

Crash in monocular SLAM only use case #3