IntelRealSense / librealsense

Intel® RealSense™ SDK
https://www.intelrealsense.com/
Apache License 2.0

rs2::pipeline::stop freezes/hangs/deadlocks sometimes #9184

Closed svebert closed 3 years ago

svebert commented 3 years ago

Required Info
Camera Model: D415
Firmware Version: 5.11.06.250
Operating System & Version: Win10 Pro 10.0.19041 Build 19041
Kernel Version (Linux Only):
Platform: PC, Intel i5-9400 @ 2.9 GHz, 6 cores
SDK Version: 2.42
Language: C++
Segment: others

Issue Description

I am debugging a very nasty issue with a RealSense D415 that randomly disconnects from the USB controller. On the test PC the camera disconnects randomly and temporarily for an unknown reason. You hear the typical Windows "unplugging" sound, the camera disappears from the Device Manager and reappears shortly afterwards. When the camera disappears from the Device Manager, oPipeline.wait_for_frames(5000U) throws the error Frame didn't arrive within 5000. When this error occurs, I stop the pipeline and restart it. This works 3 or 4 times in a row and would solve my problem. But then, after some successful restarts, oPipeline.stop() blocks forever on the next restart.

I tried some workarounds:

1. Instead of stopping the pipeline, I stopped and closed the underlying rs2::sensor objects. The same issue occurs here: the call to oSensor.close() blocks indefinitely.
2. I called oPipeline.stop() from a deferred thread. But this crashes with unpredictable memory access violations later on, when calling oPipeline.start() or sometimes just randomly.

My issue is not the random disconnect on my test PC. My problem is that in some cases I cannot recover from it. How do I stop and dispose of all rs2 resources properly? How do I avoid this infinite blocking?

Here is my minimal example:

#include <iostream>
#include <librealsense2/rs.hpp>

int main(int /*argc*/, char* /*argv*/[ ])
{   
    try
    {

        rs2::context oRealSenseContext;
        int nRestartCounter = 0;
        while (nRestartCounter < 5)
        {
            std::cout << "config..." << std::endl;
            rs2::pipeline oPipeline{ oRealSenseContext };
            rs2::config oConfig;
            oConfig.enable_stream(RS2_STREAM_COLOR);
            oConfig.enable_stream(RS2_STREAM_DEPTH);
            std::cout << "starting pipeline..." << std::endl;
            rs2::pipeline_profile oProfile = oPipeline.start(oConfig);
            std::cout << "...ok" << std::endl;
            while (true)
            {
                bool bHaveError{ false };
                rs2::frameset oFrameset;
                try
                {
                    oFrameset = oPipeline.wait_for_frames(5000U);

                }
                catch (const rs2::error& crRS2Exception)
                {
                    std::cout << "capture loop: rs2 error " << crRS2Exception.what() << std::endl;
                    bHaveError = true;
                }
                catch (const std::exception& croException)
                {
                    std::cout << "capture loop: std error " << croException.what() << std::endl;
                    bHaveError = true;
                }
                catch (...)
                {
                    std::cout << "capture loop: unknown error " << std::endl;
                    bHaveError = true;
                }

                if (bHaveError)
                {
                    break;
                }

                if (oFrameset.size() == 0)
                {
                    std::cout << "capture loop: empty frameset" << std::endl;
                }
                else
                {
                    std::cout << "capture loop: received frameset containing " << oFrameset.size() << " frames " << std::endl;
                }
            }
            std::cout << "stopping pipeline..." << std::endl;
            oPipeline.stop();
            std::cout << "...ok" << std::endl;
        }
    }
    catch (const rs2::error& crRS2Exception)
    {
        std::cout << "rs2 error " << crRS2Exception.what() << std::endl;
        return -1;
    }
    catch (const std::exception& croException)
    {
        std::cout << "Std error " << croException.what() << std::endl;
        return -2;
    }
    catch (...)
    {
        std::cout << "unknown error " << std::endl;
        return -3;
    }

    return 0;
}

The RealSense Viewer itself also shows major reconnection issues on this test PC. I have a USB3 connection and a fully USB3-compliant cable. (And yes, maybe the firmware is too old or there is a problem with the camera and/or the PC. But our company uses hundreds or even thousands of these cameras at various customer sites. I need to understand how to avoid the deadlock in software. I don't care about broken cameras, but I don't want deadlocks when a broken camera is connected. It should simply fail to work, not block forever.)

Any suggestions? Any fixes?

MartyG-RealSense commented 3 years ago

Hi @svebert Ideally a modern firmware would be used with SDK 2.42.0, but I understand the challenge of updating hundreds / thousands of cameras.

As a starting point in investigating your case, could you follow the steps below please.

STEP ONE Open the Device Manager of Windows whilst a camera is plugged in. Go to the View menu and select the menu option Show Hidden Devices.

If only one camera is plugged in then there should be only two drivers under the Cameras category of the Device Manager: RGB and Depth. There may be a multitude of hidden copies of RealSense drivers that are revealed by the View option though.

STEP TWO If there are multiple RealSense driver entries shown, right-click on each one individually and select the Uninstall device menu option to remove it from the Device Manager.

STEP THREE When all RealSense drivers have been removed, unplug the camera from the USB port, wait a couple of seconds and plug it back in. Windows should automatically reinstall the RGB and Depth drivers, and the Device Manager should then correctly show only one pair of drivers.

If there were multiple RealSense drivers revealed by the View option, does removing them improve the stability of your Windows test computer please?

svebert commented 3 years ago

Hi MartyG-RealSense!

I ran through all the steps. Unfortunately, this did not change the behaviour. After approximately 6 restarts, the program hangs in oPipeline.stop().

MartyG-RealSense commented 3 years ago

I ran some further tests using the Device Manager. I found that the drivers disappear from the Device Manager if a Hardware Reset of the camera is performed in the RealSense Viewer. They should then return automatically to the Cameras category of the Device Manager after the reset. A reset of the camera has the same effect as unplugging it from the USB port and re-inserting it.

I found though that with one of the USB ports, if the Device Manager is open during the reset then the drivers would not reappear and the camera could not complete its reset in the Viewer - it would disappear from the options side panel but not return after reset. It would only reset successfully if the Device Manager was closed during the reset. The other USB 3 port on the same computer had no such problems though, with the camera returning in both the Device Manager and Viewer side-panel.

svebert commented 3 years ago

Hi Marty,

To state it clearly: my problem is not that the cameras appear or disappear in the Device Manager. The major problem is that, when this happens, the call to rs2::pipeline::stop can deadlock. And this is a state from which I cannot recover. In the end, the whole application has to be killed and restarted.
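
One way to keep the calling thread from blocking forever is to run the blocking call on a separate thread and stop waiting after a deadline. The sketch below shows that pattern using only the standard library. run_with_timeout is a hypothetical helper, not part of librealsense, and it does not fix the hang itself: a call that never returns leaves its worker thread stuck until the process exits.

```cpp
#include <chrono>
#include <functional>
#include <future>
#include <memory>
#include <thread>

// Hypothetical helper (not a librealsense API).
// Runs `task` on a detached worker thread and waits at most `timeout` for it.
// Returns true if the task finished in time, false if it is still running.
inline bool run_with_timeout(std::function<void()> task,
                             std::chrono::milliseconds timeout)
{
    // The promise lives in a shared_ptr so the worker can still fulfil it
    // safely after the caller has given up and returned.
    auto done = std::make_shared<std::promise<void>>();
    std::future<void> finished = done->get_future();

    std::thread([done, task]() {
        try
        {
            task();
        }
        catch (...)
        {
            // Swallowed on purpose: the caller only learns "finished or not".
        }
        done->set_value();
    }).detach();

    return finished.wait_for(timeout) == std::future_status::ready;
}
```

With the pipeline, the lambda would presumably capture a std::shared_ptr<rs2::pipeline> by value and call stop() on it, so that the object stays alive for as long as the stuck call needs it. After a timeout that pipeline should not be touched again, and a fresh one would be created instead. Whether this avoids the access violations reported for the deferred-thread workaround is untested.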

MartyG-RealSense commented 3 years ago

My understanding is that if the camera disconnects during an active stream then the pipeline can automatically recover and continue without having to restart the pipeline if reconnection occurs within 5 seconds.

I have seen some past cases where an application freezes when the pipeline is closed. A solution that worked for some cases was to start the pipeline and stop the sensor instead of stopping the pipeline. I see that you have already tried this though.

Could you test whether it makes a difference if you use a pipeline.Close() instruction after pipeline.stop() please?

svebert commented 3 years ago

In the C++ API, the pipeline object has no Close() member.

MartyG-RealSense commented 3 years ago

In March 2021 a fix was added to the SDK for an infinite freeze after close problem with T265 that was very similar to yours. The fix was to perform a short sleep period before stopping the pipeline.

#include <chrono>
#include <iostream>
#include <thread>
#include <librealsense2/rs.hpp>

// Reproduces T265 Hang on Exit.
int main(int, char**)
{
    constexpr std::chrono::seconds timeout{ 1 };
    while (true)
    {
        // Start
        rs2::config config;
        rs2::pipeline pipeline;

        std::cout << "Entering pipeline.start()" << std::endl;
        pipeline.start();
        std::cout << "Exiting pipeline.start()" << std::endl;

        // Short pause before stopping, as in the fix described above
        std::cout << "Sleeping for 1 second..." << std::endl;
        std::this_thread::sleep_for(timeout);

        std::cout << "Entering pipeline.stop()" << std::endl;
        pipeline.stop();
        std::cout << "Exiting pipeline.stop()" << std::endl;
    }
    return 0;
}

svebert commented 3 years ago

Hi Marty,

I just tried what you suggested above. Unfortunately, the fourth restart resulted in the described deadlock. Here is my adjusted example code:

#include <iostream>
#include <chrono>
#include <thread>
#include <librealsense2/rs.hpp>

int main(int /*argc*/, char* /*argv*/[ ])
{   
    try
    {

        rs2::context oRealSenseContext;
        int nRestartCounter = 0;
        while (nRestartCounter < 100)
        {
            std::cout << "config..." << std::endl;
            rs2::pipeline oPipeline{ oRealSenseContext };
            rs2::config oConfig;
            oConfig.enable_stream(RS2_STREAM_COLOR);
            oConfig.enable_stream(RS2_STREAM_DEPTH);
            std::cout << "starting pipeline..." << std::endl;
            rs2::pipeline_profile oProfile = oPipeline.start(oConfig);
            std::cout << "...ok" << std::endl;
            while (true)
            {
                bool bHaveError{ false };
                rs2::frameset oFrameset;
                try
                {
                    oFrameset = oPipeline.wait_for_frames(5000U);

                }
                catch (const rs2::error& crRS2Exception)
                {
                    std::cout << "capture loop: rs2 error " << crRS2Exception.what() << std::endl;
                    bHaveError = true;
                }
                catch (const std::exception& croException)
                {
                    std::cout << "capture loop: std error " << croException.what() << std::endl;
                    bHaveError = true;
                }
                catch (...)
                {
                    std::cout << "capture loop: unknown error " << std::endl;
                    bHaveError = true;
                }

                if (bHaveError)
                {
                    break;
                }

                if (oFrameset.size() == 0)
                {
                    std::cout << "capture loop: empty frameset" << std::endl;
                }
                else
                {
                    std::cout << "capture loop: received frameset containing " << oFrameset.size() << " frames " << std::endl;
                }
            }
            std::cout << "stopping pipeline..." << std::endl;
            using namespace std::chrono_literals;
            std::this_thread::sleep_for(1s);
            oPipeline.stop();
            std::cout << "...ok" << std::endl;
        }
    }
    catch (const rs2::error& crRS2Exception)
    {
        std::cout << "rs2 error " << crRS2Exception.what() << std::endl;
        return -1;
    }
    catch (const std::exception& croException)
    {
        std::cout << "Std error " << croException.what() << std::endl;
        return -2;
    }
    catch (...)
    {
        std::cout << "unknown error " << std::endl;
        return -3;
    }

    return 0;
}

MartyG-RealSense commented 3 years ago

There have also been reports this year of a memory leak each time the pipeline is stopped. Conceivably this could lead to a freeze of the program after multiple stops if the computer's available memory gets used up. This issue seemed more likely to occur on camera models with an IMU though (D435i / D455).

Would it be possible to monitor memory usage in the Task Manager interface of Windows (under its Performance tab) and see if available memory reduces significantly after each close?

svebert commented 3 years ago

Hi Marty-G,

I can't observe any memory leak. Occupied RAM is steady. There is a minimal increase in RAM on start() and a minimal decrease on stop().

MartyG-RealSense commented 3 years ago

I have seen some cases where a program fails if a break is included but works fine if there is not a break. Could you test whether it makes a difference if the break in your script is removed:

if (bHaveError)
{
    break;
}

svebert commented 3 years ago

Hi Marty! I tested your suggestion and removed the break.

Without the break I could not observe the deadlock. But that is not a solution for me, because I want to restart the pipeline if an error occurs in wait_for_frames. E.g. if you unplug the camera and re-plug it, the program should resume capturing.

I changed my example slightly so that it breaks out of the loop not on the first Frame didn't arrive within 5000 error but on the third. But this also eventually leads to a deadlock in oPipeline.stop().
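
The "break on the third error" variant can be sketched with a small counter. The class below is a hypothetical helper for illustration, not part of librealsense, and it assumes that the errors must be consecutive, which the description above does not state.

```cpp
// Hypothetical helper (not a librealsense API): counts consecutive failures
// and signals that the capture loop should break once a threshold is reached.
class ErrorStreak
{
public:
    explicit ErrorStreak(int threshold) : m_threshold(threshold), m_count(0) {}

    // Call after a successful wait_for_frames: clears the streak.
    void on_success() { m_count = 0; }

    // Call after a failed wait_for_frames. Returns true when the number of
    // consecutive failures has reached the threshold.
    bool on_error() { return ++m_count >= m_threshold; }

    int count() const { return m_count; }

private:
    int m_threshold;
    int m_count;
};
```

In the capture loop, on_error() would be called in the catch blocks and on_success() after a frameset arrives, with the break taken only when on_error() returns true.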

You mentioned a "Hardware reset" above. How could I trigger a hardware reset before the call of oPipeline.stop()?

MartyG-RealSense commented 3 years ago

The Python discussion in the link below explores testing a connection and initiating a hardware reset if Frame didn't arrive within 5000 occurs because of a freeze.

https://github.com/IntelRealSense/librealsense/issues/8393

svebert commented 3 years ago

Hi Marty,

I did further testing:

If I call the following block

std::cout << "hardware reset..." << std::endl;
auto dev_list = oRealSenseContext.query_devices();
for (auto dev : dev_list)
{
    dev.hardware_reset();
}
using namespace std::chrono_literals;
std::this_thread::sleep_for(2s);
std::cout << "...ok" << std::endl;

before the oPipeline.stop() call, I get the same "deadlock" behaviour. It appears to happen even more often.

I am still testing what happens if I put this block after the oPipeline.stop() call.

MartyG-RealSense commented 3 years ago

Thanks very much @svebert - I look forward to hearing your test results from putting the block after oPipeline.stop()

svebert commented 3 years ago

The test ran overnight. Unfortunately, it froze again in the oPipeline.stop() function. So this does not help.

MartyG-RealSense commented 3 years ago

I looked through your script again. I am not aware of a past situation in which a letter has been placed in the wait_for_frames bracket (5000U). What is the reason for the 'U' please, and what happens if the U is removed?

svebert commented 3 years ago

The U stands for unsigned integer. No "U" is passed into the function at runtime. It is just how the C++ language tells the compiler the data type of the number before the letter. There are other valid suffixes too, like "L" for long.
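
To illustrate the point about literal suffixes, the compile-time checks below show that a suffix changes only the type of the literal, never its value. The constant names are made up for this example.

```cpp
#include <type_traits>

// The suffix selects the literal's type at compile time.
static_assert(std::is_same<decltype(5000),   int>::value,           "no suffix: int");
static_assert(std::is_same<decltype(5000U),  unsigned int>::value,  "U: unsigned int");
static_assert(std::is_same<decltype(5000L),  long>::value,          "L: long");
static_assert(std::is_same<decltype(5000UL), unsigned long>::value, "UL: unsigned long");

// The value is the same with or without the suffix.
constexpr unsigned int kTimeoutWithSuffix = 5000U;
constexpr unsigned int kTimeoutNoSuffix   = 5000;
static_assert(kTimeoutWithSuffix == kTimeoutNoSuffix, "same value either way");
```

So wait_for_frames(5000U) and wait_for_frames(5000) should request the same timeout. Removing the U is not expected to change the behaviour.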

MartyG-RealSense commented 3 years ago

Are you using the official 1 meter USB cable supplied with the camera or a longer cable of your own choice? I noted the mention of 'a fully USB3 compliant cable', which made me wonder about this.

svebert commented 3 years ago

It is a 1 m USB3 cable (not the supplied cable). But we tested a lot of cables, and the cable is not the problem here. In the end, the disconnection is not my problem. My problem is the deadlock in the stop() command. That is (in my opinion) not a hardware-related problem (even though it occurs only randomly on my hardware). There is something fishy in the software. Even if it is a firmware problem (I am not running the newest firmware), this deadlock should not happen. I am OK with my camera being unstable with the "old" firmware and the "weird" PC. I am not OK with the freeze/deadlock, from which I cannot recover without killing my application.

MartyG-RealSense commented 3 years ago

I located a past C++ case on Windows where a camera would become unresponsive and the only way to correct it was to unplug-replug the camera or to completely power off and reboot the PC with the power button (restarts did not work).

https://github.com/IntelRealSense/librealsense/issues/6397

In that case, I provided a link to information for using a Microsoft tool called DevCon to reset the entire USB port on a Windows computer instead of just resetting the camera. Resetting the entire port means that it is not necessary to detect the camera in order to do so.

https://github.com/IntelRealSense/librealsense/issues/6397#issuecomment-629381036

MartyG-RealSense commented 3 years ago

Hi @svebert Do you require further assistance with this case, please? Thanks!

svebert commented 3 years ago

The problem is not resolved yet. I still want to test a variation of sleep + hardware reset before and after the stop().

MartyG-RealSense commented 3 years ago

Okay, thanks for the update. Please do provide your tests results once you have them. Good luck!

shivak7 commented 3 years ago

@svebert I had a similar problem with pipeline.stop() hanging but in the case of corrupted bag files. I found that doing a hardware reset, followed by a pipe.start(cfg) and then immediately pipe.stop() managed to get out of the deadlock (it takes 2-3 seconds to exit probably because of the reset but it seems to work). I'm using this as a temporary workaround.

Not sure if it will work in your case with actual USB hardware but I thought it would be worth a shot.
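
The sequence described in this comment (hardware reset, then start, then an immediate stop) can be sketched as below. Device, Pipeline and Config are template parameters so that the snippet compiles without the SDK. With librealsense they would presumably be rs2::device, rs2::pipeline and rs2::config. The function name and the optional settle pause are assumptions added for illustration and are not part of the original report.

```cpp
#include <chrono>
#include <thread>

// Hypothetical helper (not a librealsense API): reset the device, then start
// the pipeline and stop it again immediately.
template <typename Device, typename Pipeline, typename Config>
void reset_then_cycle(Device& device, Pipeline& pipeline, const Config& config,
                      std::chrono::milliseconds settle = std::chrono::milliseconds(0))
{
    device.hardware_reset();             // same effect as unplugging and re-plugging
    std::this_thread::sleep_for(settle); // optional pause; an assumption, defaults to none
    pipeline.start(config);
    pipeline.stop();
}
```

This was reported to work for corrupted bag files only. Whether it helps with a physical camera that has dropped off the USB bus is exactly the open question in this thread.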

svebert commented 3 years ago

@shivak7: Thank you. I will give it a shot. In the last few days, the reconnection issue has nearly disappeared. Therefore, my program rarely runs into the "Frame did not arrive within 5000" branch and thus rarely calls .stop(). So I can't say whether any of the hardware-reset workarounds help. There were two changes:

MartyG-RealSense commented 3 years ago

Thanks very much for the update @svebert - it's good to hear that your situation has improved!

MartyG-RealSense commented 3 years ago

Hi @svebert Do you require further assistance with this case, please? Thanks!

svebert commented 3 years ago

My problem is solved for now. Unfortunately, the root cause was never found. It was somewhere among the old firmware, a loose contact in the PC power supply and/or a tightly calculated maximum current of the power supply. A Windows update may also have fixed the issue.

MartyG-RealSense commented 3 years ago

Okay, thanks very much @svebert for the update! As your problem is solved for now, I will close the case. Feel free to open a new case if problems re-occur at a future date.