GPS driver occupied with writing results in GPS loss

Stifael commented 5 years ago

Describe the bug When using GPS RTK, the gps-driver injects data from the Base (=ground-unit) to the Rover (=in-air unit). If there is a data burst from the Base, then the gps-driver is occupied with writing to the Rover and no more data is read from the Rover, which results in GPS loss.

To Reproduce Hardware setup with in-air gps device (no base station is required) and able to access System Console.

Clone external_repo
Go to PX4-firmware and clean the build.
```
cd <px4-src-dir>
make clean
```
add PX4_INFO
- here: PX4_INFO("written: %d", written);
- here: PX4_INFO("read: %d", num_read);

Build your target and upload. Include the external_module-repo

make <target>  EXTERNAL_MODULES_LOCATION=<path to external_module> upload

Open System console and turn on the drone: You should see the gps-read print that we added above.
Open mavlink-shell
```
./Tools/mavlink_shell.sh
```
start gps_injeciton_app from the mavlink-shell:
```
nsh> gps_injection_app start
```
The gps_injection_app sends fake gps_inject_data-msg for simulating a data burst. In the System console, you will see that the gps-driver will be occupied with writing without having any read in between. On a real setup, this results in GPS-device loss.

Expected behavior Reading from the gps-device should continue despite writing to the device.

Additional context A workaround that fixed the issue (temporary solution): Read data from the rover for every predefined amount of data written to the rover.

potaito commented 5 years ago

We also identified the exact lines of code that cause the problem:

https://github.com/PX4/Firmware/blob/3d271245a1052ea612b88a4ddf871286356cba94/src/drivers/gps/gps.cpp#L448-L463

When uORB messages come in faster than they can be written to the GPS sensor, the driver will remain stuck in this while-loop and not perform a single read operation during that time.

There are two quick and dirty hotfix options:

We limit the maximum duration that the GPS driver can spend inside the while loop.
We limit the maximum number of messages, i.e. max number of while-loop executions.

I can implement and propose a fix if we can agree on a good solution. We already tested option 2 by modifying the while loop to execute at most 6 times. "6" because one set of injection data can contain multiple messages that belong together. One for each satellite system "GPS", GLONASS, Galileo and Baidu. Unfortunately we cannot inspect the uORB message to determine which messages belong together because of the message structure not providing much information. Everything is hidden in one big binary array:

https://github.com/PX4/Firmware/blob/3d271245a1052ea612b88a4ddf871286356cba94/msg/gps_inject_data.msg#L1-L4

My opinion on both options I mentioned above:

Limit the duration: Provides a more stable performance
Limit the maximum number of consequent messages: Can guarantee, that at least one set of injection data is transmitted completely.

bkueng commented 5 years ago

Thanks for the detailed investigations & report. Limiting the time spent in the loop is a viable approach. I have another one that does not depend on time:

--- a/src/drivers/gps/gps.cpp
+++ b/src/drivers/gps/gps.cpp
@@ -448,6 +448,16 @@ void GPS::handleInjectDataTopic()
            injectData(msg.data, msg.len);

            ++_last_rate_rtcm_injection_count;
+
+#ifdef __PX4_NUTTX
+           unsigned buf_free = 0;
+
+           if (ioctl(_serial_fd, FIONSPACE, (unsigned long)&buf_free) == 0 && buf_free < sizeof(msg.data)) {
+               // Return if we risk having too little remaining buffer left to avoid blocking and delaying reads.
+               break;
+           }
+
+#endif
        }
    } while (updated);
 }

Can you test if that works?

Stifael commented 5 years ago

I tested it and works as expected. The driver will read after every write operation and no gps-loss is detected. One other question: where is the size of the serial buffer defined?

bkueng commented 5 years ago

One other question: where is the size of the serial buffer defined?

That is a NuttX config, depending on the UART that is used, it's one of CONFIG_UART4_TXBUFSIZE. Usually 300.

Stifael commented 5 years ago

ok thanks. For us it is 256, and if that buffer is always around 300, then there will always be one read per every write (msg.data = 180). Let me test a bit more on an actual setup, but looks promising.

bkueng commented 5 years ago

We just need to ensure that we don't drop messages now. I suggest you check that the rate RTCM injection is still the same.

Stifael commented 5 years ago

It seems the testing went well, but I am still not entirely convinced. Let me go through the current logic to make sure that I understand everything correctly:

Lets assume that an entire RTCM 3x message consists of 300 bytes. The mavlink gps-rtcm data-msg can be max 180 bytes, and the uorb-msg gps_inject_data is 182 bytes. Serial buffer is 256. This means that each original RTCM 3x message needs to be split into 2 mavlink-msgs. In order for the rover to have a valid reference data, it needs to receive two sequential uorb msgs. If we now have a data burst, then with the buffer-solution every second uorb-msg is dropped because only one urob-message has room in the buffer. This means that the rover will never receive a valid reference data...

By the way, 300 bytes is just an example. In reality a RTCM full package can many more bytes depending on the amount of satellites available.

bkueng commented 5 years ago

So the reason why this works is that handleInjectDataTopic is called at least with almost 20 Hz (because of this). 300 bytes is a reasonable upper limit, and also depends on the RTCM message type. In practice I generally see an injection rate of around 5 Hz. Is that different in your case?

potaito commented 5 years ago

I have another one that does not depend on time:

--- a/src/drivers/gps/gps.cpp
+++ b/src/drivers/gps/gps.cpp
@@ -448,6 +448,16 @@ void GPS::handleInjectDataTopic()
          injectData(msg.data, msg.len);

          ++_last_rate_rtcm_injection_count;
+
+#ifdef __PX4_NUTTX
+         unsigned buf_free = 0;
+
+         if (ioctl(_serial_fd, FIONSPACE, (unsigned long)&buf_free) == 0 && buf_free < sizeof(msg.data)) {
+             // Return if we risk having too little remaining buffer left to avoid blocking and delaying reads.
+             break;
+         }
+
+#endif
      }
  } while (updated);
 }

Can you test if that works?

I think the only problem with this the location of the new lines of code. With the suggested logic orb_copy() is called, and then the buffer is checked and if it's full we don't write. So with the break statement, the uorb message is lost since the scope is left.

Alternatively we simply move the suggested buffer check between orb_check() and orb_copy() and only do the latter if we are also going to actually write the data. Then no message will be dropped. Unless of course the uorb queue fills up.

Maybe we should open a PR based on your suggestion and discuss the solution there.

bkueng commented 5 years ago

With the suggested logic orb_copy() is called, and then the buffer is checked and if it's full we don't write.

No, since injectData is called directly after the orb_copy, and only then we check the buffer.

Stifael commented 5 years ago

It is now confirmed that the buffer-solution will trigger a gps-loss as well if there is a data-burst present.

see an injection rate of around 5 Hz. Is that different in your case

There are two issues right now that we experience:

1. data burst

It seems that sometimes there is more data injected than usual as if there is a buffer somewhere. We don't have a final conclusion yet which part of the transport-layer causes that buffer.

2. no gps read because of gps rtcm write

This problem can happen based on 1 and is what we are discussing in this issue. If there is no data burst, then it works even with the current implementation, and in fact the current solution is even superior over the buffer-solution because it guarantees that all packages are received. The reason follows below.

Current state: With 16 satellites and 4 measurements a full data package has 5929 bits (~741 bytes). This package has to be received by the rover as a whole (same base timestamp). As I described above, the mavlink-msg is 180 bytes and consequently the message needs to be fragmented. The buffer-solution as well as the time-dependent solution does not take care of that. The mavlink rtcm message contains a flag, which indicates if the data has been split into several mavlink-msgs and gives info about the sequence ID. In the description of that flag, it says the following: Messages are only to be flushed to the GPS when the entire message has been reconstructed on the autopilot

I don't see where the Autopilot is currently reconstructing anything based on that flag. During normal operation (no data-burst, 5Hz), the current implementation with having higher priority set to the write operation ensures that all data will be flushed to the rover and hence the flag can be neglected. That is why the buffer-solution is actually worse than what we have right now because the buffer-solution does not guarantee that the full package is received by the rover. The same applies to time-dependent package drop.

To me the right solution is to flush the gps-device not based on buffer size, but based on the mavlink msg flag.

cody1028 commented 5 years ago

@Stifael @potaito I have an idea. Just limit publish rate of the RTCM mavlink message handler. Because the mavlink parsing buffer is bigger. In mavlink side . It will skip some publish when it get RTCM msg to frequently.

Stifael commented 5 years ago

I need to check, but I thought that the rtcm mavlink msg is sent with 5 Hz. The problem with the data-burst could also be caused by wifi

jiejiezhang commented 5 years ago

I asked Joe and Chenli, they can limit the amount of RTCM data bursting on the DP side due to poor wifi signal. Maybe this is a workaround method, but we still need to find a solution to this reason.

tdnet12434 commented 5 years ago

Hi, i have this problem too. Anyone working on this? Thank!

xdwgood commented 5 years ago

this issue from RTK module..

potaito commented 5 years ago

Hi, i have this problem too. Anyone working on this? Thank!

On our private PX4 fork we are still using a hotfix, where we execute the loop at most 6 times, i.e. read at most 6 messages at once because that always works.

We have to do more testing with @bkueng 's suggested solution https://github.com/PX4/Firmware/issues/11716#issuecomment-476126104

tdnet12434 commented 5 years ago

Hi, i have this problem too. Anyone working on this? Thank!

On our private PX4 fork we are still using a hotfix, where we execute the loop at most 6 times, i.e. read at most 6 messages at once because that always works.

We have to do more testing with @bkueng 's suggested solution #11716 (comment)

Thank you @potaito . Can you give me more information about this hotfix? Where this hotfix is applied? on gps reader or mavlink receiver? Thank you advanced :)

potaito commented 5 years ago

We limited the number of times the following do-while loop can be executed to 6 iterations:

https://github.com/PX4/Firmware/blob/3d271245a1052ea612b88a4ddf871286356cba94/src/drivers/gps/gps.cpp#L448-L463

But it would also be great if you can test @bkueng's proposed solution.

tdnet12434 commented 5 years ago

Thank you for reply. I have tested my version that maybe similar to @bkueng solution. Unfortunately, the GPS still lost when communication is unstable as mentioned by @Stifael. Here is my tested version.

unsigned
GPS::get_free_tx_buf()
{
    /*
     * Check if the OS buffer is full and disable HW
     * flow control if it continues to be full
     */
    int buf_free = 0;

    (void) ioctl(_serial_fd, FIONSPACE, (unsigned long)&buf_free);

    return buf_free;
}

bool GPS::injectData(uint8_t *data, size_t len)
{
    if(len > get_free_tx_buf())
    {

        PX4_WARN("Not enough txspace for gps left %d need %d", get_free_tx_buf(), len);
        return false;
    }
    dumpGpsData(data, len, true);

    size_t written = ::write(_serial_fd, data, len);
    ::fsync(_serial_fd);

    return written == len;
}

However, i will test your hotfix as soon as i can and let you know about result. regards.

tdnet12434 commented 5 years ago

I end up with this solution for now. Just limit do-while loop to 6 as @potaito said and limit writing rate correspond to baudrate this mean we can inject each RTCM msg to GPS module around 15ms at 115200 bps. As far as i test, in normal case, lastest 4-5/6 msgs burst (182*5 bytes) can be sequentially delivered to GPS without GPS loss.

bool updated = false;

    uint8_t quota = 0;
    do {
        orb_check(_orb_inject_data_fd, &updated);

        static hrt_abstime last_inject=hrt_absolute_time();

        if (updated) {

            static unsigned last_msg_len_injected = 182;
            unsigned character_count = last_msg_len_injected;
            static unsigned baudrate = _baudrate == 0 ? 115200 : _baudrate;
            static unsigned sleeptime = character_count * 1000000 / (baudrate / 10);

            if(hrt_elapsed_time(&last_inject) < sleeptime) {
                break;
            }
            last_inject = hrt_absolute_time();

            struct gps_inject_data_s msg;
            orb_copy(ORB_ID(gps_inject_data), _orb_inject_data_fd, &msg);

            if(msg.len > get_free_tx_buf())
            {
                PX4_WARN("Not enough txspace for gps left %d need %d", get_free_tx_buf(), msg.len);
                break;
            }

            last_msg_len_injected = msg.len;

            /* Write the message to the gps device. Note that the message could be fragmented.
             * But as we don't write anywhere else to the device during operation, we don't
             * need to assemble the message first.
             */
            injectData(msg.data, msg.len);

            ++_last_rate_rtcm_injection_count;
            quota++;
        }
    } while (updated && quota < 6);

tdnet12434 commented 5 years ago

I can confirm this work well in my swarm drone system.

potaito commented 5 years ago

@bkueng Should I open a Pull-Request to limit the time (actually, iterations) spent in the while-loop? It is a hacky solution, but still better than GPS timing out because the driver is choking :)

It was working fine for us for the past months as well as for @tdnet12434 it seems.

bkueng commented 5 years ago

Alright, let's do that.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. Thank you for your contributions.

bkueng commented 4 years ago

https://github.com/PX4/Firmware/pull/12710 was merged.

PX4 / PX4-Autopilot

GPS driver occupied with writing results in GPS loss #11716

1. data burst

2. no gps read because of gps rtcm write