Closed Stifael closed 4 years ago
We also identified the exact lines of code that cause the problem:
When uORB messages come in faster than they can be written to the GPS sensor, the driver will remain stuck in this while-loop and not perform a single read operation during that time.
There are two quick and dirty hotfix options:
I can implement and propose a fix if we can agree on a good solution. We already tested option 2 by modifying the while loop to execute at most 6 times. "6" because one set of injection data can contain multiple messages that belong together. One for each satellite system "GPS", GLONASS, Galileo and Baidu. Unfortunately we cannot inspect the uORB message to determine which messages belong together because of the message structure not providing much information. Everything is hidden in one big binary array:
My opinion on both options I mentioned above:
Thanks for the detailed investigations & report. Limiting the time spent in the loop is a viable approach. I have another one that does not depend on time:
--- a/src/drivers/gps/gps.cpp
+++ b/src/drivers/gps/gps.cpp
@@ -448,6 +448,16 @@ void GPS::handleInjectDataTopic()
injectData(msg.data, msg.len);
++_last_rate_rtcm_injection_count;
+
+#ifdef __PX4_NUTTX
+ unsigned buf_free = 0;
+
+ if (ioctl(_serial_fd, FIONSPACE, (unsigned long)&buf_free) == 0 && buf_free < sizeof(msg.data)) {
+ // Return if we risk having too little remaining buffer left to avoid blocking and delaying reads.
+ break;
+ }
+
+#endif
}
} while (updated);
}
Can you test if that works?
I tested it and works as expected. The driver will read after every write operation and no gps-loss is detected. One other question: where is the size of the serial buffer defined?
One other question: where is the size of the serial buffer defined?
That is a NuttX config, depending on the UART that is used, it's one of CONFIG_UART4_TXBUFSIZE
. Usually 300.
ok thanks. For us it is 256, and if that buffer is always around 300, then there will always be one read per every write (msg.data = 180). Let me test a bit more on an actual setup, but looks promising.
We just need to ensure that we don't drop messages now. I suggest you check that the rate RTCM injection
is still the same.
It seems the testing went well, but I am still not entirely convinced. Let me go through the current logic to make sure that I understand everything correctly:
Lets assume that an entire RTCM 3x message consists of 300 bytes. The mavlink gps-rtcm
data-msg can be max 180 bytes, and the uorb-msg gps_inject_data
is 182 bytes. Serial buffer is 256.
This means that each original RTCM 3x message needs to be split into 2 mavlink-msgs. In order for the rover to have a valid reference data, it needs to receive two sequential uorb msgs.
If we now have a data burst, then with the buffer-solution
every second uorb-msg is dropped because only one urob-message has room in the buffer. This means that the rover will never receive a valid reference data...
By the way, 300 bytes is just an example. In reality a RTCM full package can many more bytes depending on the amount of satellites available.
So the reason why this works is that handleInjectDataTopic
is called at least with almost 20 Hz (because of this). 300 bytes is a reasonable upper limit, and also depends on the RTCM message type. In practice I generally see an injection rate of around 5 Hz. Is that different in your case?
I have another one that does not depend on time:
--- a/src/drivers/gps/gps.cpp +++ b/src/drivers/gps/gps.cpp @@ -448,6 +448,16 @@ void GPS::handleInjectDataTopic() injectData(msg.data, msg.len); ++_last_rate_rtcm_injection_count; + +#ifdef __PX4_NUTTX + unsigned buf_free = 0; + + if (ioctl(_serial_fd, FIONSPACE, (unsigned long)&buf_free) == 0 && buf_free < sizeof(msg.data)) { + // Return if we risk having too little remaining buffer left to avoid blocking and delaying reads. + break; + } + +#endif } } while (updated); }
Can you test if that works?
I think the only problem with this the location of the new lines of code. With the suggested logic orb_copy()
is called, and then the buffer is checked and if it's full we don't write. So with the break
statement, the uorb message is lost since the scope is left.
Alternatively we simply move the suggested buffer check between orb_check()
and orb_copy()
and only do the latter if we are also going to actually write the data. Then no message will be dropped. Unless of course the uorb queue fills up.
Maybe we should open a PR based on your suggestion and discuss the solution there.
With the suggested logic orb_copy() is called, and then the buffer is checked and if it's full we don't write.
No, since injectData
is called directly after the orb_copy
, and only then we check the buffer.
It is now confirmed that the buffer-solution
will trigger a gps-loss as well if there is a data-burst present.
see an injection rate of around 5 Hz. Is that different in your case
There are two issues right now that we experience:
It seems that sometimes there is more data injected than usual as if there is a buffer somewhere. We don't have a final conclusion yet which part of the transport-layer causes that buffer.
This problem can happen based on 1
and is what we are discussing in this issue. If there is no data burst, then it works even with the current implementation, and in fact the current solution is even superior over the buffer-solution
because it guarantees that all packages are received. The reason follows below.
Current state:
With 16 satellites and 4 measurements a full data package has 5929 bits (~741 bytes). This package has to be received by the rover as a whole (same base timestamp). As I described above, the mavlink-msg is 180 bytes and consequently the message needs to be fragmented. The buffer-solution
as well as the time-dependent solution
does not take care of that.
The mavlink rtcm message contains a flag, which indicates if the data has been split into several mavlink-msgs and gives info about the sequence ID. In the description of that flag, it says the following:
Messages are only to be flushed to the GPS when the entire message has been reconstructed on the autopilot
I don't see where the Autopilot is currently reconstructing anything based on that flag. During normal operation (no data-burst, 5Hz), the current implementation with having higher priority set to the write
operation ensures that all data will be flushed to the rover and hence the flag can be neglected. That is why the buffer
-solution is actually worse than what we have right now because the buffer-solution does not guarantee that the full package is received by the rover. The same applies to time-dependent package drop.
To me the right solution is to flush the gps-device not based on buffer size, but based on the mavlink msg flag.
@Stifael @potaito I have an idea. Just limit publish rate of the RTCM mavlink message handler. Because the mavlink parsing buffer is bigger. In mavlink side . It will skip some publish when it get RTCM msg to frequently.
I need to check, but I thought that the rtcm mavlink msg is sent with 5 Hz.
The problem with the data-burst
could also be caused by wifi
I asked Joe and Chenli, they can limit the amount of RTCM data bursting on the DP side due to poor wifi signal. Maybe this is a workaround method, but we still need to find a solution to this reason.
Hi, i have this problem too. Anyone working on this? Thank!
this issue from RTK module..
Hi, i have this problem too. Anyone working on this? Thank!
On our private PX4 fork we are still using a hotfix, where we execute the loop at most 6 times, i.e. read at most 6 messages at once because that always works.
We have to do more testing with @bkueng 's suggested solution https://github.com/PX4/Firmware/issues/11716#issuecomment-476126104
Hi, i have this problem too. Anyone working on this? Thank!
On our private PX4 fork we are still using a hotfix, where we execute the loop at most 6 times, i.e. read at most 6 messages at once because that always works.
We have to do more testing with @bkueng 's suggested solution #11716 (comment)
Thank you @potaito . Can you give me more information about this hotfix? Where this hotfix is applied? on gps reader or mavlink receiver? Thank you advanced :)
We limited the number of times the following do-while loop can be executed to 6 iterations:
But it would also be great if you can test @bkueng's proposed solution.
Thank you for reply. I have tested my version that maybe similar to @bkueng solution. Unfortunately, the GPS still lost when communication is unstable as mentioned by @Stifael. Here is my tested version.
unsigned
GPS::get_free_tx_buf()
{
/*
* Check if the OS buffer is full and disable HW
* flow control if it continues to be full
*/
int buf_free = 0;
(void) ioctl(_serial_fd, FIONSPACE, (unsigned long)&buf_free);
return buf_free;
}
bool GPS::injectData(uint8_t *data, size_t len)
{
if(len > get_free_tx_buf())
{
PX4_WARN("Not enough txspace for gps left %d need %d", get_free_tx_buf(), len);
return false;
}
dumpGpsData(data, len, true);
size_t written = ::write(_serial_fd, data, len);
::fsync(_serial_fd);
return written == len;
}
However, i will test your hotfix as soon as i can and let you know about result. regards.
I end up with this solution for now. Just limit do-while loop to 6 as @potaito said and limit writing rate correspond to baudrate this mean we can inject each RTCM msg to GPS module around 15ms at 115200 bps. As far as i test, in normal case, lastest 4-5/6 msgs burst (182*5 bytes) can be sequentially delivered to GPS without GPS loss.
bool updated = false;
uint8_t quota = 0;
do {
orb_check(_orb_inject_data_fd, &updated);
static hrt_abstime last_inject=hrt_absolute_time();
if (updated) {
static unsigned last_msg_len_injected = 182;
unsigned character_count = last_msg_len_injected;
static unsigned baudrate = _baudrate == 0 ? 115200 : _baudrate;
static unsigned sleeptime = character_count * 1000000 / (baudrate / 10);
if(hrt_elapsed_time(&last_inject) < sleeptime) {
break;
}
last_inject = hrt_absolute_time();
struct gps_inject_data_s msg;
orb_copy(ORB_ID(gps_inject_data), _orb_inject_data_fd, &msg);
if(msg.len > get_free_tx_buf())
{
PX4_WARN("Not enough txspace for gps left %d need %d", get_free_tx_buf(), msg.len);
break;
}
last_msg_len_injected = msg.len;
/* Write the message to the gps device. Note that the message could be fragmented.
* But as we don't write anywhere else to the device during operation, we don't
* need to assemble the message first.
*/
injectData(msg.data, msg.len);
++_last_rate_rtcm_injection_count;
quota++;
}
} while (updated && quota < 6);
I can confirm this work well in my swarm drone system.
@bkueng Should I open a Pull-Request to limit the time (actually, iterations) spent in the while-loop? It is a hacky solution, but still better than GPS timing out because the driver is choking :)
It was working fine for us for the past months as well as for @tdnet12434 it seems.
Alright, let's do that.
This issue has been automatically marked as stale because it has not had recent activity. Thank you for your contributions.
https://github.com/PX4/Firmware/pull/12710 was merged.
Describe the bug When using GPS RTK, the gps-driver injects data from the
Base
(=ground-unit) to theRover
(=in-air unit). If there is a data burst from theBase
, then the gps-driver is occupied with writing to theRover
and no more data is read from theRover
, which results in GPS loss.To Reproduce Hardware setup with in-air gps device (no base station is required) and able to access
System Console
.Clone external_repo
Go to PX4-firmware and clean the build.
add
PX4_INFO
PX4_INFO("written: %d", written);
PX4_INFO("read: %d", num_read);
Build your target and upload. Include the
external_module
-repoOpen
System console
and turn on the drone: You should see the gps-read print that we added above.Open mavlink-shell
start
gps_injeciton_app
from themavlink-shell
:The
gps_injection_app
sends fakegps_inject_data
-msg for simulating a data burst. In theSystem console
, you will see that the gps-driver will be occupied with writing without having any read in between. On a real setup, this results in GPS-device loss.Expected behavior Reading from the gps-device should continue despite writing to the device.
Additional context A workaround that fixed the issue (temporary solution): Read data from the rover for every predefined amount of data written to the rover.