Panic with `invalid Gregorian date: Carry` while position-solving

georust / rinex

RINEX and GNSS data processing :artificial_satellite:

Apache License 2.0

63 stars 14 forks source link

Panic with `invalid Gregorian date: Carry` while position-solving #219

Closed jigpu closed 3 months ago

jigpu commented 3 months ago

I've been experimenting with the position solving function of rinex-cli and have bumped into a problem when processing some Rinex files. In particular, I get the following panic from Rust:

thread 'main' panicked at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hifitime-3.9.0/src/epoch.rs:827:14:
invalid Gregorian date: Carry
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: core::result::Result<T,E>::expect
             at /usr/src/debug/rust/rustc-1.76.0-src/library/core/src/result.rs:1030:23
   4: hifitime::epoch::Epoch::from_gregorian_utc
             at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hifitime-3.9.0/src/epoch.rs:826:9
   5: rinex::epoch::parse_in_timescale
             at ./rinex/src/epoch/mod.rs:228:25
   6: rinex::epoch::parse_utc
             at ./rinex/src/epoch/mod.rs:254:5
   7: rinex::observation::record::is_new_epoch
             at ./rinex/src/observation/record.rs:144:13
   8: rinex::record::is_new_epoch
             at ./rinex/src/record.rs:264:34
   9: rinex::record::parse_record
             at ./rinex/src/record.rs:396:29
  10: rinex::Rinex::from_path
             at ./rinex/src/lib.rs:811:34
  11: rinex_cli::user_data_parsing
             at ./rinex-cli/src/main.rs:110:28
  12: rinex_cli::main
             at ./rinex-cli/src/main.rs:156:24
  13: core::ops::function::FnOnce::call_once
             at /usr/src/debug/rust/rustc-1.76.0-src/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Reproduction Steps

Download mcso082.zip (source: https://www.ngs.noaa.gov/UFCORS/)
Unzip to a temporary directory
Run rinex-cli as follows: ./target/debug/rinex-cli -f tmp/mcso0820.24o -f tmp/mcso0820.24n -f tmp/igr23065.sp3 positioning --spp

Notes

I've noticed that the issue seems to be with the very end of the 24o file. If I trim a few records from the end then no crash occurs. I'm particularly suspicious of the line immediately before the splice. I don't know anything about the Rinex format, but that line looks different than anything else in the vicinity. If I remove that single line then no crash occurs. Here's the section in question:

  19998277.560   106939857.756 6      1777.523          41.250    19998286.320
  83175526.625 7      1382.520          43.500
  24055982.500   128728438.271 6      3597.980          41.750    24055991.900
 100122178.588 5      2798.426          36.750
                            4  1
RINEX FILE SPLICE; other post-header comments skipped       COMMENT
 24  3 22  3  0  0.0000000  0 20G05G10G13G15G16G18G23G26G27G29R03R04
                                R05R10R11R12R18R19R20R21
  24800361.320   130327183.699 7     -3590.820          44.500    24800362.400
 101553857.83747     -2798.039          43.250
  22609837.440   118815681.988 8      2764.531          48.000    22609841.580
  92583760.42549      2154.184          51.000
  23145998.820   121633148.168 8     -1449.113          46.000    23146000.000
  94779134.76645     -1129.180          36.750

And an except from a different 24o file with the same problem:

 100267330.027 5     -2642.504          37.000
  20836737.460   111462648.946 6     -2130.270          41.250    20836743.760
  86693279.954 8     -1656.879          45.500
  19998277.560   106939857.756 6      1777.523          41.250    19998286.320
  83175526.625 7      1382.520          43.500
  24055982.500   128728438.271 6      3597.980          41.750    24055991.900
 100122178.588 5      2798.426          36.750
                            4 11
RINEX FILE SPLICE                                           COMMENT
teqc  2019Feb25     NOAA/NOS/NGS/CORS   20240322 04:13:09UTCCOMMENT
Spider V7.8.1.9468                      2024 03 22 04:00    COMMENT
BIT 2 OF LLI FLAGS DATA COLLECTED UNDER A/S CONDITION       COMMENT
SNR is mapped to RINEX snr flag value [1-9]                 COMMENT
LX:    >= 25dBHz -> 1; 26-27dBHz -> 2; 28-31dBHz -> 3       COMMENT
       32-35dBHz -> 4; 36-38dBHz -> 5; 39-41dBHz -> 6       COMMENT
       42-44dBHz -> 7; 45-48dBHz -> 8; >= 49dBHz -> 9       COMMENT
Product                                                     COMMENT
teqc.e edited: all SBAS satellites excluded                 COMMENT
teqc.e edited: all QZSS satellites excluded                 COMMENT
 24  3 22  3  0  0.0000000  0 20G05G10G13G15G16G18G23G26G27G29R03R04
                                R05R10R11R12R18R19R20R21
  24800361.320   130327183.699 7     -3590.820          44.500    24800362.400
 101553857.83747     -2798.039          43.250
  22609837.440   118815681.988 8      2764.531          48.000    22609841.580
  92583760.42549      2154.184          51.000
  23145998.820   121633148.168 8     -1449.113          46.000    23146000.000
  94779134.76645     -1129.180          36.750

gwbres commented 3 months ago

Hello @jigpu, thank you so much for reporting this bug ! I will get back to you before next Wednesday

gwbres commented 3 months ago

Hello @jigpu,

Although I'm not 100% done, I've reached some conclusions on both topics. I will get back to this in the next few days

Debugging

I've taken a quick look and I think I understand what's happening here. I've demonstrated that the special SPLICE comment is correctly handled and does not cause problems. I did that by simply moving the special comment someplace else.

The problem comes from the line starting with " 4" right before the very last Epoch in the .24O file. Our src::observation::record::is_new_epoch determination method might not be robust enough, because it picked it up as a new Epoch candidate and that very content caused Hifitime to overflow.

Now if we look at this file, we can see that the last line of each Epoch is always empty. This is corrupt and does not respect the RINEX definition. " 4" looks like compressed RINEX (so called CRINEX) content. My guess is this file was generated by decompressing a RINEX and the tool that did that, in this very context (number of SV, number of observation..) has generated an empty line up until that point, which we kind of dealt with fine, and did not decompress the N-1 line of the N-1 Epoch, which we cannot deal with.

Another explanation is, this weird line is directly related to the "SPLICE" operation, since it comes right before it. Splice is probably explained in the teqc documentation.

PPP

Now if we move on to the cool stuff.

export RUST_LOG=trace  # verbosity

Note that rinex-cli cannot use Glonass satellite vehicles to resolve positions as of today, as stated in the front page. This means it is pointless to load the .24g Glonass NAV file. Since we're working with a V2 (old) context here, that means we're only left with GPS vehicles. Fortunately we have GPS in the SP3 file.

After removing N-1 from .24O I'm getting a hard time getting any results from the position solver. Now I understand why. I first decided to run some of these commands to become more familiar with your dataset:

./target/release/rinex-cli \
          -f .24o -f .24n -f sp3 \  # load the 3 files at once
          -P GPS \ # previous sidenote
          -g --clk --orbit --obs

The data is very high quality ⭐⭐⭐

All signals have very good power (see the SSI plot). This is most likely the result of stringent criteria at production level.

.24O sampling is fast (5s) I personnaly have never seen that (I'm used to the 30s standard).

The timeframe is narrow, we only have 2 hours of signal observations. That means we can only resolve using the first 2 hours, and the last 20 hours of the navigation data are pointless.

SP3 sampling is 15' which is the standard but it comes very slow compared to .24O.

In this context, the clock state will be determined by interpolating that data contained in the SP3, but 15' sample rate is slow, so you'll get that warning in the logs.

A geodetic marker is present in one of the headers (see the logs), I'll simply use that as the "truth" we should converge to and not provide custom coordinates.

./target/release/rinex-cli \
          -f .24o -f .24n -f sp3 \  # load the 3 files at once
          -P GPS \ # previous sidenote
          -p --spp

All Epochs fail to resolve because all SV were dropped (not enough candidates to resolve a position), without further explnation.

I debugged, and it turns out this line never resolves. This is where we compute the time of signal emission by the satellite.
As of now I have no explanation and will continue debug in the next few days. All I can say is, this is bad luck: it's the only case where we do not generate a debug message, therefore get no reasons why the vehicles is dropped : I will fix that.

jigpu commented 3 months ago

The problem comes from the line starting with " 4" right before the very last Epoch in the .24O file. Our src::observation::record::is_new_epoch determination method might not be robust enough, because it picked it up as a new Epoch candidate and that very content caused Hifitime to overflow.

Now if we look at this file, we can see that the last line of each Epoch is always empty. This is corrupt and does not respect the RINEX definition. " 4" looks like compressed RINEX (so called CRINEX) content. My guess is this file was generated by decompressing a RINEX and the tool that did that, in this very context (number of SV, number of observation..) has generated an empty line up until that point, which we kind of dealt with fine, and did not decompress the N-1 line of the N-1 Epoch, which we cannot deal with.

Another explanation is, this weird line is directly related to the "SPLICE" operation, since it comes right before it. Splice is probably explained in the teqc documentation.

I poked around trying to find / understand the RINEX 2.11 spec, and I get the feeling that the spurious "4 1" line is both intentional and related to the "SPLICE" operation. In particular, see the APPENDEX section of https://files.igs.org/pub/data/format/rinex211.txt, TABLE A7. There are a few records without an epoch, but a flag of "4", just like the files we're dealing with.

TABLE A2 is not super clear, but it seems to indicate that when the epoch flag is between 2 and 5, the following field contains the "number of special records to follow". Jumping back to TABLE A7, in every case where the flag is between 2 and 5, it looks like the number of special records is just the number of additional lines that need to be read. In the case of "4 N", this appears to mean "read N header lines" --- and indeed, this seems like it can be used as a way of signaling that N comment lines follow. I imagine that any other kind of header line could be seen, but comment is certainly one type of header line.

This explanation also nicely fits the splices I've observed. The "4 1" in the downloaded file occurs immediately before a one-line comment. And the "4 11" in the second code block of my report occurs immediately before an 11-line comment.

Now if we move on to the cool stuff.

[ snipped for brevity ]

Neat! I'll need to poke around a little more with some of the poorer-quality data to see what it looks like in comparison!

.24O sampling is fast (5s) I personnaly have never seen that (I'm used to the 30s standard).

The CORS map indicates that most of my local reference stations seem to be sampling at 5 and 15 second intervals. There's even a few that it says run at 1 second, especially up in Washington.

SP3 sampling is 15' which is the standard but it comes very slow compared to .24O.

In this context, the clock state will be determined by interpolating that data contained in the SP3, but 15' sample rate is slow, so you'll get that warning in the logs.

That's an interesting tidbit to keep in mind...

A geodetic marker is present in one of the headers (see the logs), I'll simply use that as the "truth" we should converge to and not provide custom coordinates.

The RINEX files from my phone don't have this data in the header, so I've had to provide custom coordinates. I'm just wondering what the effect of providing "wrong" coordinates is. Is it like something like: your custom coordinates were off by X feet east, so the solution is going to be off by X feet east? Or more like as long as your coordinates are not "close enough" to correct, then we won't be able to resolve a solution?

All Epochs fail to resolve because all SV were dropped (not enough candidates to resolve a position), without further explnation.

I debugged, and it turns out this line never resolves. This is where we compute the time of signal emission by the satellite. As of now I have no explanation and will continue debug in the next few days. All I can say is, this is bad luck: it's the only case where we do not generate a debug message, therefore get no reasons why the vehicles is dropped : I will fix that.

Best of luck with the debugging!

gwbres commented 3 months ago

The CORS map indicates that most of my local reference stations seem to be sampling at 5 and 15 second intervals. There's even a few that it says run at 1 second, especially up in Washington.

This helps reduce the portion of a day you can analyze and still hope for good centimetric results. I can't say for PPP since we don't have that capability yet, but it will obviously converge faster, at the expense of more computation load. You can see that our SPP++ solver consumes 1/4 of a day of high quality data sampled at 30 sec, to converge to cm/s velocity (test_resources/CRNX/V3/ESBCDNK and associated data for that day):

BTW the quality of that data is lower than yours

That's an interesting tidbit to keep in mind...

Yeah, it's quite tedious to have to deal with all these files, part of the problem is the complexity of the task. It also takes a while to truly understand what theyrepresent. This particular problem comes from the fact that typical SP3 files are published with 15' sampling which is too slow for ultra precise clock interpolation. To "answer" that problem, they introduced so called Clock RINEX - that we support now - which provide SV clock states sampled almost at the same time as signal observation. Conclusion: people aiming for ultra precise results usually provide one .24o, one .24n, one SP3 and one CLK file to rtklib

The RINEX files from my phone don't have this data in the header

That is normal because this information comes from a geodetic survey, which is quite a complex task and is far from a "real time" thing. I'm interested in understanding how you managed to have your phone spit out RINEX though :) but that should probably come in another discussion, or even could an new entry in our Wiki ?

I'm just wondering what the effect of providing "wrong" coordinates is. Is it like something like: your custom coordinates were off by X feet east, so the solution is going to be off by X feet east? Or more like as long as your coordinates are not "close enough" to correct, then we won't be able to resolve a solution?

Currently the coordinates you can input are considered the "truth", that means the result of the station (laboratory) geodetic survey and we use that to compare where we converge. As you've probably guessed, this only applies to stationnary and high precision usecases. So currently, our tool allows providing a geodetic survey yourself, or provide a "better" truth position. Providing "wrong" coordinates here will just increase the error we estimate as we do not converge towards that position, it will not cause the solver to panic.

Eventually, our tool should allow a secondary argument to provide the "initial" (user guess) position. This is the T0 initial position fed to the solver. Good PPP solvers like RTKlib allow that.
Our current behavior is to consume a first result as the initial guess. You can see this as the best thing to do without user input. The consequence is slower convergence (several Epochs consumed in that process). For people aiming at both precise and fast convergence (like almost real time PPP): this is bothering. But that scenario requires the initial guess to be good (like a geodetic survey result). Another solution is to pass (x0=0, y0=0, z0=0) center of the Earth as the initial guess, which is totally fine because the accumulator will converge fast from that ""very wrong"" initial guess. This is a possibility that rtklib also gives. The consequence of that, is you get very "wrong" initial phases in the results. Our current behavior, as you can see from the plots, obviously have much better initial phases.

Will keep you updated on both the SPLICE issue and being able to resolve with your context

gwbres commented 3 months ago

@jigpu

If you're willing to update your branch to the latest main content, I just pushed some improvements and fixed the issue you were experiencing. It was simply a unit/scaling problem when we interpolated time from SP3.

As expected from the quality of the signal sampling, the solutions are very good. Most certainly the best PVT solutions we have resolved with this tool so far. So I can only say thank you for sharing, and helping make this tool better. Stay tuned for future improvements (PPP and other simplifications) that will substantially improve the solver.

I have to tweak the settings a little bit, because 2 hours of data (8 SP3 data points), is incompatible with our default interpolation order of 11. I used this command line with this exact configuration:

cat rinex-cli/config/rtk/gpst_10sv_basic.json

{
   "timescale": "GPST",
   "interp_order": 6,
   "max_sv": 10,
   "solver": {
      "gdop_threshold": 3.0
  }, 
  "modeling": {
    "sv_clock_bias": true,
    "sv_total_group_delay": true,
    "relativistic_clock_bias": true,
    "tropo_delay": true,
    "iono_delay": true,
    "earth_rotation": true
  }
}

./target/release/rinex-cli \
       -f /tmp/mcso0820.24o \
       -f /tmp/mcso0820.24n \
       -f /tmp/igr23065.sp3 \
       -P GPS \  # as previously explained
       -p --spp \ # PPP not available yet
       -c rinex-cli/config/rtk/gpst_10sv_basic.json # force to custom settings

newplot(27) newplot(28) newplot(29)

Such a narrow timeframe, yet the solver converges to great results. If you look at my previous post, we reached slightly poorer results by consuming 12 hours of data.

Side note: The position solver section of the wiki pages is in the process of being written, so don't worry about all of this at the moment. I'm in the process of concluding our pending topics (mainly Rinex fops) prior publishing a new release, don't worry about the couple of warnings that introduced by the latest contributions

Note to self: it's a bummer the SV elevation determination does not seem to work well; I really need to fix that. If it did, it will be a very simple yet efficient mean to get even better results.

jigpu commented 3 months ago

Nice! I built the changes locally and can confirm the changes :+1:

I've also retried the 24o from my phone and was able to get a solution there as well. The solution is around 100m off with a whole lot of drift, but I don't suppose I should be too surprised given the source and quality of data :laughing: Its just 10 minutes of my phone sampling once-per-second sitting on an electrical box with half the sky under some degree of canopy. (EDIT: RTKLIB does quite a bit better with its solution of the same files in single-point mode and restricted to just GPS, clustering most of its points within ~10m of the expected spot. Still neat to finally see rinex-cli coming up with its own solution though and know that there are improvements yet to come! :smile: )

plots

BTW, I'm working on a patch to better deal with the "4 1" issue, hopefully will have something to review in the coming days...

I'm interested in understanding how you managed to have your phone spit out RINEX though :) but that should probably come in another discussion, or even could an new entry in our Wiki ?

I'll give you some breadcrumbs to start with :slightly_smiling_face: Google introduced a GnssMeasurement API in Android 7, making support for the API optional (depending on hardware) through Android 9. Beginning with Android 10 hardware support for the API was supposed to be mandatory. While I'm not aware of any apps that actually use the API for anything "normal", Google has released a GnssLogger App which can be used to dump the data in either a Google-specific CSV format or RINEX. (The GPSTest app is also capable of dumping measurements as CSV for conversion to RINEX, with an open issue to add native RINEX support).

Google has more info at https://developer.android.com/develop/sensors-and-location/sensors/gnss for anyone interested.

gwbres commented 3 months ago

This is so cool!

BTW, I'm working on a patch to better deal with the "4 1" issue, hopefully will have something to review in the coming days...

great to hear ! I'd be glad to review it and help you out.

EDIT: RTKLIB does quite a bit better with its solution of the same files in single-point mode and restricted to just GPS, clustering most of its points within ~10m of the expected spot

very interesting to hear. I'm not surprized. I presume RTKLib solely works on the so called "PPP" signal combination which is supposed to give a huge improvement almost "right away". I still have not introduced it yet, but in the end, I think pushing our SPP solver to the limit, with now the LSQ filter which is very close to a Kalman filter, was the best strategy. As you can see, we're still figuring a couple of scenarios that did not work, it helped stabilize the whole thing. Before introducing the combination, I should really make the effort to have the elevation and SNR calculations corrected. The elevation we estimate are sometimes totally wrong, it prevents taking this criteria into account (at least correctly).

Google has more info at https://developer.android.com/develop/sensors-and-location/sensors/gnss for anyone interested.

Thank you I'll look into this. Maybe in the end we could have an entry about this topic in our wiki pages

gwbres commented 1 month ago

Hello @jigpu,

a quick update on what's new and rerunning tests on your data. apriori position is no longer required: we can perform surveys from scratch. apriori knowledge of the final solution simply enables the comparison plots. Refer to the RINEX Wiki for more information

Side note: SP3 and/or CLK RINEX can now be omitted in the solving process. We call this "BRDC" for Radio based navigation. Is is most likely incompatible with centimetric positioning, but we'll later see how we perform. It allows surveying with BeiDou and QZSS constellations from now on.

Rerunning on the data attached to this post with latest rinex-cli:

./target/release/rinex-cli \
  -f WORKSPACE/mcso0820.24o \
  -f WORKSPACE/mcso0820.24n \
  -f WORKSPACE/igr23065.sp3 \
  -P GPS -p -c /rinex-cli/config/rtk/gpst_cpp_kf.json

Solution	x(err)	y(err)	z(err)
Final (15' average)	6m	6m	69mm

We're limited to GPS NAV. CPP strategy (best strategy as of today) can deploy since C1 (L1 civilian) and P2 (L2 precise) signals were sampled. With interpolation order 17 I have a little less than 15' of averaging, and achieve good results. Most likely due to the L2 precise signal, I don't think I have ever run a strategy that involved a precise signal yet. This setup is on par with Galileo performances. It's too bad we only have 2 hours of data, I can only imagine what it be with more observations.

For comparison, surveying of ESBJERG station with Galileo and 24h of observation:

./target/release/rinex-cli \
   -f test_resources/CRNX/V3/ESBC00DNK_R_20201770000_01D_30S_MO.crx.gz \ 
   -f test_resources/NAV/V3/ESBC00DNK_R_20201770000_01D_MN.rnx.gz \
   -f test_resources/CLK/V3/GRG0MGXFIN_20201770000_01D_30S_CLK.CLK.gz \
   -f test_resources/SP3/GRG0MGXFIN_20201770000_01D_15M_ORB.SP3.gz \
   -P GAL -P C1C,C5Q -p -c-c rinex-cli/config/rtk/gpst_cpp_kf.json

Solution	x(err)	y(err)	z(err)
Final (20h average)	-260mm	455mm	163mm