geospace-code / georinex

Python RINEX 2 / 3 NAV / OBS / sp3 reader & batch convert to HDF5 with C-like speed
MIT License

Proposal: Speed up data storing processing in xarray #23

Closed VOMAY closed 1 year ago

VOMAY commented 6 years ago

Dear scivision,

I ran a profiler on georinex (why did you change the name?) while reading a 24h@30s Multi-GNSS RINEX 3 (LEICA GR25) file and found that xarray.merge is responsible for >90% of the time consumption. This is not a surprise, but it is annoying nonetheless. I checked the code and saw that you use xarray.concat instead of xarray.merge for single-constellation cases. So I made a test:

#Galileo only: 57.44 sec
e=gr.readrinex('C:/Users/Volker Mayer/Desktop/ABMFxxGLP_R_20181330000_01D_30S_MO.rnx', use=['E'])
#GPS only: 59.79 sec
g=gr.readrinex('C:/Users/Volker Mayer/Desktop/ABMFxxGLP_R_20181330000_01D_30S_MO.rnx', use=['G'])
#GPS+Galileo: 2.89 min
eg=gr.readrinex('C:/Users/Volker Mayer/Desktop/ABMFxxGLP_R_20181330000_01D_30S_MO.rnx', use=['E', 'G'])

As you can see, reading GPS+Galileo at once takes about 50% longer than doing them one after another. This shows that concat is substantially quicker than merge. So here is my proposal:

Create one xarray dataset for each constellation and merge them only once at the very end. I am optimistic that this will save at least 50% of the time. Probably even quicker would be to use one xarray dataset per satellite and, again, merge them only once!
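
A minimal sketch of the idea from the caller's side (this only approximates the internal change, reusing the gr.readrinex call from above):

import georinex as gr
import xarray as xr

# read each constellation separately (the fast concat path), then merge once
fn = 'C:/Users/Volker Mayer/Desktop/ABMFxxGLP_R_20181330000_01D_30S_MO.rnx'
per_system = [gr.readrinex(fn, use=[s]) for s in ('E', 'G')]
eg = xr.merge(per_system)  # the only merge call, at the very end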

Best Regards Volker

scivision commented 6 years ago

Yes, I think this is an excellent solution. I knew merge was taking a long time but hadn't yet looked into a solution as you did.

I changed the name to avoid possible (if unlikely) trademark issues. Also, this toolbox is readily accessible via Matlab, although I haven't added examples yet, so I don't want to make people think this is just for Python users.

VOMAY commented 6 years ago

Dear scivision,

I put some more thought into how the data storing could be improved. Currently, reading a complete Multi-GNSS RINEX file with 24h@30s takes more than 5 minutes with georinex, using all 4 cores of my i7-7800. Our software NAPEOS (FORTRAN) needs 2 seconds to do this on only 1 core. Compared to this difference, my original proposal will only give a minor speed boost.

The main issue I see with georinex is that you are storing every single epoch in an xarray dataset. xarray is very powerful, but writing to it is comparatively slow. Neither xarray.concat nor xarray.merge should be used more often than necessary; in my humble opinion they should not be used here at all. I propose to use an intermediate data structure to store all the data first, and then convert this structure into an xarray all at once.

In NAPEOS we use linked lists: we allocate an array with one linked list per satellite. These linked lists are then appended to for every epoch. Every element of a linked list holds the time epoch and the respective signals as listed in the header. When reading, we assume that every new epoch comes later in time than the previous one and can simply be appended to the end of the list. This gives a huge speed advantage. (It actually checks that it is the last epoch and finds the correct location in case it is not, but I have never seen this logic triggered.) Python offers very efficient list and linked-list functionality, which could do the same thing, potentially even better than Fortran. Unfortunately, I am not an expert here, but this is the input I got. The allocation of the array holding the linked lists is easy: just allocate PRNs 1-35 for every constellation; it is highly unlikely that any constellation will ever have more satellites. Alternatively, you may want to just use a list here as well, which you append every time a new satellite appears. Python should offer simple ways to implement this. More tricky is the conversion of the array of linked lists to the output xarray dataset, but that should also be doable.
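
In Python terms, a rough sketch of such an intermediate structure (illustrative values, not the NAPEOS code) could be:

from collections import defaultdict

# one plain list per satellite, appended per epoch; converted to xarray only once at the end
records = defaultdict(list)   # sv -> [(epoch, {obs_type: value}), ...]

# inside the epoch-reading loop, each record is a cheap append:
records['G05'].append(('2018-05-13T00:00:00', {'C1C': 20891220.480}))
records['G05'].append(('2018-05-13T00:00:30', {'C1C': 20891221.113}))

# after the whole file is read, build the xarray.Dataset from records in one step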

I understand that this might end up being quite a fundamental change in your software, but all users will thank you for saving them time. Just contact me in case you have any questions.

Best Regards Volker

scivision commented 6 years ago

Thanks Volker. Line profiling shows that the per-epoch time conversion is penalized almost as much (within a factor of 2) by the Numpy text => ndarray conversion as by xarray.concat.

I suspect Numpy can be very much more efficient if the text is converted all at once, or at least in larger blocks--this "should" take place at C-like speeds when Numpy is given a big block of text to crunch. So truly speeding this up may get challenging for multi-system files. Also, numpy.genfromtxt doesn't seem to allow usecols and delimiter at the same time, losing another tool to speed reads.
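
For example, a big block of fixed-width observation lines can be handed to Numpy in one call instead of line-by-line; the field widths and values below are only illustrative:

import io
import numpy as np

block = io.StringIO(
    "    20891220.480   109779280.181          48.350    20891223.028\n"
    "    23284159.648   122353851.049          40.750    23284162.129\n")

# delimiter given as a sequence of field widths: one vectorized text -> float conversion
data = np.genfromtxt(block, delimiter=(16, 16, 16, 16))
print(data.shape)  # (2, 4)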

When I think about why I created this package, it was because even some leading scientists were processing the RINEX ASCII over and over, taking enormous amounts of time. The idea of this program was to do one-time conversions to NetCDF4/HDF5 and then work with that data. When I showed them the enormous speed boost of working with hemisphere-wide GNSS networks over days of time, they were sold, and they can do much better science now with a one-time HDF5 conversion of RINEX.

So yes, having a 2-core-second conversion of a 24-hour file vs. 20 core-minutes for the same file matters in general, even for a one-time conversion, when you have millions of files to convert and that number is growing rapidly.

Should the idea instead be to add HDF5 output to NAPEOS if not already there? The idea being to be able to do tomographic processing of data, scaling from thousands to millions of receivers at a time (out of core processing, done from the HDF5 files).

To speed this up a lot more (beyond 10's of percent here and there) may require more advanced measures like Cython.

VOMAY commented 6 years ago

I have the feeling we are not yet talking about the same thing. I'll try to clarify:

In your last answer you mentioned the use of numpy.genfromtxt. In fact I am skeptical that this is the optimal method to read the data. However, there is a reason why I did not bring it up yet: my profiler shows that numpy.genfromtxt is not where georinex loses the most time. For large RINEX files (multiple MB) like the ones I am using, more than 90% of the time is lost in concat/merge. So I focused on the latter.

In general, I think the use of NetCDF4/HDF5 for GNSS data is a very good idea. The output of georinex is very easy and quick to access. However, I think the way you are creating it, by running concat/merge every iteration, is not optimal. Above you said:

The idea of this program was to do one-time conversions to NetCDF4/HDF5...

This is the right thing to do and it is what the georinex output looks like, but it is not the way georinex works internally. In fact, you are doing an epoch-by-epoch conversion to NetCDF4/HDF5. A simpler intermediate data structure (e.g. a list) could be built much more quickly and would allow a real "one-time conversion to NetCDF4/HDF5".

scivision commented 6 years ago

Do you have a public example of a large RINEX file you'd like to benchmark? Based on TimeRinex.py, which is essentially this function, I think it can be fast to:

  1. quick read RINEX 3 OBS files for SV & time only (using list)
  2. allocate the empty xarray.Dataset, sizing by time and SV read in step 1
  3. populate xarray.Dataset with data as the file is fully read (should be fast--no concat or merge, just indexing)

This could be even faster and significantly cleaner/simpler to implement than the linked lists.
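
A rough sketch of steps 2 and 3 with illustrative epochs and SVs (not the actual georinex code):

import numpy as np
import xarray as xr

# step 2: allocate the Dataset once, NaN-filled, sized by time and SV from the quick scan
times = np.array(['2018-05-13T00:00:00', '2018-05-13T00:00:30'], dtype='datetime64[ns]')
svs = ['E01', 'E02', 'G05']
obs_types = ['C1C', 'L1C']
ds = xr.Dataset({ot: (('time', 'sv'), np.full((len(times), len(svs)), np.nan)) for ot in obs_types},
                coords={'time': times, 'sv': svs})

# step 3: the full read then fills values by plain label indexing -- no concat or merge per epoch
ds['C1C'].loc[{'time': times[0], 'sv': 'G05'}] = 20891220.480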

By one-time I actually meant off-line, that is, convert say 10,000 RINEX files before the human actually works with them.

I don't care as much about RINEX 2 files, as our large experimental data is all RINEX 3, so RINEX 2 is only for legacy data we didn't bother converting previously.

VOMAY commented 6 years ago

Dear scivision,

About public RINEX data: most public RINEX data is shared via the IGS, and so is ours. The IGS has multiple archive mirrors, such as CDDIS and ftp://gssc.esa.int/ .

This command will get you the data of one of our best stations (Cebreros, Spain). Note: these files are all Hatanaka-compressed and gzipped.

wget ftp://gssc.esa.int/gnss/data/daily/2018/200/CEBR00ESP_R_20182000000_01D_30S_MO.crx.gz

Like I said before, reading this file completely with NAPEOS takes a few seconds. georinex takes 1.6 minutes to read GPS alone.

About RINEX 2: RINEX 2 is (unfortunately) still very popular in the GNSS community and this is not going to change any time soon. I recommend maintaining full support for it in the future as well.

About your code: yes, the way you do it with timedat looks much better. Whether you use a native Python list or another kind of (linked) list plays a secondary role; it all comes down to proper allocation of the xarray.Dataset. I already process thousands of RINEX files (several gigabytes) on a daily basis, so I know how crucial speed is here. If you are interested in what we do here at ESA (http://navigation-office.esa.int/), I am happy to tell you about it some time.

scivision commented 6 years ago

OK, check out the latest release v1.6.1. It is more than 10x faster overall: a Numpy array is preallocated, and the xarray.Dataset is created only at the very end, after the big loop is done.
An OBS v2 file that took one minute to read now takes 5 seconds. I haven't done anything for OBS v3 yet in this regard.
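
A simplified sketch of that pattern (illustrative dimensions and names, not the actual v1.6.1 internals):

import numpy as np
import xarray as xr

n_epochs, n_sv = 2880, 32                 # e.g. 24 h @ 30 s, up to 32 SV of one system
obs_types = ['C1', 'L1', 'P2']
data = np.full((len(obs_types), n_epochs, n_sv), np.nan)
times = np.full(n_epochs, np.datetime64('NaT'), dtype='datetime64[ns]')

# ... the big loop writes data[k, i, j] and times[i] by plain indexing ...

# only after the loop is the Dataset built, once
ds = xr.Dataset({ot: (('time', 'sv'), data[k]) for k, ot in enumerate(obs_types)},
                coords={'time': times, 'sv': [f'G{i:02d}' for i in range(1, n_sv + 1)]})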

VOMAY commented 6 years ago

Hey, I am glad to see that you could make some use of my advice. I can confirm the significant speed boost of the software. Good Job!

As discussed, it is all about using an intermediate data structure to avoid any concat or merge. I see you decided to go straight for a preallocated numpy array, without a list. Judging by your comments in the code, I understand that you are already aware of the issues with the allocation:

  1. The maximum number of satellites per constellation is hardcoded to 32. If you decide to keep this hardcoded value, you should increase it to 35 to guarantee full BeiDou support.
  2. By using obstime2(fn), you actually read every RINEX file twice.

Both issues can be avoided if you put the data into a (linked) list first and allocate the numpy array based on counters.

Thanks for this update. You are saving all users a lot of time!

scivision commented 6 years ago

Yes, I felt that a fully-allocated Numpy array was simpler to debug, and that lists probably would not be faster, although lists would be somewhat better for memory. Since the problem is constrained to a reasonable number of satellites per system (~35), not wildly varying like a web application where an array may need to have length 10 one instant and length 1 million the next, the dense array is probably OK for now. At least I would want to implement this for RINEX 3 OBS before examining it further--this was just a first pass. I first tried a preallocated xarray; that gave a 2x speedup. Going to Numpy as the intermediate array gave the 10x overall speedup.

With regard to obstime2, it takes only 10 milliseconds or so, and the operating system caches the file, so it's not a significant penalty--less than 1%.

VOMAY commented 6 years ago

In the latest version obstime2 takes much longer than 10 ms: in my case it is 600 ms (will1190.zip, 24h@30s). This is now significant, since the total processing time has dropped to 4.5 s.

Besides, I am a bit confused by the new _skip function. According to my profiler it is being called 28800 times (700 ms), which is 10x the number of epochs in the RINEX file. To be honest, I have not yet fully understood every step of the latest implementation, but I have the feeling this could be optimized.

Talk to you soon!

scivision commented 6 years ago

I had to read and discard using _skip because seeking doesn't work on TextIO streams (used for seamless streaming of ZIP/Z/GZ/Hatanaka); it seems you just have to read and discard. I did not micro-optimize the skipping.

Yes, I am still thinking about using lists during the loop instead of the preallocation that necessitates reading the file twice, since that is now expensive, as you note.

scivision commented 6 years ago

Preallocation without reading the file twice is enabled in v1.6.6. This gives about a 15% speedup, and it may well be faster than linked lists because the array is allocated once, somewhat larger than needed, based on the file size and the assumption that on average no fewer than 6 SV are visible.
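
Roughly, the over-allocation could look like this (an illustrative heuristic, not the exact v1.6.6 logic):

# bound the number of epochs from the file size, assuming on average
# at least 6 SV visible per epoch and about 80 characters per record line
file_size = 45_000_000          # bytes, e.g. from os.path.getsize(fn)
min_sv_per_epoch = 6
approx_line_length = 80

max_epochs = file_size // (min_sv_per_epoch * approx_line_length)
# allocate (max_epochs, n_sv) arrays once, then trim the unused rows after the loop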

Setting gr.load(..., fast=False) disables this option (reads file twice, exact preallocation).

There are measures that can be added for extremely large files (e.g. 50 Hz) that would make the preallocation smarter (reduce RAM usage) if this becomes an issue.

VOMAY commented 6 years ago

Thanks again for focusing on the speed boosting. Your smart allocation concept is interesting.

scivision commented 6 years ago

Great: v1.6.7 attempts to reduce RAM usage by reading a few dozen lines of data to better estimate the needed RAM. If many measurement types (L1, C1, etc.) are in the file, that means fewer time epochs for a given file size--this is determined simply from the header. However, if lines are truncated (< 80 characters), this could throw off the estimate, so it progressively rolls back to a more conservative estimate if it detects line truncation.

It will also automatically fall back to fast=False if it can't estimate properly (mangled, non-standard file).

VOMAY commented 5 years ago

Hey,

Is there a reason why you only implemented this for RINEX 2 and not for RINEX 3?

scivision commented 5 years ago

@VOMAY just a lack of time to do so, waiting for someone to ask :+1:

anand-satnav commented 5 years ago

I am bumping a really old issue here, but as @VOMAY mentioned, the reading speeds are still relatively slow, and this thread highlights the changes incorporated to speed up RINEX 2.

Here is my improvement for RINEX 3 (it achieves the same speedup you observed for RINEX 2):

https://github.com/anand-satnav/georinex/tree/citius

I cannot make a pull request since a few of the things I changed break compatibility. I mainly did it for personal use in my research.

A measurement must be unique to a satellite system: a GPS L1C measurement is not the same as a GLONASS L1C measurement, and the user may want to separate the two data sets for independent analysis and processing. I could go into further detail, but I think it deserves a separate thread. Also, indexing/re-indexing operations are faster on integers than on strings, but that will only make a difference for very large datasets (>1 GiB).

I chose to defer the Dataset creation and merge operations to the very end rather than merging while iterating the file. This is a huge time saving, as each xarray.Dataset merge operation is quite expensive and you are better off merging a complete block of measurements rather than merging each measurement epoch-wise.
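
Conceptually (an illustrative sketch, not the actual citius code), that deferral looks like:

import numpy as np
import xarray as xr

# gather each measurement as a complete (time x sv) block during the file scan;
# the tiny hand-made blocks below stand in for what the scan would produce
times = np.array(['2018-05-13T00:00:00', '2018-05-13T00:00:30'], dtype='datetime64[ns]')
blocks = {'E-C1C': {'sv': ['E01', 'E02'], 'data': np.full((2, 2), np.nan)},
          'G-C1C': {'sv': ['G05'], 'data': np.full((2, 1), np.nan)}}

datasets = [xr.Dataset({name: (('time', 'sv'), b['data'])},
                       coords={'time': times, 'sv': b['sv']})
            for name, b in blocks.items()]

obs = xr.merge(datasets)   # a single merge over complete blocks, not one per epoch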

I can have another discussion about further enhancements, improvements and new features if you are interested. I believe this software has the potential to do a lot more than it currently does, and I would be happy to help.

VOMAY commented 5 years ago

Thanks @anand-satnav for your input. I cannot speak for @scivision, but I appreciate it a lot. I know a lot of people who would like to use this tool but are waiting for a quicker RINEX 3 implementation. I am in close contact with the IGS infrastructure working group and was told RINEX 2 will not be around for much longer.

anand-satnav commented 5 years ago

@VOMAY good to hear that it will be useful. Give my variant a try and let me know if it works for you.

VOMAY commented 5 years ago

@anand-satnav I just tested citius. The speed of the RINEX 3 reader finally deserves to be called fast. I'd love to see this in an official release! Our compiled FORTRAN reader, which has been optimized for decades, still takes only half the time, but it doesn't return such a handy data format.

I dislike how you handle the data variable names, though. Why would you use a hyphen in a variable name? Calling obs.E-C1C returns an error. Of course obs['E-C1C'] is possible, but it is confusing and not user-friendly.

>>> obs.E-C1C
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xxx/opt/anaconda3/envs/ana/lib/python3.7/site-packages/xarray/core/common.py", line 179, in __getattr__
    (type(self).__name__, name))
AttributeError: 'Dataset' object has no attribute 'E'

True, C1C for GPS, GAL and GLO are all different signals. But splitting them in georinex won't reduce that confusion; it only adds more data variable names.

Splitting the PRN into the constellation letter and the integer value and storing them separately is an option; I have seen it in various GNSS software packages. Storing the PRNs as integers is helpful for filtering them with mathematical operations, e.g. obs.sv < 10. But why would you want to do that? From everything we know, the PRN numbers are randomly assigned to the satellites.

There is one thing I miss, which neither of your implementations offers: a compact way to slice the dataset by constellation. The readme recommends this:

Eind = obs.sv.to_index().str.startswith('E')  # returns a simple Numpy Boolean 1-D array
Edata = obs.isel(sv=Eind)  # any combination of other indices at same time or before/after also possible

This works of course, but the command is too long for such a simple task. Can we have an easy method to do this?
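
For illustration, a possible convenience wrapper around that readme pattern (the helper is hypothetical, not something georinex provides):

def by_system(obs, system: str):
    """Select all SVs of one constellation, e.g. 'E' for Galileo."""
    ind = obs.sv.to_index().str.startswith(system)
    return obs.isel(sv=ind)

# usage: Edata = by_system(obs, 'E')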

anand-satnav commented 5 years ago

@VOMAY Great to hear you tried it out and are happy with the speed.

Hmm, splitting the constellation letter from the SV number is an idea I can try. Maybe add it as another coordinate. That could solve both your problems: slicing the dataset by constellation and the growing number of data variable names. I will implement, benchmark and update you.

VOMAY commented 5 years ago

I looked into my orbit comparison plotting tool, which is organized as a pandas DataFrame. It is multi-indexed as follows: [('Epoch', datetime), ('P', '<U1'), ('RN', np.int)]. I like the field names for their simplicity, but they are not suited for a general tool. Maybe 'System' for the constellation and 'sv' or 'prn' for the number. There is no perfect solution here; I can also live with having the constellation letter in both coordinates. Have fun finding a better solution than the one in my orbit comparison tool 👍 Thanks for the effort.

VOMAY commented 5 years ago

@VOMAY just a lack of time to do so, waiting for someone to ask :+1:

@scivision May I ask you again to look into this? I think @anand-satnav has already made great progress here.

scivision commented 4 years ago

I see anand is still working on the citius branch. I may have time at year-end to look at this and at least make it an optional module, e.g. import georinex.obs3_fast. I think that would be an OK starting approach. Thanks for this!

VOMAY commented 4 years ago

Dear @anand-satnav, dear @scivision, thank you both for staying on this. I just tested the latest version and it works nicely and quickly. However, I still dislike the way the number of data variables is increased by splitting them per constellation. In July we discussed an alternative data organization here; I'd like to encourage you to consider it again.

Best Regards

anand-satnav commented 4 years ago

@scivision I really didn't write the citius branch to be compatible with the existing obs3. I have broken a lot of functionality; I just needed it for my own use and would need to repair all that. I think it's better to branch out a new one from the existing obs3 and implement my changes there, but I'm not sure. For performance, the basic idea is to defer the creation of arrays (numpy, xarray, etc.) until the end. Use native Python data structures (lists, dicts) and it should do fine; there is no need to allocate arrays and update them. You can speed up obs2 with this approach too.

@VOMAY Hey, just as I mentioned above, the changes I made kind of work for me and I have moved on with the job. I will surely look at reducing the variables when I get some time. In the meantime, I am sure you can pitch in too 👍

scivision commented 4 years ago

Yes, after refactoring obs2 previously I also saw that sticking with lists and dicts until closer to the end generally gave noticeably better performance.

I think I originally came into the task biased by earlier work with large homogeneous datasets that are a natural fit for ndarray, whereas RINEX processing in Python is heterogeneous enough in some senses to be faster and better memory-wise in native Python datatypes.

Thanks!

VOMAY commented 4 years ago

What do you guys think of this: https://github.com/scivision/georinex/pull/59