ScaleWeather / eccodes

Unofficial high-level Rust bindings of the latest ecCodes release.
Apache License 2.0
7 stars 6 forks source link

Read from index #1

Closed asura6 closed 8 months ago

asura6 commented 1 year ago

Is your feature request related to a problem? Please describe. As GRIB-files are binary dumps of message-sections index-files are used to encode the precalculated binary location of messages matching a set of keys. Some GRIB-files contains thousands of messages and to facilitate fast reading of messages index-files are delivered with the files and use to query the binary location of a message and receive handles. Loading GRIB-files by reading an index-file is required for performance in certain situations.

Describe the solution you'd like I would like safe rust bindings for reading from GRIB index-files.

Additional context I would like to use this crate to eventually migrate from C++ for working with GRIB and this is a required feature. I wouldn't mind trying to implement this feature myself but I'm still inexperienced working with rust and starting off with unsafe rust seems a bit daunting.

Index-files are supported by the eccodes C-api. It's also possible to create an index file quite simply with eccodes CLI-tools to get something to experiment with.

I believe this is the C-interface for reading from an existing index https://github.com/onyb/eccodes/blob/master/src/eccodes.h#L242

codes_index* codes_index_read(codes_context* c,const char* filename,int *err);

The workflow is to index a file using a set of predetermined keys, such as ["shortName","typeOfLevel","level","stepType"]. You then read the index-file, selecting a value for each of the keys the index was created with, e.g. selecting

codes_index_select_string(index,"shortName","t");
codes_index_select_string(index,"typeOfLevel","heightAboveGround");
codes_index_select_string(index,"level",2);
codes_index_select_string(index,"stepType","instantaneous");

and then retreiving the grib-handle from the index using codes_handle_new_from_index

Quba1 commented 1 year ago

Hi! Thank you for your feature request. It's good to see your interest in this crate.

I haven't implemented originally reading from index-files because I didn't have any experience with that. But it sounds from your description that they are a very useful feature. So it sounds like a good idea to add that feature to this crate.

I wouldn't mind trying to implement this feature myself but I'm still inexperienced working with rust and starting off with unsafe rust seems a bit daunting.

If you don't mind trying to create a PR for this feature, please feel free to try! I certainly won't be judging you by your coding skills. In fact, this crate was one of the first things I wrote in Rust, so I didn't have experience with (unsafe) Rust at the time of creating it as well. It sounds like you, having experience with C++ and GRIB index-files, are much better equipped to implement that feature than me.

To give you a brief overview. There are three levels of eccodes bindings used in this crate.

Feel free to implement the new feature however you like, if you find that your solution works better than mine. The way I would probably implement reading index-files is:

1) "Rustify" necessary eccodes functions in intermediate_bindings module (feel free to add a new submodule if you find that helpful) 2) Create a new module with new type CodesIndex with constructor etc. 3) Implement TryFrom<CodesIndex> for CodesHandle for converting into CodesHandle

But again, feel free to implement this in your preferred way.

Also, don't worry about the documentation - I will not mind writing docs myself if you implement this feature.

Thanks again for your interest in this crate!

asura6 commented 1 year ago

Thank you for the guidance. I'll give it a try!

Quba1 commented 1 year ago

(moving discussion from PR to here)

@asura6 wrote:

I've had a go at adding rudimentary index support. This is the first time I'm using rust except for writing hello world so please take a closer look and feel free to adjust to your liking!

A CodesIndex type can now be constructed from a preexisting index-file. I've added one used in tests in the data-directory for one of the icelandic grib-file examples.

The CodesIndex has a Select trait that is used to select the index-keys. The underlying C-api has different select-functions depending on the value type. I implemented the select-trait to call the corresponding one depending on the value type in the function call instead.

The CodesIndex can convert into a CodesHandle, but the CodesHandle type makes assumptions about being created from files. I've done my best to work around this. Even if an index-file doesn't already exist, creating one in memory is a nice and performant alternative to iterating through messages. I have however not implemented any of these features.

Due to the CodesHandle not being backed by file-data I also implemented a conversion from a GribHandle to a KeyedMessage without using an iterator since it won't work without an actual resource backing the GribHandle.

Quba1 commented 1 year ago

I have merged your changes into a working branch through #3 so we can more easily collaborate on the code. But to keep discussion in one place I suggest discussing here.

I will post my comments to the PR in the next message.

Quba1 commented 1 year ago

Thank you @asura6 for your contribution!

For future, please try to keep commits more atomic so that one commit concerns only one topic. For example, extending unsafe block in CodesHandle drop would be great to have in a separate commit - but it is a great improvement nonetheless.

My main concern is that I seem to be missing why would CodesIndex (more specifically index_handle) own GRIB data (as stated in TryFrom<CodesIndex>)? In theCodesIndex` constructor only a path to an index-file is provided, and Eccodes does not give really any guarantees about anything. So the grib-file relating to index-file may not even exist, if I understand correctly?

Wouldn't it be safer to instead of implementing TryFrom<CodesIndex> for CodesHandle add a new constructor for CodesHandle named new_from_file_with_index()? I know I suggested using TryFrom previously - I didn't know how eccodes API for indexing works. That additional constructor could then take CodesIndex as argument and take ownership of grib-file. Would that make sense?

I really like the idea of using Trait for selecting keys. I see that you are calling unsafe functions in select() without any additional checks. But if I understand correctly index_handle cannot be modified after constructing so it's always safe to use it. Is that right?

Thank for adding tests for checking your code!

Quba1 commented 1 year ago

I've introduced two small changes to the working branch with your PR:

Quba1 commented 1 year ago

Out of interest, NOAA NCEP provides index files for the GRIBs from their GFS model, like this one gfs.idx. But eccodes fail to read it. Do you know why that is? Do they use some different type of index-files?

asura6 commented 1 year ago

Thank you for the engagement and review. I appreciate if you continue commenting on both improvements in workflow and the code quality if any comes to mind. I'll properly get back to you when I have some more time on my hands.

Out of interest, NOAA NCEP provides index files for the GRIBs from their GFS model, like this one gfs.idx. But eccodes fail to read it. Do you know why that is? Do they use some different type of index-files?

Would you mind providing the mentioned index-file?

One problem I've previously encountered with grib-indices is that they encode paths to grib-files. Like this

grib_dump data/iceland-surface.idx 
***** FILE: data/iceland-surface.idx 
GRIB File: ./data/iceland-surface.grib
Index keys:
key name = shortName
values = 10u, 10v, 2d, 2t, msl
key name = typeOfLevel
values = surface
key name = level
values = 0
key name = stepType
values = instant
Index count = 5
no messages found in data/iceland-surface.idx

This format imposes a requirement of matching file paths (see GRIB File: ./data/iceland-surface.grib). I don't particularly like this convention.

Sometimes indices are also created by removing keys with only a single value (in the above example only shortName would be indexed in that case). This makes a lot of sense to me but one of my coworkers encountered problems using such index-files. I don't understand why it would but we since then always create them by explicitly listing all keys and values.

Quba1 commented 1 year ago

Would you mind providing the mentioned index-file?

Sorry, my mistake. I've updated the original message to include link to the index file.

asura6 commented 1 year ago

I've never seen those kind of index files before. I'm not sure if eccodes supports them.

I couldn't help myself from playing around a bit and was able to extract arbitrary messages from gfs-grib files on the command line using dd by interpreting the second column as a byte offset.

Quba1 commented 8 months ago

Hi again @asura6 and thanks for your patience!

Your code was a great starting point for introducing Index capabilities to this crate and guided me how to use codes_index. Thanks for your contribution.

I started working working on structuring the API and adding more tests to your PR (you can see that on codes-index branch). But then I unfortunately discovered that codes_index might not be thread-safe. ecCodes documentation states that when built with threads options the library is thread-safe, but it seems otherwise. You can see example of that in #10.

The issue seems to be that codes_index_read, codes_index_add_file and possibly codes_index_delete, cannot be called for the same/related .grib/.idx files if in another thread one of those functions is running.

This issue did not appear for CodesHandle, because *codes_handle is created from *FILE which is derived via fopen() from fs::File. My guess is that as fs::File is the function interacting with OS it assures the thread-safety for all other functions. For codes_index ecCodes itself interacts with filesystem which might be the issue.

I will try to contact ecCodes devs about this issue and see what they think about that.

One solution in Rust side that I see is adding global Mutex like netcdf crate does. But that will need more testing.

In the meantime, thanks again for your contribution. If you have any thoughts or ideas about this issue or CodesIndex API, do let me know!

Quba1 commented 8 months ago

It seems that mutex have solved this issue, although it's not optimal :/ I think after adding documentation CodesIndex can be merged into main and published on crates.io. But I will wait for ecCodes devs response before merging.

Quba1 commented 8 months ago

After more testing it seems that unfortunately that even with Mutex indexing functions throw errors and interfere with non-indexing functions as well. But segfaults don't appear anymore.