ecmwf / pdbufr

High-level BUFR interface for ecCodes

Apache License 2.0

23 stars 8 forks source link

Extension for GeoPandas and filter option for computed keys #32

Closed nklever closed 2 years ago

nklever commented 3 years ago

Hi Alessandro, thanks for this great repo of a high level API to BUFR ! I wrote an extension for using it also with GeoPandas as an alternative to pandas. Therefor I added two additional computed keys (geometry and CRS) and extended your great filtering mechanism also for computed keys - necessary to filter all observations e.g. within a circle around a centre (see the extended description or also the new test_50_geopandas file). I hope that this contribution will lead to another step using BUFR data easily ! Nik

FussyDuck commented 3 years ago

All committers have signed the CLA.

alexamici commented 3 years ago

@nklever this looks great!

Note that I'm not really maintaining pdbufr myself and I'm not sure of ECMWF intensions (@iainrussell ?). I need probably to update the Lead developer in the README 😓 .

I have a few comment before I can recommend to merge this PR:

at the highest-level, I don't like making GeoPandas a required dependency, could you change the code to not fail when GeoPandas and Shapely are not present? You would need to try import the modules and not support the geopandas option if they are not present. Tests may be tricky to do.
please, merge from master where I added running the GitHub actions on pull requests as well
please fix the code style, for example by running black .
you may find other issues raised by running all the tests the tests

I'll add other comments in the code.

alexamici commented 3 years ago

Finally, I'm not sure I get your implementation for filtering on the computed keys, but I'll come back to that later.

nklever commented 3 years ago

@alexamici I hope I now changed my code according to all your comments ! Thanks for this detailed review and useful hints !

nklever commented 2 years ago

@alexamici

As you mentioned in your comment #issuecomment-915032860:

_at the highest-level, I don't like making GeoPandas a required dependency, could you change the code to not fail when GeoPandas and Shapely are not present? You would need to try import the modules and not support the geopandas option if they are not present. Tests may be tricky to do._

the tests are on my opinion not feasible.

My new idea was to revoke all dependencies on geopandas in pdbufr, but let the additional computed keys geometry and CRS and also the filtering mechanism of computed keys in pdbufr. With this reduced changes, the usage of geopandas can now be completely separated from pdbufr to a new repo gpdbufr.

iainrussell commented 2 years ago

Hello, sorry, I should have looked at this sooner, but no time! First, thank you very much @alexamici for spending time to look at this, and of course thank you @nklever for this good-looking contribution! I will look at the code now, but my first comment is that it should be possible, if desired, to have the tests conditional on geopandas being available. See, for instance, the method used by climetlab here.

That said, it might indeed be cleaner to have the separation into a separate repo. I'll also ask @sandorkertesz what he thinks here.

Just note that that gpdbufr currently has lots of ECMWF copyright notices on it, which is not correct, because it is not an ECMWF development (it would be correct if the code were in pdbufr).

Cheers, Iain

iainrussell commented 2 years ago

I've allowed the github actions to run on this. First: @nklever , the tests show what I saw locally: "fixture 'file' not found". Also there is one file that 'black' does not like - I've found that 'black' can be temperamental depending on which version you've got installed, so it's a good idea to install the same version that is being used in the GA actions (19.10b0) Second: @alexamici, I don't think we want to run the 'deploy' action on a pull request do we? In any case we will review the GA actions and try to make them more like climetlab for consistency.

nklever commented 2 years ago

@iainrussell Thanks for your comment - yes, I reduced some code in the additional test. There's no difference between the black version of GA actions and my local installation !

Because of the copyright notices of gpdbufr - yes I copied pdbufr and changed only necessary code and setup files. But isn't it reasonable to move gpdbufr also to ECMWF ?

iainrussell commented 2 years ago

Hi @nklever , Unfortunately, each package comes with some support and maintenance burden, not to mention documentation, another set of github actions to maintain and probably another conda feedstock. We really do not have the person-power at the moment to support another package. So I think the options are that either you maintain gpdbufr yourself, or else we look at putting the code back into pdbufr, ensuring that geopandas is not strictly required. What do you think?

nklever commented 2 years ago

Hi @iainrussell, ok, thats reproducable ! I will check whether an example using geopandas and a test therefore could be done in pdbufr without the strictly requirement of using geopandas. On the other hand, your question rised the idea, to extend your question with a more generalized question whether filtering should be done in pdbufr or in pandas. Filtering in pdbufr is done with a very good algorithm, but it is (in it's original approach) limited to original bufr key data. I believe this can be extended to a generalized algorithm using a generalized computed keys approach. The current code is a hardcoded compromise using only geometry and CRS as additional computed keys. On the other hand filtering can also be done in pandas using query or numpy or other methods, but this would lead to essential more data in the original dataframe and could also be very complex. So I would ask back what filtering should be used ? I would prefer filtering in pdbufr. If you agree, I would additonally try to generalize the computed keys approach in the next time.

nklever commented 2 years ago

I've now added test_60_geopandas with the code from climetlab as @iainrussell mentioned. With this additional test, pdbufr is prepared for using it with geopandas, but it's not mandatory to be installed. The other idea of an algorithm using generalized computed keys should be done in a new PR - I will try this later.

iainrussell commented 2 years ago

Hi @nklever , Ok, for some reason, the github checks always seem to require me to approve them before they run, so they were not run automatically, but have been now, with various errors unfortunately (it's not obvious to me why the test_50 failure occurs). For your question, I think that it would be best to perform the filtering in pdbufr because, as you say, otherwise pdbufr will always need to return extra columns just in case someone wants to perform extra geopandas filtering on it. I guess it could be a flag to say whether to include it or not, but I think it's ok as is. I'll keep an eye on the commits here and re-run the tests when I see a new one!

nklever commented 2 years ago

@iainrussell Ok, the problem of the automated workflow in my repo was, that on default the workflow was not enabled during a fork. After enabling it in my repo, it works automatically, but only in my repo. Because it was all but impossible to reproduce all steps in the workflow locally, many iterations are necessary until this current last commit, which gives all successful tests (see Actions in my repo).

Nevertheless it seems - as you stated - the github checks still require you to approve them before they run, so they were not run automatically in this original repo.

I've now also reduced my repo gpdbufr to an example of pdbufr (with the code of this last commit) usage with geopandas.

iainrussell commented 2 years ago

OK @nklever, this is really nice! My apologies for not getting back to you sooner! I'm going to merge this into master now. Well done and thanks for your work, I think it's a nice addition without adding too much to the core of pdbufr.

codecov-commenter commented 2 years ago

Codecov Report

Merging #32 (031f55f) into master (281a2b4) will increase coverage by 0.36%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #32      +/-   ##
==========================================
+ Coverage   95.20%   95.57%   +0.36%     
==========================================
  Files          12       13       +1     
  Lines         792      858      +66     
  Branches       93      101       +8     
==========================================
+ Hits          754      820      +66     
  Misses         31       31              
  Partials        7        7

Impacted Files	Coverage Δ
pdbufr/bufr_read.py	`100.00% <ø> (ø)`
pdbufr/bufr_structure.py	`100.00% <100.00%> (ø)`
tests/test_40_sample_data.py	`99.15% <100.00%> (ø)`
tests/test_50_filter_automated_keys.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 281a2b4...031f55f. Read the comment docs.