ecmwf / pdbufr

High-level BUFR interface for ecCodes
Apache License 2.0

Feature/performance #54

Closed sandorkertesz closed 1 year ago

sandorkertesz commented 1 year ago

Fixes #53

The problem in #53 was caused by calling the following code too many times to determine whether a BUFR key is a coordinate:

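code = eccodes.codes_get(msg, key + "->code")  # msg: message handle; code is a string such as "005001"
try:
    is_coord = int(code[:3]) < 10  # leading "FXX" digits below 10 mark a coordinate descriptor
except ValueError:
    is_coord = False  # assumption: the original except branch is not shown in this excerpt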

This PR improves performance by caching the is_coord values within a given message and exposing them through the newly added BufrMessage.is_coord() method. To make this work for messages defined as a mapping (e.g. a dict), the message wrapper had to be modified. Please note that the coordinate test from the code snippet above was also optimised and is now implemented roughly like this:

code = self._get(name + "->code", int)
is_coord = code <= 9999
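
As an illustration only, the caching idea can be sketched outside pdbufr as follows. `CoordCache` and `fake_get_code` are hypothetical names invented for this sketch and are not the classes added by the PR; the lookup function stands in for the real `->code` key access. The sketch also keeps the older string-based test next to the integer one so the two can be compared.

```python
from typing import Callable, Dict


class CoordCache:
    """Hypothetical per-message cache for the coordinate test (not pdbufr's class)."""

    def __init__(self, get_code: Callable[[str], int]) -> None:
        self._get_code = get_code  # stands in for the real "->code" lookup
        self._is_coord: Dict[str, bool] = {}

    def is_coord(self, name: str) -> bool:
        cached = self._is_coord.get(name)
        if cached is None:
            # The descriptor FXXYYY read as an integer: F == 0 and XX < 10
            # is the same condition as code <= 9999.
            cached = self._get_code(name + "->code") <= 9999
            self._is_coord[name] = cached
        return cached


def is_coord_from_string(code: str) -> bool:
    """The older string-based test, kept here only for comparison."""
    return int(code[:3]) < 10


lookups = []


def fake_get_code(key: str) -> int:
    """Assumed stand-in for the expensive lookup; records every call."""
    lookups.append(key)
    return {"latitude->code": 5001, "airTemperature->code": 12101}[key]


cache = CoordCache(fake_get_code)
answers = [cache.is_coord("latitude") for _ in range(1000)]
print(len(lookups))  # the expensive lookup ran once for 1000 queries
```

Keying the cache on the key name within one message means the lookup cost is paid once per distinct key rather than once per access, which is the effect described above.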

The results are encouraging. For the full BUFR file from #53 this code:

import pdbufr

f = "i0tp_07062020_00.buf"
df = pdbufr.read_bufr(
    f,
    columns=("year", "month", "hour", "minute", "latitude", "longitude", "atmosphericPathDelayInSatelliteSignal"),
    filters={"stationOrSiteName": "S3AG-EUME"},
)
print(len(df))

runs in 8 min 24 sec on my MacBook.

As a reference, iterating through the messages with eccodes and unpacking each one with the code below takes 7 min 26 sec.

import eccodes

# The with block closes the file even if decoding raises.
with open("i0tp_07062020_00.buf", "rb") as f:
    while True:
        bufr = eccodes.codes_bufr_new_from_file(f)
        if bufr is None:
            break

        # Setting "unpack" decodes the data section of the message.
        eccodes.codes_set(bufr, "unpack", 1)
        eccodes.codes_release(bufr)

codecov-commenter commented 1 year ago

Codecov Report

Patch coverage: 87.93% and project coverage change: -0.33 :warning:

Comparison is base (b77c1cd) 95.46% compared to head (7c70ea4) 95.14%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master      #54      +/-   ##
==========================================
- Coverage   95.46%   95.14%   -0.33%
==========================================
  Files          15       15
  Lines        1587     1626      +39
  Branches      210      212       +2
==========================================
+ Hits         1515     1547      +32
- Misses         55       61       +6
- Partials       17       18       +1
```

| [Impacted Files](https://app.codecov.io/gh/ecmwf/pdbufr/pull/54?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ecmwf) | Coverage Δ | |
|---|---|---|
| [pdbufr/high\_level\_bufr/bufr.py](https://app.codecov.io/gh/ecmwf/pdbufr/pull/54?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ecmwf#diff-cGRidWZyL2hpZ2hfbGV2ZWxfYnVmci9idWZyLnB5) | `85.41% <82.60%> (-2.59%)` | :arrow_down: |
| [pdbufr/bufr\_structure.py](https://app.codecov.io/gh/ecmwf/pdbufr/pull/54?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ecmwf#diff-cGRidWZyL2J1ZnJfc3RydWN0dXJlLnB5) | `93.67% <88.88%> (-0.65%)` | :arrow_down: |
| [pdbufr/high\_level\_bufr/codesmessage.py](https://app.codecov.io/gh/ecmwf/pdbufr/pull/54?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ecmwf#diff-cGRidWZyL2hpZ2hfbGV2ZWxfYnVmci9jb2Rlc21lc3NhZ2UucHk=) | `63.15% <100.00%> (+0.99%)` | :arrow_up: |
| [tests/test\_20\_bufr\_structure.py](https://app.codecov.io/gh/ecmwf/pdbufr/pull/54?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ecmwf#diff-dGVzdHMvdGVzdF8yMF9idWZyX3N0cnVjdHVyZS5weQ==) | `100.00% <100.00%> (ø)` | |

iainrussell commented 1 year ago

That's wonderful @sandorkertesz! I've just run the code on an interactive node of our HPC and it takes 7m39s, a fantastic improvement from 41 minutes!
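
For reference, the timings quoted in this thread can be compared with a few lines of Python. This is only a sketch of the arithmetic: the variable names are made up for the example, and it assumes the 41-minute figure was measured on the same HPC node as the 7m39s run.

```python
def to_seconds(minutes: int, seconds: int = 0) -> int:
    """Convert a minutes/seconds timing to seconds."""
    return 60 * minutes + seconds


# Timings quoted in this thread.
hpc_before = to_seconds(41)      # before this PR (assumed same HPC node)
hpc_after = to_seconds(7, 39)    # with this PR, interactive HPC node
mac_pdbufr = to_seconds(8, 24)   # pdbufr.read_bufr on the MacBook
mac_eccodes = to_seconds(7, 26)  # plain eccodes unpacking on the MacBook

speedup = hpc_before / hpc_after
overhead = (mac_pdbufr - mac_eccodes) / mac_eccodes

print(f"HPC speed-up: {speedup:.1f}x")
print(f"pdbufr overhead over plain unpacking: {overhead:.0%}")
```

With these numbers the speed-up is about 5.4x, and pdbufr now costs about 13% more than simply unpacking every message with eccodes.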