pdbufr can take a very long time to filter certain BUFR files/messages

iainrussell commented 1 year ago

The following refers to a BUFR file in ECMWF internal file system under my username: $PERM/public/i0tp_07062020_00.buf. It contains 364,359 messages and 715,925 subsets. The following call to pdbufr takes around 41 minutes on the HPC:

import pdbufr

DIR = './'
filein = DIR + 'i0tp_07062020_00.buf'

df = pdbufr.read_bufr(filein,
    columns=("year",'month', "hour", "minute","latitude", "longitude", "atmosphericPathDelayInSatelliteSignal"),
                      filters={"stationOrSiteName": 'S3AG-EUME'},
    )
print(df)

I compared performance by running a pure eccodes-python script to iterate over the messages and unpack each one. That takes around 5 minutes.

I also modified the pdbufr code so as to avoid performing the actual filtering, and then it took around 5 minutes too.

I could see that certain sets of messages in the BUFR file were taking a lot longer to process than others. After putting a little profiling code into pdbufr and printing the indexes of the messages that were taking longer than 0.01 seconds to process, these were some of the message indexes (0-based) that consistently took longer:

count 36467 time 0.015330487862229347
count 82749 time 0.022938511800020933
count 110001 time 0.10010456619784236
count 110002 time 0.0794135918840766
… # most of the messages in between above and below
count 132562 time 1.6417972878552973
count 132563 time 0.033606610260903835
count 160917 time 0.04080211604014039
…
count 311422 time 0.02709425799548626
count 311423 time 0.010877656750380993
…
count 311619 time 1.4403185020200908
count 311620 time 5.845485555008054
…

Some standout times to process certain messages, which seemed consistent with multiple runs:

count 110774 time 15.490661825053394
count 110804 time 12.531519242096692
count 110878 time 10.05591939855367
count 110905 time 17.22438928298652
count 110909 time 14.911728173028678
count 110910 time 11.457727732136846
count 110911 time 12.165616693906486
count 110912 time 11.780960503034294
count 110913 time 9.73410841403529
count 110959 time 14.38328706100583
count 110960 time 0.0479189301840961
count 110961 time 12.562545870896429
count 110962 time 15.093519839923829
count 110963 time 16.94631726015359
count 110992 time 17.39802599698305
count 110993 time 18.96778273070231
count 110994 time 19.078917557373643
count 110996 time 0.017089318949729204
count 110997 time 13.097122138831764
count 110998 time 15.609233817085624
count 110999 time 12.680324002169073
count 111000 time 12.114437516778708
count 111001 time 19.27112303301692
count 111002 time 11.963714307174087
count 111196 time 25.567047986667603
count 111256 time 30.325322085991502
count 111552 time 39.79854283807799
count 111756 time 58.527211253065616
count 311590 time 16.124957408290356
count 311591 time 13.132242653053254

I copied a single message into $PERM/public/one_msg_110773.bfr (bufr_copy -wcount=110773 i0tp_07062020_00.buf one_msg_110773.bfr, which is the message at my 0-based 110774 index printouts) and it takes 15 seconds to process by itself with the same pdbufr script. This particular message has what looks like 6632 time periods encoded in it, and the single message occupies almost 140KB on disk.

Interestingly, calling bufr_dump on this message took only 0.6 seconds:

time bufr_dump -j a one_msg_110773.bfr >b.dump

shahramn commented 1 year ago

Also one can exclude certain keys to speed up the unpacking the data section. See https://confluence.ecmwf.int/display/UDOC/Performance+improvement+by+skipping+some+keys+-+ecCodes+BUFR+FAQ

iainrussell commented 1 year ago

However, we already do this: https://github.com/ecmwf/pdbufr/blob/master/pdbufr/bufr_structure.py#L630