ecmwf / pdbufr

High-level BUFR interface for ecCodes
Apache License 2.0

Improve filter speed #2

Open sandorkertesz opened 4 years ago

sandorkertesz commented 4 years ago

The performance of the pdbufr BUFR filter should be improved. It is currently 4-5 times slower than the BUFR filter in Metview Python (which is based on a C++ wrapper around ecCodes), and that is itself slower than the bufr_filter ecCodes command-line tool. The following test case illustrates the problem:

File test.bufr contains 3927 synop messages and we want to extract the 2m temperature values from it. This is the test code in Metview Python:

import metview as mv
f = mv.read('test.bufr')
gpt = mv.obsfilter(
    data=f,
    output="csv",
    parameter='airTemperatureAt2M',
)
res = gpt.to_dataframe()
print(len(res))

and this is the code with pdbufr:

import pdbufr
f = 'test.bufr'
res = pdbufr.read_bufr(f, columns=('latitude', 'longitude', 'airTemperatureAt2M'))
print(len(res))

The execution time is as follows:

shahramn commented 4 years ago

Re: BUFR Decoding performance

Please note the trick of excluding some keys documented here: https://confluence.ecmwf.int/display/UDOC/Performance+improvement+by+skipping+some+keys+-+ecCodes+BUFR+FAQ
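
A minimal sketch of that trick with the ecCodes Python bindings, assuming the key described on the FAQ page is `skipExtraKeyAttributes` and that it has to be set before `unpack`. The helper name `count_decoded_messages` and its `ecc` parameter (which only exists so that a different module object can be passed in) are illustrative and not part of ecCodes or pdbufr.

```python
def count_decoded_messages(path, ecc=None):
    """Decode every BUFR message in `path`, skipping extra key attributes."""
    if ecc is None:
        import eccodes as ecc  # requires the ecCodes Python bindings

    count = 0
    with open(path, "rb") as f:
        while True:
            msg = ecc.codes_bufr_new_from_file(f)
            if msg is None:  # end of file
                break
            try:
                # Assumption: the key must be set before 'unpack' to take effect.
                ecc.codes_set(msg, "skipExtraKeyAttributes", 1)
                ecc.codes_set(msg, "unpack", 1)
                count += 1
            finally:
                # Always free the message handle, even if decoding fails.
                ecc.codes_release(msg)
    return count
```

Calling `count_decoded_messages('test.bufr')` with and without the `skipExtraKeyAttributes` line would be one way to check how much the trick helps for a given file.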



alexamici commented 3 years ago

By far the main bottleneck is going from Python to C.

A major breakthrough was reached here:

https://github.com/ecmwf/pdbufr/commit/479c2540eb87b6ad2888a2c51d7cedeec0b27c59

by caching the message keys for similar messages.

Several benchmark cases gain a 30-35% speed-up.
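
The idea of caching keys for similar messages can be sketched in plain Python as below. This is not the code from the commit: the `KeyCache` class, the `slow_list_keys` stand-in, the dictionary-based messages, and the choice of the descriptor list as the "similarity" signature are all assumptions made for illustration.

```python
from typing import Callable, Dict, Hashable, List


class KeyCache:
    """Memoise an expensive per-message key lookup by message signature."""

    def __init__(self, list_keys: Callable[[dict], List[str]]):
        self._list_keys = list_keys  # expensive call (crosses into C in practice)
        self._cache: Dict[Hashable, List[str]] = {}
        self.misses = 0

    def keys_for(self, message: dict) -> List[str]:
        # Assumption: messages with identical descriptors expose identical keys.
        signature = tuple(message["descriptors"])
        if signature not in self._cache:
            self.misses += 1
            self._cache[signature] = self._list_keys(message)
        return self._cache[signature]


def slow_list_keys(message: dict) -> List[str]:
    # Hypothetical stand-in for iterating over keys through the C library.
    return sorted(message["values"])


# Illustrative messages; the descriptor numbers and values are made up.
messages = [
    {"descriptors": [307080], "values": {"latitude": 47.5, "airTemperatureAt2M": 280.1}},
    {"descriptors": [307080], "values": {"latitude": 48.0, "airTemperatureAt2M": 279.4}},
    {"descriptors": [307086], "values": {"latitude": 46.2, "windSpeed": 3.0}},
]

cache = KeyCache(slow_list_keys)
key_lists = [cache.keys_for(m) for m in messages]
print(cache.misses)  # 2: the second message reuses the first one's key list
```

Only the key enumeration is cached in this sketch; the values would still have to be read per message, so the gain depends on how many messages share a signature.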