enram / vptstools

Python library to transfer and convert vertical profile time series data
https://enram.github.io/vptstools/
MIT License

Not corresponding datetime between h5 filename and h5 content (/what/time, "HHmmss") #11

Closed stijnvanhoey closed 1 year ago

stijnvanhoey commented 1 year ago

I downloaded a set of bejab files as a test case. While trying out the CSV concatenation (to create a vpts-csv), I found that several files repeat the same /what/time timestamp, which does not match the timestamp in the file name:

FILENAME                         /what/time               
bejab_vp_20221111T233000Z_0x9.h5 233000
bejab_vp_20221111T234000Z_0x9.h5 233000
bejab_vp_20221111T234500Z_0x9.h5 234500
bejab_vp_20221111T235000Z_0x9.h5 234500
bejab_vp_20221111T235500Z_0x9.h5 234500

To check, I downloaded some files from the Baltrad sftp directly and compared the timestamp of the file with the timestamp of the /what/time, leading to several of these differences (67% on my quick test on 50 files):

FILE    WHAT/TIME   FILEPATH
2250    2245        bejab_vp_20221112T225000Z_0x9.h5
0235    0230        bewid_vp_20221113T023500Z_0xb.h5
1635    1630        chppm_vp_20221114T163500Z_0xb.h5
0310    0300        dedrs_vp_20221115T031000Z_0xb.h5
0105    0100        defbg_vp_20221114T010500Z_0xb.h5
1025    1015        deisn_vp_20221115T102500Z_0xb.h5
0125    0115        denhb_vp_20221114T012500Z_0xb.h5
0505    0500        denhb_vp_20221114T050500Z_0xb.h5
1210    1200        eehar_vp_20221113T121000Z_0xb.h5
0410    0400        eehar_vp_20221114T041000Z_0xb.h5
0520    0515        esalm_vp_20221114T052000Z_0xb.h5
1410    1400        esbar_vp_20221113T141000Z_0xb.h5
1420    1415        essse_vp_20221114T142000Z_0xb.h5
1040    1030        esval_vp_20221115T104000Z_0xb.h5
0150    0145        filuo_vp_20221114T015000Z_0xb.h5
0255    0245        finur_vp_20221114T025500Z_0xb.h5
1440    1430        frabb_vp_20221114T144000Z_0xb.h5
0050    0045        frcol_vp_20221115T005000Z_0xb.h5
1835    1830        frmcl_vp_20221114T183500Z_0xb.h5
1340    1330        frmom_vp_20221114T134000Z_0xb.h5
2050    2045        frnim_vp_20221113T205000Z_0xb.h5
0320    0315        frniz_vp_20221113T032000Z_0xb.h5
0640    0630        frtou_vp_20221113T064000Z_0xb.h5
2250    2245        frtra_vp_20221114T225000Z_0xb.h5
0825    0815        frtre_vp_20221113T082500Z_0xb.h5
0605    0600        nohgb_vp_20221115T060500Z_0xb.h5
1555    1545        nosmn_vp_20221113T155500Z_0xb.h5
0020    0015        plram_vp_20221114T002000Z_0xb.h5
2205    2200        sekaa_vp_20221113T220500Z_0xb.h5
0210    0200        sevax_vp_20221113T021000Z_0xb.h5

@peterdesmet is this a known issue, or am I stuck on a bug I just can't get around? For the latter experiment I relied only on the h5py package as a dependency (I left out the vptstools modules and extracted only the timestamps):

from pathlib import Path

import h5py

file_paths = sorted(Path("../data/raw/baltrad/").rglob("*.h5"))

for path_h5 in file_paths:
    with h5py.File(path_h5, mode="r") as odim_vp:
        # HHMM from the filename, e.g. bejab_vp_20221112T225000Z_0x9.h5 -> "2250"
        time_filename = path_h5.stem.split("_")[2][9:13]
        # HHMM from the /what/time attribute (stored as bytes, "HHmmss")
        time_h5_what = odim_vp["what"].attrs.get("time").decode("utf-8")[:-2]
        if time_filename != time_h5_what:
            print(time_filename, time_h5_what, path_h5)

The time difference might not be an issue if the timestamps are unique among the different files. Or should we rather use the timestamp from the file path of the h5 files?
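
To see whether the /what/time values are unique among files, the pairs from the first table above can be grouped by radar and nominal time. This is only a minimal sketch: the helper names `filename_datetime` and `files_per_nominal_time` are made up for illustration, and it ignores /what/date because all five files in that table share the same date.

```python
from collections import defaultdict
from datetime import datetime

# (filename, /what/time) pairs copied from the first table in this issue
observed = [
    ("bejab_vp_20221111T233000Z_0x9.h5", "233000"),
    ("bejab_vp_20221111T234000Z_0x9.h5", "233000"),
    ("bejab_vp_20221111T234500Z_0x9.h5", "234500"),
    ("bejab_vp_20221111T235000Z_0x9.h5", "234500"),
    ("bejab_vp_20221111T235500Z_0x9.h5", "234500"),
]


def filename_datetime(filename):
    """Parse the timestamp embedded in a name like bejab_vp_20221111T233000Z_0x9.h5."""
    return datetime.strptime(filename.split("_")[2], "%Y%m%dT%H%M%SZ")


def files_per_nominal_time(pairs):
    """Group filenames by (radar, /what/time) to reveal shared nominal times."""
    groups = defaultdict(list)
    for filename, what_time in pairs:
        radar = filename.split("_")[0]
        groups[(radar, what_time)].append(filename)
    return dict(groups)


# Keep only the nominal times that are claimed by more than one file
shared = {
    key: names
    for key, names in files_per_nominal_time(observed).items()
    if len(names) > 1
}
for (radar, what_time), names in shared.items():
    print(radar, what_time, len(names), "files")
```

For these five files, both nominal times (233000 and 234500) are shared by more than one file, so the metadata timestamps are not unique.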

peterdesmet commented 1 year ago

@BerendWijers you are probably familiar with timestamp differences between the filename and the metadata in the file?

Personally I would take the timestamp as written in the filename.

peterdesmet commented 1 year ago

I was wrong: bioRad uses the nominal time from the file metadata (22:45:00), not the filename (22:50:00):

library(bioRad)
vp <- read_vpfiles("bejab_vp_20221111T225000Z_0x9.h5")
> as.data.frame(vp)
   radar            datetime       ff        dbz         dens          u          v gap           w n_dbz        dd    n
1  bejab 2022-11-11 22:45:00      NaN  -2.742583 17.424041748        NaN        NaN   1         NaN   616       NaN  325
2  bejab 2022-11-11 22:45:00 6.999657  -4.776615 10.908013344 -6.8927603 -1.2186255   0 -40.6989670 15732 259.97382 4002

vptstools should do the same.

stijnvanhoey commented 1 year ago

Using the timestamp from the metadata, I'm running into trouble when concatenating multiple vp files into a vpts (for vpts CSV). For example, when combining the files bejab_vp_20221111T233000Z_0x9.h5 and bejab_vp_20221111T234000Z_0x9.h5, with the following metadata:

FILENAME                         /what/time               
bejab_vp_20221111T233000Z_0x9.h5 233000
bejab_vp_20221111T234000Z_0x9.h5 233000

the resulting table (as a pandas DataFrame) ends up having duplicate entries, as both files contain the same timestamp 2022-11-11 23:30:00 in the metadata:

radar datetime height u v w ff dd sd_vvp gap eta dens dbz dbz_all n n_dbz n_all n_dbz_all rcs sd_vvp_threshold vcp radar_longitude radar_latitude radar_height radar_wavelength
0 bejab 2022-11-11 23:30:00Z 0 -8.04189 1.20268 0.383004 8.13132 278.506 3.04745 False 135.523 12.3203 -4.24786 21.9194 172 456 3988 8769 11 2 0 3.0642 51.1917 50 5.3
0 bejab 2022-11-11 23:30:00Z 0 -6.65065 -0.546647 -35.8612 6.67308 265.301 3.20129 False 238.708 21.7007 -1.78933 21.2997 520 1487 3928 8765 11 2 0 3.0642 51.1917 50 5.3
1 bejab 2022-11-11 23:30:00Z 200 -6.74806 -1.32668 -48.9435 6.87724 258.877 2.89076 False 132.228 12.0207 -4.35478 15.5665 5021 15915 8410 22624 11 2 0 3.0642 51.1917 50 5.3
1 bejab 2022-11-11 23:30:00Z 200 -6.46451 -1.81548 -62.7614 6.7146 254.313 2.86414 False 142.676 12.9706 -4.02448 17.685 4808 15605 8493 22671 11 2 0 3.0642 51.1917 50 5.3

@peterdesmet How do we handle this? Any chance we can cover this by the current standard or fix this in an earlier stage?

peterdesmet commented 1 year ago

@adokter how does bioRad deal with duplicate timestamps across multiple h5 files?

adokter commented 1 year ago

That isn't solved 100% satisfactorily yet, see https://github.com/adokter/bioRad/issues/371. vpts objects with duplicate timestamps are allowed, but we currently just pick the first profile in certain applications.

peterdesmet commented 1 year ago

Picking the first profile is fine as an approach for me.

peterdesmet commented 1 year ago

Unfortunately, we can't enforce that through the standard. The combination of radar, datetime and height should be unique, but that cannot be expressed in Table Schema constraints. It might be worth testing https://specs.frictionlessdata.io/patterns/#specification-12, though.
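
Outside the standard, the uniqueness of radar, datetime and height can still be checked in code after concatenation. The following is a rough sketch rather than anything vptstools provides: `duplicated_keys` is a hypothetical helper, and the rows hold only a few columns copied from the concatenated table earlier in this issue.

```python
from collections import Counter

# A few columns from the concatenated table shown earlier in this issue
rows = [
    {"radar": "bejab", "datetime": "2022-11-11 23:30:00Z", "height": 0, "u": -8.04189},
    {"radar": "bejab", "datetime": "2022-11-11 23:30:00Z", "height": 0, "u": -6.65065},
    {"radar": "bejab", "datetime": "2022-11-11 23:30:00Z", "height": 200, "u": -6.74806},
    {"radar": "bejab", "datetime": "2022-11-11 23:30:00Z", "height": 200, "u": -6.46451},
]

KEY = ("radar", "datetime", "height")


def duplicated_keys(rows, key=KEY):
    """Return each compound key that occurs more than once, with its count."""
    counts = Counter(tuple(row[name] for name in key) for row in rows)
    return {k: n for k, n in counts.items() if n > 1}


for key, count in duplicated_keys(rows).items():
    print(key, "occurs", count, "times")
```

An empty result would mean the combination is unique; here both heights are reported twice.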

stijnvanhoey commented 1 year ago

> Picking the first profile is fine as an approach for me.

Ok, we'll take the first one in case duplicates appear.
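
As a sketch of what taking the first one could look like: keep the first row seen for each (radar, datetime, height) and drop later ones. `keep_first_rows` is a hypothetical name, not an existing vptstools function. This only equals "picking the first profile" under the assumption that rows are concatenated in file order and that the duplicate profiles cover the same heights.

```python
# A few columns from the concatenated table shown earlier in this issue
rows = [
    {"radar": "bejab", "datetime": "2022-11-11 23:30:00Z", "height": 0, "u": -8.04189},
    {"radar": "bejab", "datetime": "2022-11-11 23:30:00Z", "height": 0, "u": -6.65065},
    {"radar": "bejab", "datetime": "2022-11-11 23:30:00Z", "height": 200, "u": -6.74806},
    {"radar": "bejab", "datetime": "2022-11-11 23:30:00Z", "height": 200, "u": -6.46451},
]


def keep_first_rows(rows, key=("radar", "datetime", "height")):
    """Keep the first row per compound key, preserving the input order."""
    seen = set()
    kept = []
    for row in rows:
        k = tuple(row[name] for name in key)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept


deduplicated = keep_first_rows(rows)
for row in deduplicated:
    print(row["datetime"], row["height"], row["u"])
```

On a pandas DataFrame, the equivalent would be `df.drop_duplicates(subset=["radar", "datetime", "height"], keep="first")`.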

stijnvanhoey commented 1 year ago

After review of the files generated in a first test batch, the decision was made to create non-conforming vpts files that include the duplicate timestamps. Hence, duplicate timestamps in a single vpts will all be taken into account, see https://github.com/enram/vptstools/issues/22