Not corresponding datetime between h5 filename and h5 content (/what/time, "HHmmss")

stijnvanhoey commented 1 year ago

I downloaded a set of files from the bejab data as a test case. While trying out the CSV concatenation (to create a vpts-csv), I found that several files share the same /what/time timestamp, and that this timestamp does not match the one in the file name:

FILENAME                         /what/time               
bejab_vp_20221111T233000Z_0x9.h5 233000
bejab_vp_20221111T234000Z_0x9.h5 233000
bejab_vp_20221111T234500Z_0x9.h5 234500
bejab_vp_20221111T235000Z_0x9.h5 234500
bejab_vp_20221111T235500Z_0x9.h5 234500

To check, I downloaded some files directly from the Baltrad sftp and compared the timestamp in the file name with /what/time. The two differ for many files (67% in my quick test on 50 files):

FILE    WHAT/TIME   FILEPATH
2250    2245        bejab_vp_20221112T225000Z_0x9.h5
0235    0230        bewid_vp_20221113T023500Z_0xb.h5
1635    1630        chppm_vp_20221114T163500Z_0xb.h5
0310    0300        dedrs_vp_20221115T031000Z_0xb.h5
0105    0100        defbg_vp_20221114T010500Z_0xb.h5
1025    1015        deisn_vp_20221115T102500Z_0xb.h5
0125    0115        denhb_vp_20221114T012500Z_0xb.h5
0505    0500        denhb_vp_20221114T050500Z_0xb.h5
1210    1200        eehar_vp_20221113T121000Z_0xb.h5
0410    0400        eehar_vp_20221114T041000Z_0xb.h5
0520    0515        esalm_vp_20221114T052000Z_0xb.h5
1410    1400        esbar_vp_20221113T141000Z_0xb.h5
1420    1415        essse_vp_20221114T142000Z_0xb.h5
1040    1030        esval_vp_20221115T104000Z_0xb.h5
0150    0145        filuo_vp_20221114T015000Z_0xb.h5
0255    0245        finur_vp_20221114T025500Z_0xb.h5
1440    1430        frabb_vp_20221114T144000Z_0xb.h5
0050    0045        frcol_vp_20221115T005000Z_0xb.h5
1835    1830        frmcl_vp_20221114T183500Z_0xb.h5
1340    1330        frmom_vp_20221114T134000Z_0xb.h5
2050    2045        frnim_vp_20221113T205000Z_0xb.h5
0320    0315        frniz_vp_20221113T032000Z_0xb.h5
0640    0630        frtou_vp_20221113T064000Z_0xb.h5
2250    2245        frtra_vp_20221114T225000Z_0xb.h5
0825    0815        frtre_vp_20221113T082500Z_0xb.h5
0605    0600        nohgb_vp_20221115T060500Z_0xb.h5
1555    1545        nosmn_vp_20221113T155500Z_0xb.h5
0020    0015        plram_vp_20221114T002000Z_0xb.h5
2205    2200        sekaa_vp_20221113T220500Z_0xb.h5
0210    0200        sevax_vp_20221113T021000Z_0xb.h5

@peterdesmet is this a known issue, or am I stuck on a bug I just can't get around? For the latter experiment I relied only on the h5py package as a dependency (I left out the vptstools modules and extracted only the timestamps):

from pathlib import Path

import h5py

file_paths = sorted(Path("../data/raw/baltrad/").rglob("*.h5"))

for path_h5 in file_paths:
    with h5py.File(path_h5, mode="r") as odim_vp:
        # HHmm from the file name, e.g. "2250" in bejab_vp_20221112T225000Z_0x9.h5
        time_filename = path_h5.stem.split("_")[2][9:13]
        # HHmm from the /what/time attribute ("HHmmss" with the seconds dropped)
        time_h5_what = odim_vp["what"].attrs.get("time").decode("utf-8")[:-2]
        if time_filename != time_h5_what:
            print(time_filename, time_h5_what, path_h5)

The time difference might not be an issue if the timestamps are unique among the different files. Or should we rather use the timestamp from the file path of the h5 files?
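
To quantify the offset instead of only listing the mismatches, both timestamps can be parsed into datetimes. A minimal sketch, assuming the file name pattern seen above (radar, "vp", a `YYYYmmddTHHMMSSZ` timestamp, a suffix) and assuming /what/date is stored as `YYYYmmdd` next to /what/time. The helper names are made up for illustration, and the date value in the example call is assumed:

```python
from datetime import datetime, timedelta, timezone


def filename_datetime(filename: str) -> datetime:
    """Parse the timestamp embedded in a name like bejab_vp_20221112T225000Z_0x9.h5."""
    stamp = filename.split("_")[2]  # e.g. "20221112T225000Z"
    return datetime.strptime(stamp, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)


def nominal_datetime(what_date: str, what_time: str) -> datetime:
    """Combine /what/date ("YYYYmmdd") and /what/time ("HHMMSS") into a UTC datetime."""
    combined = datetime.strptime(what_date + what_time, "%Y%m%d%H%M%S")
    return combined.replace(tzinfo=timezone.utc)


def filename_offset(filename: str, what_date: str, what_time: str) -> timedelta:
    """Return how far the file name timestamp lies after the nominal time."""
    return filename_datetime(filename) - nominal_datetime(what_date, what_time)


# The /what/date value is assumed here; only the times appear in the table above.
offset = filename_offset("bejab_vp_20221112T225000Z_0x9.h5", "20221112", "224500")
print(offset)  # 0:05:00
```

In the table above the offsets all seem to be 5 or 10 minutes, so a summary of these timedeltas per radar might show whether the pattern is systematic.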

peterdesmet commented 1 year ago

@BerendWijers you are probably familiar with timestamp differences between the filename and the metadata in the file?

Personally I would take the timestamp as written in the filename.

peterdesmet commented 1 year ago

I was wrong, bioRad uses the nominal time from the file metadata (22:45:00), not the filename (22:50:00):

library(bioRad)
vp <- read_vpfiles("bejab_vp_20221111T225000Z_0x9.h5")
> as.data.frame(vp)
   radar            datetime       ff        dbz         dens          u          v gap           w n_dbz        dd    n
1  bejab 2022-11-11 22:45:00      NaN  -2.742583 17.424041748        NaN        NaN   1         NaN   616       NaN  325
2  bejab 2022-11-11 22:45:00 6.999657  -4.776615 10.908013344 -6.8927603 -1.2186255   0 -40.6989670 15732 259.97382 4002

vptstools should do the same.

stijnvanhoey commented 1 year ago

Using the timestamp from the metadata, I'm running into trouble when concatenating multiple vp files into a vpts (for vpts CSV). E.g. when combining the files bejab_vp_20221111T233000Z_0x9.h5 and bejab_vp_20221111T234000Z_0x9.h5, with the following metadata:

FILENAME                         /what/time               
bejab_vp_20221111T233000Z_0x9.h5 233000
bejab_vp_20221111T234000Z_0x9.h5 233000

the resulting table (as a pandas DataFrame) ends up having duplicate entries, as both files contain the same timestamp 2022-11-11 23:30:00 in the metadata:

	radar	datetime	height	u	v	w	ff	dd	sd_vvp	gap	eta	dens	dbz	dbz_all	n	n_dbz	n_all	n_dbz_all	rcs	sd_vvp_threshold	radar_longitude	radar_latitude	radar_height	radar_wavelength
0	bejab	2022-11-11 23:30:00Z	0	-8.04189	1.20268	0.383004	8.13132	278.506	3.04745	False	135.523	12.3203	-4.24786	21.9194	172	456	3988	8769	11	2	3.0642	51.1917	50	5.3
0	bejab	2022-11-11 23:30:00Z	0	-6.65065	-0.546647	-35.8612	6.67308	265.301	3.20129	False	238.708	21.7007	-1.78933	21.2997	520	1487	3928	8765	11	2	3.0642	51.1917	50	5.3
1	bejab	2022-11-11 23:30:00Z	200	-6.74806	-1.32668	-48.9435	6.87724	258.877	2.89076	False	132.228	12.0207	-4.35478	15.5665	5021	15915	8410	22624	11	2	3.0642	51.1917	50	5.3
1	bejab	2022-11-11 23:30:00Z	200	-6.46451	-1.81548	-62.7614	6.7146	254.313	2.86414	False	142.676	12.9706	-4.02448	17.685	4808	15605	8493	22671	11	2	3.0642	51.1917	50	5.3

@peterdesmet How do we handle this? Any chance we can cover this by the current standard or fix this in an earlier stage?
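
For reference, the duplicates in the table above can be flagged by counting the combination of radar, datetime and height. This is only a sketch in plain Python with a hypothetical helper name, not vptstools code; on the DataFrame itself, `DataFrame.duplicated(subset=[...])` would be the likely equivalent:

```python
from collections import Counter

# Rows reduced to the key columns; values copied from the table above.
rows = [
    {"radar": "bejab", "datetime": "2022-11-11 23:30:00Z", "height": 0},
    {"radar": "bejab", "datetime": "2022-11-11 23:30:00Z", "height": 0},
    {"radar": "bejab", "datetime": "2022-11-11 23:30:00Z", "height": 200},
    {"radar": "bejab", "datetime": "2022-11-11 23:30:00Z", "height": 200},
]


def duplicated_keys(rows, key_columns=("radar", "datetime", "height")):
    """Return the key tuples that occur in more than one row."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, count in counts.items() if count > 1]


for key in duplicated_keys(rows):
    print(key)
```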

peterdesmet commented 1 year ago

@adokter how does bioRad deal with duplicate timestamps across multiple h5 files?

adokter commented 1 year ago

That isn't solved 100% satisfactorily yet, see https://github.com/adokter/bioRad/issues/371 - vpts objects with duplicate timestamps are allowed, but we currently just pick the first profile in certain applications.

peterdesmet commented 1 year ago

Picking the first profile is fine as an approach for me.

peterdesmet commented 1 year ago

Unfortunately, we can't enforce that through the standard. The combination of radar, datetime and height should be unique, but that cannot be expressed in Table Schema constraints. It might be worth testing https://specs.frictionlessdata.io/patterns/#specification-12, though.

stijnvanhoey commented 1 year ago

> Picking the first profile is fine as an approach for me.

Ok, we'll take the first one in case duplicates appear.
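
A rough sketch of what "take the first one" could look like, assuming the rows are already in the order in which the files were read, so that the profile to keep comes first. The `keep_first` helper is hypothetical and only illustrates the idea; with pandas, `DataFrame.drop_duplicates(subset=[...], keep="first")` should give the same result:

```python
def keep_first(rows, key_columns=("radar", "datetime", "height")):
    """Keep only the first row for each combination of the key columns."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Key columns and the u component, copied from the duplicate rows shown earlier.
rows = [
    {"radar": "bejab", "datetime": "2022-11-11 23:30:00Z", "height": 0, "u": -8.04189},
    {"radar": "bejab", "datetime": "2022-11-11 23:30:00Z", "height": 0, "u": -6.65065},
    {"radar": "bejab", "datetime": "2022-11-11 23:30:00Z", "height": 200, "u": -6.74806},
    {"radar": "bejab", "datetime": "2022-11-11 23:30:00Z", "height": 200, "u": -6.46451},
]

for row in keep_first(rows):
    print(row["height"], row["u"])
```

Note that this silently drops the second profile, so the order in which files are concatenated decides which measurement survives.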

stijnvanhoey commented 1 year ago

After reviewing the files generated in a first test batch, we decided to create non-conforming vpts files that include the duplicate timestamps. Hence, all profiles with duplicate timestamps in a single vpts will be taken into account, see https://github.com/enram/vptstools/issues/22

enram / vptstools

Not corresponding datetime between h5 filename and h5 content (/what/time, "HHmmss") #11