Closed (stijnvanhoey closed this issue 1 year ago)
@BerendWijers you are probably familiar with timestamp differences between the filename and the metadata in the file?
Personally I would take the timestamp as written in the filename.
I was wrong, bioRad uses the nominal time from the file metadata (22:45:00), not the filename (22:50:00):
library(bioRad)
vp <- read_vpfiles("bejab_vp_20221111T225000Z_0x9.h5")
> as.data.frame(vp)
radar datetime ff dbz dens u v gap w n_dbz dd n
1 bejab 2022-11-11 22:45:00 NaN -2.742583 17.424041748 NaN NaN 1 NaN 616 NaN 325
2 bejab 2022-11-11 22:45:00 6.999657 -4.776615 10.908013344 -6.8927603 -1.2186255 0 -40.6989670 15732 259.97382 4002
vptstools should do the same.
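To make the five-minute gap concrete, here is a minimal sketch that parses the timestamp embedded in the file name and compares it with the nominal time bioRad reports. The regular expression is an assumption inferred from the file names in this thread, and `filename_datetime` is a hypothetical helper, not part of vptstools or bioRad.

```python
import re
from datetime import datetime, timezone

# Assumed naming pattern, inferred from the example file names in this thread:
# <radar>_vp_<YYYYMMDD>T<HHMMSS>Z_<suffix>.h5
FILENAME_PATTERN = re.compile(r"^(?P<radar>[a-z]+)_vp_(?P<stamp>\d{8}T\d{6})Z_")


def filename_datetime(filename: str) -> datetime:
    """Return the UTC timestamp embedded in a vp file name (hypothetical helper)."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename}")
    parsed = datetime.strptime(match["stamp"], "%Y%m%dT%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


from_name = filename_datetime("bejab_vp_20221111T225000Z_0x9.h5")
# Nominal time as reported by bioRad for this file (22:45:00)
nominal = datetime(2022, 11, 11, 22, 45, tzinfo=timezone.utc)

offset = from_name - nominal
print(offset)  # 0:05:00
```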
Using the timestamp from the metadata, I'm running into trouble when concatenating multiple vp files into a vpts (for vpts CSV). E.g. when combining the files bejab_vp_20221111T233000Z_0x9.h5 and bejab_vp_20221111T234000Z_0x9.h5, with the following metadata:
FILENAME | /what/time |
---|---|
bejab_vp_20221111T233000Z_0x9.h5 | 233000 |
bejab_vp_20221111T234000Z_0x9.h5 | 233000 |
the resulting table (as a pandas DataFrame) ends up having duplicate entries, as both files contain the same timestamp 2022-11-11 23:30:00 in the metadata:
| | radar | datetime | height | u | v | w | ff | dd | sd_vvp | gap | eta | dens | dbz | dbz_all | n | n_dbz | n_all | n_dbz_all | rcs | sd_vvp_threshold | vcp | radar_longitude | radar_latitude | radar_height | radar_wavelength |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | bejab | 2022-11-11 23:30:00Z | 0 | -8.04189 | 1.20268 | 0.383004 | 8.13132 | 278.506 | 3.04745 | False | 135.523 | 12.3203 | -4.24786 | 21.9194 | 172 | 456 | 3988 | 8769 | 11 | 2 | 0 | 3.0642 | 51.1917 | 50 | 5.3 |
0 | bejab | 2022-11-11 23:30:00Z | 0 | -6.65065 | -0.546647 | -35.8612 | 6.67308 | 265.301 | 3.20129 | False | 238.708 | 21.7007 | -1.78933 | 21.2997 | 520 | 1487 | 3928 | 8765 | 11 | 2 | 0 | 3.0642 | 51.1917 | 50 | 5.3 |
1 | bejab | 2022-11-11 23:30:00Z | 200 | -6.74806 | -1.32668 | -48.9435 | 6.87724 | 258.877 | 2.89076 | False | 132.228 | 12.0207 | -4.35478 | 15.5665 | 5021 | 15915 | 8410 | 22624 | 11 | 2 | 0 | 3.0642 | 51.1917 | 50 | 5.3 |
1 | bejab | 2022-11-11 23:30:00Z | 200 | -6.46451 | -1.81548 | -62.7614 | 6.7146 | 254.313 | 2.86414 | False | 142.676 | 12.9706 | -4.02448 | 17.685 | 4808 | 15605 | 8493 | 22671 | 11 | 2 | 0 | 3.0642 | 51.1917 | 50 | 5.3 |
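A minimal sketch of how such duplicates can be flagged after concatenation, assuming the combination of `radar`, `datetime` and `height` is meant to identify a row. Only a few columns are reproduced; the `dens` values are copied from the table above, while the `source_file` column and the assignment of rows to files are assumptions added for illustration.

```python
import pandas as pd

KEY = ["radar", "datetime", "height"]

# Stand-ins for the profiles read from the two files (subset of columns).
first = pd.DataFrame({
    "radar": ["bejab", "bejab"],
    "datetime": pd.to_datetime(["2022-11-11 23:30:00Z"] * 2),
    "height": [0, 200],
    "dens": [12.3203, 12.0207],
    "source_file": ["bejab_vp_20221111T233000Z_0x9.h5"] * 2,  # hypothetical column
})
second = pd.DataFrame({
    "radar": ["bejab", "bejab"],
    "datetime": pd.to_datetime(["2022-11-11 23:30:00Z"] * 2),
    "height": [0, 200],
    "dens": [21.7007, 12.9706],
    "source_file": ["bejab_vp_20221111T234000Z_0x9.h5"] * 2,  # hypothetical column
})

vpts = pd.concat([first, second], ignore_index=True)

# keep=False marks every member of a duplicated group, not only the repeats
dup_mask = vpts.duplicated(subset=KEY, keep=False)
print(vpts.loc[dup_mask].sort_values(KEY + ["source_file"]))
```

Keeping a column that records the originating file makes it possible to tell the otherwise indistinguishable rows apart when inspecting the result.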
@peterdesmet How do we handle this? Any chance we can cover this by the current standard or fix this in an earlier stage?
@adokter how does bioRad deal with duplicate timestamps across multiple h5 files?
That isn't solved 100% satisfactorily yet, see https://github.com/adokter/bioRad/issues/371 - vpts objects with duplicate timestamps are allowed, but we currently just pick the first profile in certain applications
Picking the first profile is fine as an approach for me.
Unfortunately, we can't enforce that through the standard. The combination of radar, datetime and height should be unique, but that cannot be expressed in Table Schema constraints. It might be worth testing https://specs.frictionlessdata.io/patterns/#specification-12, though.
> Picking the first profile is fine as an approach for me.
Ok, we'll take the first one in case duplicates appear.
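The "take the first one" rule can be sketched with pandas as follows. This illustrates the approach under the assumption that rows are already in file order; it is not the vptstools implementation, and only a few columns with values from the table earlier in the thread are used.

```python
import pandas as pd

KEY = ["radar", "datetime", "height"]

# Rows in concatenation order: both heights from one file, then from the next.
vpts = pd.DataFrame({
    "radar": ["bejab"] * 4,
    "datetime": pd.to_datetime(["2022-11-11 23:30:00Z"] * 4),
    "height": [0, 200, 0, 200],
    "dens": [12.3203, 12.0207, 21.7007, 12.9706],
})

# keep="first" retains the earliest row of each (radar, datetime, height) group
deduplicated = vpts.drop_duplicates(subset=KEY, keep="first").reset_index(drop=True)
print(deduplicated)
```

Because `keep="first"` depends on row order, the outcome is only reproducible if the files are always concatenated in the same order, for example sorted by file name.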
After reviewing the files generated in a first test batch, we decided to create non-conforming vpts files with the duplicate timestamps included. Hence, all duplicate timestamps in a single vpts will be taken into account, see https://github.com/enram/vptstools/issues/22
I downloaded a set of files from the bejab data as a test case and, while trying out the CSV concatenation (to create a vpts-csv), I encountered timestamps that are repeated across multiple files and do not correspond to the timestamp in the file name:
To check, I downloaded some files directly from the Baltrad sftp and compared the timestamp in the file name with the /what/time timestamp in the file, which revealed several such differences (67% in my quick test on 50 files):
@peterdesmet is this a known issue or am I stuck on a bug I just can't get around? For the latter experiment I relied only on the h5py package as a dependency (I left out the vptstools modules and just tried to extract only the timestamps):
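A minimal sketch of such a check using only h5py, not the code used in the test above. It assumes that `date` and `time` are stored as attributes of the `/what` group, as in ODIM H5 files, and that the file names follow the pattern seen in this thread; all function names are hypothetical.

```python
import re
from datetime import datetime, timezone
from pathlib import Path

import h5py


def _text(value):
    """Attributes may come back as bytes or str depending on how they were written."""
    return value.decode("ascii") if isinstance(value, bytes) else str(value)


def nominal_datetime(path):
    """Read /what/date and /what/time (assumed attribute layout) as a UTC datetime."""
    with h5py.File(path, "r") as h5:
        what = h5["what"].attrs
        stamp = _text(what["date"]) + _text(what["time"])
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def filename_datetime(path):
    """Parse the timestamp from a name like bejab_vp_20221111T233000Z_0x9.h5."""
    match = re.search(r"_(\d{8}T\d{6})Z_", Path(path).name)
    if match is None:
        raise ValueError(f"Unexpected file name: {path}")
    parsed = datetime.strptime(match.group(1), "%Y%m%dT%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def find_mismatches(paths):
    """Return (file name, file-name time, metadata time) for each file where they differ."""
    result = []
    for path in paths:
        from_name = filename_datetime(path)
        from_metadata = nominal_datetime(path)
        if from_name != from_metadata:
            result.append((Path(path).name, from_name, from_metadata))
    return result
```

Calling `find_mismatches` on a list of downloaded files and dividing the length of the result by the number of files gives the share of files whose two timestamps disagree.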
The time difference might not be an issue if the timestamps are unique among the different files. Or should we rather use the timestamp from the file path of the h5 files?