enram / vpts-csv

Data exchange format for biological signals detected by weather radars
https://aloftdata.eu/vpts-csv/
MIT License

Does the format allow multiple rows with the same height and timestamp? #48

Open niconoe opened 3 weeks ago

niconoe commented 3 weeks ago

I couldn't find the answer in the documentation page, but I've encountered such a file: https://aloftdata.s3-eu-west-1.amazonaws.com/baltrad/daily/bewid/2024/bewid_vpts_20240314.csv

(that causes issues with CROW, see https://github.com/enram/crow/issues/16)

It would be great to clarify this expectation, not just for me here, but also on the documentation page.

I also realize now that it's unclear to me if a single vpts-csv file can cover multiple radars?

peterdesmet commented 3 weeks ago

What is allowed by the standard?

  1. Can a VPTS file cover multiple radars? Yes.

  2. There is no expectation that the radar, datetime and height combination should be unique. I guess we should clarify that in the standard?

If we wanted to enforce a unique combination, it could be expressed with primaryKey, which would invalidate duplicate combinations:

"primaryKey": ["radar", "datetime", "height"]

I don't think we want that however, because when creating vpts files, it is hard to guarantee this (see further).
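The check that such a primaryKey would enforce can be sketched in plain Python. This is a minimal, stdlib-only illustration, not how the Frictionless tooling validates it; the radar, datetime and height column names come from the spec, while the dens column and the sample rows are invented for the example:

```python
import csv
import io
from collections import Counter

def duplicate_keys(csv_text):
    """Return the (radar, datetime, height) combinations that occur more than once."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter((row["radar"], row["datetime"], row["height"]) for row in reader)
    return {key for key, n in counts.items() if n > 1}

# Toy VPTS CSV fragment with one duplicated radar/datetime/height combination
sample = """radar,datetime,height,dens
bewid,2024-03-14T00:15:00Z,200,12.5
bewid,2024-03-14T00:15:00Z,200,9.1
bewid,2024-03-14T00:15:00Z,400,3.2
"""

print(duplicate_keys(sample))
# {('bewid', '2024-03-14T00:15:00Z', '200')}
```

With a primaryKey declared, a validator would reject the first two rows above as a constraint violation; without it, both rows are valid and it is up to the consumer to decide what they mean.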

What is causing duplicate records?

We have encountered two causes for duplicate records:

  1. The source files are provided at 5 minute intervals, but in those files, the vp$datetime attribute is rounded to 15 minutes, thus creating duplicate timestamps in the vpts data. This is the case for the bewid example you share.
bewid_vp_20240314T001500Z_0xb.h5 # vp$datetime: "2024-03-14 00:15:00 UTC"
bewid_vp_20240314T002000Z_0xb.h5 # vp$datetime: "2024-03-14 00:15:00 UTC"
bewid_vp_20240314T002500Z_0xb.h5 # vp$datetime: "2024-03-14 00:15:00 UTC"

The resulting VPTS CSV file will have different data for the same timestamp.

  2. Multiple hdf5 files are provided for the same timestamp, but with different suffixes in the name. This is the case for this dkste example:
dkste_vp_20171201T0010Z.h5 | 2023-07-18T18:06:02.000Z
dkste_vp_20171201T0010Z_0xf00207.h5 | 2023-07-18T18:06:02.000Z
dkste_vp_20171201T0010Z_0xf00207_151208737782.h5

The resulting VPTS CSV file will have the same data for the same timestamp, except for source_file.

How can we fix duplicate records?

Processing from hdf5 to CSV is file-based, making it hard to catch these duplicates. As a result, they are present in the VPTS data. The VPTS data thus presents the full scope of the data that was there.

I think any fix should be done in the readers of the data, like CROW or bioRad.
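A reader-side fix could be as simple as keeping the first row seen per (radar, datetime, height) combination. The sketch below is a hedged illustration of that idea, not code from CROW or bioRad; the dens and source_file columns and the sample rows are invented:

```python
import csv
import io

def keep_first(rows, key=("radar", "datetime", "height")):
    """Yield only the first row encountered for each key combination."""
    seen = set()
    for row in rows:
        k = tuple(row[c] for c in key)
        if k not in seen:
            seen.add(k)
            yield row

# Two rows share a timestamp/height; only the first survives
sample = """radar,datetime,height,dens,source_file
bewid,2024-03-14T00:15:00Z,200,12.5,bewid_vp_20240314T001500Z_0xb.h5
bewid,2024-03-14T00:15:00Z,200,9.1,bewid_vp_20240314T002000Z_0xb.h5
"""

rows = list(keep_first(csv.DictReader(io.StringIO(sample))))
print(len(rows))
# 1
```

"First one wins" is arbitrary, which is exactly the concern raised below: the standard gives readers no guidance on which duplicate to prefer.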

niconoe commented 1 week ago

Can a VPTS file cover multiple radars? Yes.

Noted, thanks! CROW currently cannot deal with those but that's not a problem I guess, just something good to know for me!

There is no expectation that the radar, datetime and height combination should be unique. I guess we should clarify that in the standard?

I am not sure it should be clarified: the fact that a combination of fields is allowed is the default case, so in general that's not the kind of thing we need to be explicit about. But in this specific case, and from the "data consumer" standpoint, I found it quite confusing. At least for CROW, we can only display a single value per height and time, so finding multiple ones in the source, without clear guidance about what that represents (in the real world) or an indication of which one should be discarded, feels a bit weird.

So while circumventing the issue by reading the first one in a specific reader is easy (I'll do it for CROW soon!), I am not sure I agree with "any fix should be done in the readers of the data" if we're talking at the community/standard level.

Let's take an analogy and say we are building a standard for satellite imagery, where each file is a mosaic of multiple pictures taken at different times by different satellites. If the goal of the standard is to provide an image of Earth (each data point is a pixel with 3 dimensions: X, Y and color), it would feel strange to me to have multiple colors for a given X,Y coordinate, resulting in larger files and passing the following message to readers (à la Google Maps): "Yeah, when you have two colors for the same point, just choose which one you want to show to the user; this happens because when generating the file we don't really want to deal with the case where two 'initial pictures' overlap."

I feel like the "data curation" part (having two competing values for a point and choosing which one to include, or maybe something else, like the mean of the two) is better suited to the "data production" step. The obvious downside is some data loss, because of this curation.
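For what the "mean of the two" option could look like at the production step, here is a minimal, stdlib-only sketch. It is an assumption about one possible curation rule, not anything the spec or vptstools prescribes, and the dens column and sample rows are invented:

```python
import csv
import io
from collections import defaultdict
from statistics import mean

def curate(rows, value_col="dens", key=("radar", "datetime", "height")):
    """Collapse competing rows per key combination by averaging a numeric column."""
    groups = defaultdict(list)
    order = []  # preserve first-seen order of keys
    for row in rows:
        k = tuple(row[c] for c in key)
        if k not in groups:
            order.append(k)
        groups[k].append(float(row[value_col]))
    return [dict(zip(key, k), **{value_col: mean(groups[k])}) for k in order]

# Two competing density values for the same radar/datetime/height
sample = """radar,datetime,height,dens
bewid,2024-03-14T00:15:00Z,200,12.5
bewid,2024-03-14T00:15:00Z,200,9.1
"""

print(curate(csv.DictReader(io.StringIO(sample))))
```

Averaging only makes sense when the duplicates really are competing measurements of the same thing; for the dkste case above (identical data, different source_file), dropping all but one row would be the better rule.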

Just my 2 cents, I don't pretend I am right :D