peterdesmet closed this issue 2 years ago
@adokter @bart1 @niconoe @CeciliaNilsson709 @baptischmi @Rafnuss Feedback welcome on the above proposal.
I think I slightly prefer source/format/radar/yyyy/, but either way it will fit some users and not others. Tools will have to be adapted between the US structure and this one anyway, even if we put the year first.
No strong preference between the two proposals. I agree with all the design principles and think we should stick to one of the two proposals.
I'm not a fan of the approaches that use too many levels, such as yyyy/mm/dd/HH/MM/: little added value, but a lot of heavy path manipulation in tools, in my experience.
Hi @peterdesmet, here are a few thoughts:
I've found that directory structure matters a lot in terms of how quickly you can fetch the data from AWS, and you want to avoid too many files in a single directory as it really slows downloads and searches for available files. Therefore, for h5 I would add a deeper directory structure than for csv.
My preference is a directory structure with date before radar. The reason is that this better suits full-network analyses, in which you want to be able to efficiently grab all the available radar data for a given date or period. Such downloads are much faster when the date is up front in the tree. In fact I have processed vp data into radar/date order in the past, and I've come to regret it because of how slow data retrieval from s3 becomes.
A main downside of a date/radar ordering is that downloading multiple years for a single radar requires more cycling through the tree. But I've found that is typically fairly easy, because the set of potentially available dates to search for is well defined (while the set of potentially available radars is not). I suspect these considerations also led NEXRAD to order date before radar, see https://s3.amazonaws.com/noaa-nexrad-level2/index.html
Taking these together, for h5 I would recommend source/format/yyyy/mm/dd/radar, and for csv source/format/yyyy/radar.
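The query-pattern difference between the two orderings can be sketched as prefix construction (a minimal sketch; the bucket layout follows the proposals in this thread, and the function names are hypothetical): with the date first, one prefix covers a whole network-day, while a radar-first layout needs one prefix per radar.

```python
from datetime import date

def network_day_prefix(source: str, fmt: str, d: date) -> str:
    """Date-first layout (source/format/yyyy/mm/dd/radar):
    a single prefix covers every radar for one day."""
    return f"{source}/{fmt}/{d:%Y/%m/%d}/"

def radar_day_prefixes(source: str, fmt: str, radars: list, d: date) -> list:
    """Radar-first layout (source/format/radar/yyyy/mm/dd/):
    the same network-day needs one prefix per known radar."""
    return [f"{source}/{fmt}/{r}/{d:%Y/%m/%d}/" for r in radars]

print(network_day_prefix("baltrad", "h5", date(2020, 1, 1)))
# → baltrad/h5/2020/01/01/
```

A full-network query against the radar-first layout also has to know the radar list up front, which is the asymmetry @adokter describes.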
Why daily csv files and not monthly? I find monthly files a nice compromise between file size and having a substantial period in the file. Daily files are very short and you end up with many, while year files get very big.
Thank you @peterdesmet
I'll support @adokter's suggestion, although source/format/radar/yyyy/ allows you to quickly get the temporal coverage of a radar (which is not a given in Europe).
As mentioned by @niconoe, it's good to have only a few levels (as proposed here, in contrast to yyyy/mm/dd/HH/MM/) for monthly csv tables, but as mentioned by @adokter maybe not for single h5 files, since having too many files in one directory slows the search for and download of files.
@peterdesmet thanks for the explanation. For clarity, I will first describe what we use at UvA to deal with different projects and data. This is a somewhat similar problem to the pipeline issue. As we deal with both pvols and vps, and both can have their own differences, we have a two-tiered system for pvols and a three-tiered one for vps.
The structure is project/pvol_settings for pvols and project/pvol_settings/vp_settings for vps. Each vp in this structure can easily be referred back to a pvol. We quite regularly end up exploring different settings for both constructing the pvols and the vps. It might be worth considering that vps can be calculated with different settings. @berendwijers do you have anything to add to this?
On the order of year and radar I do prefer radar first as it allows for a quick overview of what time period a radar covers. Although it is hard to know how much of that is just the habit of always having it like that.
Thanks for the suggestions all!
@adokter: Therefore, for h5 I would add a deeper directory structure than for csv.
Agreed, that is a worry I had too.
@adokter My preference is a directory structure with date before radar. The reason is this better suits full-network analyses, in which you want to be able to efficiently grab all the available radar data for a given date or period. vs @baptischmi @bart1 ... quick overview of what time period a radar covers ...
Even though Europe probably lends itself less to full-network analyses than the US, I can see how efficiently sampling a time period will probably always be part of an analysis. An overview per radar can likely be provided as summary data, cf. the coverage.csv
file we currently have. So I'd support putting temporal information first.
@adokter Why daily csv files and not monthly?
For CROW, small files are better (since it's all done in the browser); we currently use daily files. For bioRad we could produce monthly files instead of yearly files if that is more convenient? If we do, I would introduce month directories.
@bart1 We quite regularly end up exploring different settings for both constructing the pvols and vp.
That makes sense for UvA, but for the data repository I hope to provide a consensus view, where the best possible data is given for a certain source. We already add the complexity of choosing between different sources, I'd like to avoid adding the complexity of having to choose between different processing too.
Given that monthly files might be more convenient and we don't want too deep a file structure, we could use month directories for all? Or would you prefer to keep the day as part of the path?
# source/format/yyyy/mm/radar/
# original hdf5 vp files
baltrad/h5/2020/01/bejab/bejab_vp_20200101T000000Z_0x9.h5
baltrad/h5/2020/01/bejab/bejab_vp_20200101T000500Z_0x9.h5
baltrad/h5/2020/01/bejab/... # 60/5*24*31 = 8,928 files per directory
baltrad/h5/2020/01/bejab/bejab_vp_20200131T235500Z_0x9.h5 # last file for that month
baltrad/h5/2020/01/bewid/
baltrad/h5/2020/02/behel/
# tabular data products (zipped and unzipped)
baltrad/csv/2020/01/bejab/
baltrad/csv/2020/01/bejab/bejab_vpts_202001.csv.gz
baltrad/csv/2020/01/bejab/bejab_vpts_20200101.csv
baltrad/csv/2020/01/bejab/bejab_vpts_20200102.csv
baltrad/csv/2020/01/bejab/... # 31 files per directory
baltrad/csv/2020/01/bejab/bejab_vpts_20200131.csv
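For tooling, this proposal maps to a simple path template. A minimal sketch (the function name is hypothetical; the trailing `0x9` qualifier is copied verbatim from the example filenames above):

```python
from datetime import datetime

def h5_path(source: str, radar: str, ts: datetime) -> str:
    """Build the hdf5 vp path under source/format/yyyy/mm/radar/."""
    return (f"{source}/h5/{ts:%Y/%m}/{radar}/"
            f"{radar}_vp_{ts:%Y%m%dT%H%M%S}Z_0x9.h5")

print(h5_path("baltrad", "bejab", datetime(2020, 1, 1, 0, 0)))
# → baltrad/h5/2020/01/bejab/bejab_vp_20200101T000000Z_0x9.h5
```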
To more easily compare, here's the structure suggested by @adokter. It mimics the US structure (for the h5 data).
# source/format/yyyy/mm/dd/radar/
# original hdf5 vp files
baltrad/h5/2020/01/01/bejab/bejab_vp_20200101T000000Z_0x9.h5
baltrad/h5/2020/01/01/bejab/bejab_vp_20200101T000500Z_0x9.h5
baltrad/h5/2020/01/01/bejab/... # 60/5*24 = 288 files per directory
baltrad/h5/2020/01/01/bejab/bejab_vp_20200101T235500Z_0x9.h5 # last file for that day
baltrad/h5/2020/01/01/bewid/
baltrad/h5/2020/02/01/behel/
# tabular data products (zipped and unzipped)
baltrad/csv/2020/bejab/
baltrad/csv/2020/bejab/bejab_vpts_202001.csv.gz
baltrad/csv/2020/bejab/... # 12 zipped files per directory
baltrad/csv/2020/bejab/bejab_vpts_202012.csv.gz
baltrad/csv/2020/bejab/bejab_vpts_20200101.csv
baltrad/csv/2020/bejab/bejab_vpts_20200102.csv
baltrad/csv/2020/bejab/... # 365 unzipped files per directory
baltrad/csv/2020/bejab/bejab_vpts_20201231.csv
Personally I think that structure works quite well and is better than the month directories I suggested above.
I have been thinking a bit more about this.
With this structure, checking data availability for long time series in the original vp data gets slightly more difficult, as many directories need to be searched, especially since radars/data streams might go on and off somewhat frequently (leading to missing days). As most people will probably access the data through the csv, this is maybe not much of an issue. But for data quality checking I tend to go by radar first.
This might be a wider argument for putting the radar first: for European data at least, the first step of an analysis is frequently a quality check, including checking whether the quality changed over time. A structure with radar first might facilitate this.
I also imagine that people are more likely to do analyses with a limited geographic scope than with a limited temporal scope. Here I'm thinking of, for example, ecological consultants interested in a region. The people wanting to do analyses over a large geographic scope, I suspect, also want a long time series and are likely more technically savvy.
Thanks for the input @bart1, I agree with your arguments. The European data are not homogeneous in quality or coverage like the US data, but very radar (country) dependent. So it makes sense to be able to select on radar up front. And as @CeciliaNilsson709 mentions, we'll likely require different functions anyway to query US vs EU data, so aligning is only partly useful. In any case, I don't think there is a right decision here, we just need to make one.
So, suggestion 5; feedback welcome.
# source/format/radar/yyyy/mm/dd/
# original hdf5 vp files: same as original proposal but deeper hierarchy
baltrad/hdf5/bejab/2020/01/01/bejab_vp_20200101T000000Z_0x9.h5
baltrad/hdf5/bejab/2020/01/01/bejab_vp_20200101T000500Z_0x9.h5
baltrad/hdf5/bejab/2020/01/01/... # 60/5*24 = 288 files per directory
baltrad/hdf5/bejab/2020/01/01/bejab_vp_20200101T235500Z_0x9.h5
baltrad/hdf5/bewid/
baltrad/hdf5/behel/
baltrad/hdf5/...
# daily csv
baltrad/csv-daily/bejab/2020/
baltrad/csv-daily/bejab/2020/bejab_vpts_202012.csv.gz
baltrad/csv-daily/bejab/2020/bejab_vpts_20200101.csv
baltrad/csv-daily/bejab/2020/bejab_vpts_20200102.csv
baltrad/csv-daily/bejab/2020/... # 365 unzipped files per directory
baltrad/csv-daily/bejab/2020/bejab_vpts_20201231.csv
# monthly csv
baltrad/csv-monthly/bejab/2020/bejab_vpts_202001.csv.gz
baltrad/csv-monthly/bejab/2020/... # 12 zipped files per directory
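Suggestion 5 also reduces to a few path templates. A minimal sketch (function names hypothetical; the `0x9` qualifier copied verbatim from the example filenames):

```python
from datetime import date, datetime

def vp_h5_path(source: str, radar: str, ts: datetime) -> str:
    """hdf5 vp files under source/format/radar/yyyy/mm/dd/."""
    return (f"{source}/hdf5/{radar}/{ts:%Y/%m/%d}/"
            f"{radar}_vp_{ts:%Y%m%dT%H%M%S}Z_0x9.h5")

def daily_csv_path(source: str, radar: str, d: date) -> str:
    """Daily csv files under source/csv-daily/radar/yyyy/."""
    return f"{source}/csv-daily/{radar}/{d:%Y}/{radar}_vpts_{d:%Y%m%d}.csv"

def monthly_csv_path(source: str, radar: str, d: date) -> str:
    """Monthly zipped csv files under source/csv-monthly/radar/yyyy/."""
    return f"{source}/csv-monthly/{radar}/{d:%Y}/{radar}_vpts_{d:%Y%m}.csv.gz"
```

For example, `vp_h5_path("baltrad", "bejab", datetime(2020, 1, 1))` reproduces the first hdf5 path listed above.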
From my point of view (that of someone who would write code accessing the repository, but not explore it by hand), I'm not a huge fan of all those subdirectories (radar/yyyy/mm/dd), since they just repeat data that's already in the filename (bejab_vp_20200101T235500Z_0x9.h5) and add a lot of verbose, error-prone path manipulation.
From that perspective, having almost all files in a single flat directory would actually be perfectly fine (since the filename already provides all the metadata). That would also circumvent the discussion about which subdirectory (radar or year) should be at the highest level.
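The point that the filename alone carries all the metadata can be illustrated with a small parser. A sketch (the regex is hypothetical, inferred from the example filenames in this thread: a 5-letter radar code, a timestamp, and an `0x…` qualifier):

```python
import re

# Pattern inferred from filenames like bejab_vp_20200101T235500Z_0x9.h5
VP_FILENAME = re.compile(
    r"(?P<radar>[a-z]{5})_vp_"
    r"(?P<timestamp>\d{8}T\d{6})Z_"
    r"(?P<qualifier>0x[0-9a-fA-F]+)\.h5$"
)

def parse_vp_filename(name):
    """Extract radar, timestamp and qualifier from a vp filename."""
    m = VP_FILENAME.search(name)
    return m.groupdict() if m else None

print(parse_vp_filename("bejab_vp_20200101T235500Z_0x9.h5"))
# → {'radar': 'bejab', 'timestamp': '20200101T235500', 'qualifier': '0x9'}
```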
On our S3 server @ UvA I've steered away from a flat structure. I feel the same about the duplication of information in path and filename. However, I did notice quite a performance hit when storing everything together in a single bucket and you want to, for example, provide an overview of unique radars, unique radar-years, etc.
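That performance difference comes from how S3 listing works: with a hierarchy, a delimiter-based listing (ListObjectsV2 with `Delimiter="/"`) returns only the "common prefixes" directly under a prefix (e.g. the unique radars), while a flat bucket forces you to page through and parse every key. A pure-Python mimic of that delimiter behaviour (hypothetical keys, following suggestion 5):

```python
def common_prefixes(keys, prefix, delimiter="/"):
    """Mimic S3 ListObjectsV2 with a Delimiter: return the
    'subdirectories' directly under `prefix`, without enumerating
    the deeper keys themselves."""
    found = set()
    for key in keys:
        if key.startswith(prefix):
            rest = key[len(prefix):]
            if delimiter in rest:
                found.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
    return sorted(found)

keys = [
    "baltrad/hdf5/bejab/2020/01/01/bejab_vp_20200101T000000Z_0x9.h5",
    "baltrad/hdf5/bejab/2020/01/01/bejab_vp_20200101T000500Z_0x9.h5",
    "baltrad/hdf5/bewid/2020/01/01/bewid_vp_20200101T000000Z_0x9.h5",
]
print(common_prefixes(keys, "baltrad/hdf5/"))
# → ['baltrad/hdf5/bejab/', 'baltrad/hdf5/bewid/']
```

In a flat bucket the same "unique radars" question requires fetching every key and parsing its filename, which is the hit described above.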
Thanks all for the input! There is consensus on the structure suggested in https://github.com/enram/data-repository/issues/65#issuecomment-1103856289:
source/format/radar/yyyy/mm/dd/ for hdf5
source/format/radar/yyyy/ for csv data products
Update: consensus for the structure suggested in https://github.com/enram/data-repository/issues/65#issuecomment-1103856289
Current
The vp data in the ENRAM data repository are currently organized as:
Design principles
Proposal
1. radar/yyyy
radar/yyyy structure (and filename convention for vpts).
yyyy/mm/dd/ (files for all radars).
The BALTRAD PVOL archive uses yyyy/mm/dd/HH/MM/ (files for all radars).
Although organizing by year first has some benefits, the fact that there is no radar directory makes it hard for tools to find data for a specific radar, which is almost always part of the query.
2. yyyy/radar
A valid alternative is switching the radar and year levels: