peterdesmet closed this issue 2 years ago
@adokter @bart1 @niconoe @CeciliaNilsson709 @baptischmi @Rafnuss Feedback welcome on the above proposal.
I think I slightly prefer source/format/radar/yyyy/, but either way it will fit some users and not others. Tools will have to be adapted between the US structure and this one anyway, even if we put the year first.
No strong preference between the two proposals. I agree with all the design principles and think we should stick to one of the two proposals.
I'm not a fan of the approaches that use too many levels, such as yyyy/mm/dd/HH/MM/: little added value, but a lot of heavy path manipulation in tools, in my experience.
Hi @peterdesmet, here are a few thoughts:
I've found that directory structure matters a lot in terms of how quickly you can fetch the data from AWS, and you want to avoid too many files in a single directory as it really slows downloads and searches for available files. Therefore, for h5 I would add a deeper directory structure than for csv.
My preference is a directory structure with date before radar. The reason is that this better suits full-network analyses, in which you want to be able to efficiently grab all the available radar data for a given date or period. Such downloads are much faster when the date is up front in the tree. In fact I have processed vp data into radar/date order in the past, and I've come to regret it because of how slow data retrieval from s3 becomes.
A main downside of a date/radar ordering is that downloading multiple years for a single radar requires more cycling through the tree. But I've found that is typically fairly easy, because the set of potentially available dates to search for is well defined (while the set of potentially available radars is not). I suspect these considerations also led NEXRAD to order date before radar, see https://s3.amazonaws.com/noaa-nexrad-level2/index.html
Taking these together, for h5 I would recommend source/format/yyyy/mm/dd/radar, and for csv source/format/yyyy/radar.
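The query-pattern difference between the two orderings can be sketched as prefix construction (a minimal sketch; the bucket layout follows the proposals in this thread, and the function names are hypothetical): with the date first, one prefix covers a whole network-day, while a radar-first layout needs one prefix per radar.

```python
from datetime import date

def network_day_prefix(source: str, fmt: str, d: date) -> str:
    """Date-first layout (source/format/yyyy/mm/dd/radar):
    a single prefix covers every radar for one day."""
    return f"{source}/{fmt}/{d:%Y/%m/%d}/"

def radar_day_prefixes(source: str, fmt: str, radars: list, d: date) -> list:
    """Radar-first layout (source/format/radar/yyyy/mm/dd/):
    the same network-day needs one prefix per known radar."""
    return [f"{source}/{fmt}/{r}/{d:%Y/%m/%d}/" for r in radars]

print(network_day_prefix("baltrad", "h5", date(2020, 1, 1)))
# → baltrad/h5/2020/01/01/
```

A full-network query against the radar-first layout also has to know the radar list up front, which is the asymmetry @adokter describes.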
Why daily csv files and not monthly? I find monthly files a nice compromise between file size and having a substantial period in the file. Daily files are very short and you end up with many, while year files get very big.
Thank you @peterdesmet
I'll support @adokter's suggestion, although source/format/radar/yyyy/ allows you to quickly get the temporal coverage of a radar (which is not a given in Europe).
As mentioned by @niconoe, it's good to have only a few levels (as proposed here, in contrast to yyyy/mm/dd/HH/MM/) for monthly csv tables, but as mentioned by @adokter maybe not for single h5 files, since having too many files in one directory slows the search for and download of files.
@peterdesmet thanks for the explanation. For clarity, I will first describe what we use at UvA to deal with different projects and data. This is a somewhat similar problem to the pipeline issue. As we deal with both pvols and vps, and both can have their own differences, we have a two-tiered system for pvols and a three-tiered one for vps.
The structure is project/pvol_settings for pvols and project/pvol_settings/vp_settings for vps. Each vp in this structure can easily be referred back to a pvol. We quite regularly end up exploring different settings for both constructing the pvols and the vps. It might be worth considering that vps can be calculated with different settings. @berendwijers do you have anything to add to this?
On the order of year and radar I do prefer radar first as it allows for a quick overview of what time period a radar covers. Although it is hard to know how much of that is just the habit of always having it like that.
Thanks for the suggestions all!
@adokter: Therefore, for h5 I would add a deeper directory structure than for csv.
Agreed, that is a worry I had too.
@adokter My preference is a directory structure with date before radar. The reason is this better suits full-network analyses, in which you want to be able to efficiently grab all the available radar data for a given date or period. vs @baptischmi @bart1 ... quick overview of what time period a radar covers ...
Even though Europe probably lends itself less to full-network analyses than the US, I can see how efficiently sampling a time period will probably always be part of an analysis. An overview per radar can likely be provided as summary data, cf. the coverage.csv
file we currently have. So I'd support putting temporal information first.
@adokter Why daily csv files and not monthly?
For CROW, small files are better (since it's all done in the browser); we currently use daily files. For bioRad we could produce monthly files instead of yearly files if that is more convenient? If we do, I would introduce month directories.
@bart1 We quite regularly end up exploring different settings for both constructing the pvols and vp.
That makes sense for UvA, but for the data repository I hope to provide a consensus view, where the best possible data is given for a certain source. We already add the complexity of choosing between different sources, I'd like to avoid adding the complexity of having to choose between different processing too.
Given that monthly files might be more convenient and we don't want too deep a file structure, we could use month directories for all? Or would you prefer to keep the day as part of the path?
# source/format/yyyy/mm/radar/
# original hdf5 vp files
baltrad/h5/2020/01/bejab/bejab_vp_20200101T000000Z_0x9.h5
baltrad/h5/2020/01/bejab/bejab_vp_20200101T000500Z_0x9.h5
baltrad/h5/2020/01/bejab/... # 60/5*24*31 = 8,928 files per directory
baltrad/h5/2020/01/bejab/bejab_vp_20200131T235500Z_0x9.h5 # last file for that month
baltrad/h5/2020/01/bewid/
baltrad/h5/2020/02/behel/
# tabular data products (zipped and unzipped)
baltrad/csv/2020/01/bejab/
baltrad/csv/2020/01/bejab/bejab_vpts_202001.csv.gz
baltrad/csv/2020/01/bejab/bejab_vpts_20200101.csv
baltrad/csv/2020/01/bejab/bejab_vpts_20200102.csv
baltrad/csv/2020/01/bejab/... # 31 files per directory
baltrad/csv/2020/01/bejab/bejab_vpts_20200131.csv
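For tooling, this proposal maps to a simple path template. A minimal sketch (the function name is hypothetical; the trailing `0x9` qualifier is copied verbatim from the example filenames above):

```python
from datetime import datetime

def h5_path(source: str, radar: str, ts: datetime) -> str:
    """Build the hdf5 vp path under source/format/yyyy/mm/radar/."""
    return (f"{source}/h5/{ts:%Y/%m}/{radar}/"
            f"{radar}_vp_{ts:%Y%m%dT%H%M%S}Z_0x9.h5")

print(h5_path("baltrad", "bejab", datetime(2020, 1, 1, 0, 0)))
# → baltrad/h5/2020/01/bejab/bejab_vp_20200101T000000Z_0x9.h5
```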
To more easily compare, here's the structure suggested by @adokter. It mimics the US structure (for the h5 data).
# source/format/yyyy/mm/dd/radar/
# original hdf5 vp files
baltrad/h5/2020/01/01/bejab/bejab_vp_20200101T000000Z_0x9.h5
baltrad/h5/2020/01/01/bejab/bejab_vp_20200101T000500Z_0x9.h5
baltrad/h5/2020/01/01/bejab/... # 60/5*24 = 288 files per directory
baltrad/h5/2020/01/01/bejab/bejab_vp_20200101T235500Z_0x9.h5 # last file for that day
baltrad/h5/2020/01/01/bewid/
baltrad/h5/2020/02/01/behel/
# tabular data products (zipped and unzipped)
baltrad/csv/2020/bejab/
baltrad/csv/2020/bejab/bejab_vpts_202001.csv.gz
baltrad/csv/2020/bejab/... # 12 zipped files per directory
baltrad/csv/2020/bejab/bejab_vpts_202012.csv.gz
baltrad/csv/2020/bejab/bejab_vpts_20200101.csv
baltrad/csv/2020/bejab/bejab_vpts_20200102.csv
baltrad/csv/2020/bejab/... # 365 unzipped files per directory
baltrad/csv/2020/bejab/bejab_vpts_20201231.csv
Personally I think that structure works quite well and is better than the month directories I suggested above.
I have been thinking a bit more about this.
With this structure, checking data availability for long time series in the original vp data gets slightly more difficult, as many directories need to be searched, especially since radars/data streams might go on and off somewhat frequently (leading to missing days). As most people will probably access the data through the csv, this is maybe not much of an issue. But for data quality checking I tend to go by radar first.
This might be a wider argument for putting the radar first: for European data at least, the first step of an analysis is frequently a quality check, including checking whether the quality changed over time. A structure with radar first might facilitate this.
I also imagine that people are more likely to do analyses with a limited geographic scope than with a limited temporal scope. Here I'm thinking of, for example, ecological consultants interested in a region. The people wanting to do analyses over a large geographic scope, I suspect, also want a long time series and are likely more technically savvy.
Thanks for the input @bart1, I agree with your arguments. The European data are not homogeneous in quality or coverage like the US data, but very radar (country) dependent. So it makes sense to be able to select on radar up front. And as @CeciliaNilsson709 mentions, we'll likely require different functions anyway to query US vs EU data, so aligning is only partly useful. In any case, I don't think there is a right decision here, we just need to make one.
So, suggestion 5; feedback welcome.
# source/format/radar/yyyy/mm/dd/
# original hdf5 vp files: same as original proposal but deeper hierarchy
baltrad/hdf5/bejab/2020/01/01/bejab_vp_20200101T000000Z_0x9.h5
baltrad/hdf5/bejab/2020/01/01/bejab_vp_20200101T000500Z_0x9.h5
baltrad/hdf5/bejab/2020/01/01/... # 60/5*24 = 288 files per directory
baltrad/hdf5/bejab/2020/01/01/bejab_vp_20200101T235500Z_0x9.h5
baltrad/hdf5/bewid/
baltrad/hdf5/behel/
baltrad/hdf5/...
# daily csv
baltrad/csv-daily/bejab/2020/
baltrad/csv-daily/bejab/2020/bejab_vpts_202012.csv.gz
baltrad/csv-daily/bejab/2020/bejab_vpts_20200101.csv
baltrad/csv-daily/bejab/2020/bejab_vpts_20200102.csv
baltrad/csv-daily/bejab/2020/... # 365 unzipped files per directory
baltrad/csv-daily/bejab/2020/bejab_vpts_20201231.csv
# monthly csv
baltrad/csv-monthly/bejab/2020/bejab_vpts_202001.csv.gz
baltrad/csv-monthly/bejab/2020/... # 12 zipped files per directory
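Suggestion 5 also reduces to a few path templates. A minimal sketch (function names hypothetical; the `0x9` qualifier copied verbatim from the example filenames):

```python
from datetime import date, datetime

def vp_h5_path(source: str, radar: str, ts: datetime) -> str:
    """hdf5 vp files under source/format/radar/yyyy/mm/dd/."""
    return (f"{source}/hdf5/{radar}/{ts:%Y/%m/%d}/"
            f"{radar}_vp_{ts:%Y%m%dT%H%M%S}Z_0x9.h5")

def daily_csv_path(source: str, radar: str, d: date) -> str:
    """Daily csv files under source/csv-daily/radar/yyyy/."""
    return f"{source}/csv-daily/{radar}/{d:%Y}/{radar}_vpts_{d:%Y%m%d}.csv"

def monthly_csv_path(source: str, radar: str, d: date) -> str:
    """Monthly zipped csv files under source/csv-monthly/radar/yyyy/."""
    return f"{source}/csv-monthly/{radar}/{d:%Y}/{radar}_vpts_{d:%Y%m}.csv.gz"
```

For example, `vp_h5_path("baltrad", "bejab", datetime(2020, 1, 1))` reproduces the first hdf5 path listed above.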
From my point of view (that of someone who would write code accessing the repository, but not explore it by hand), I'm not a huge fan of all those subdirectories (radar/yyyy/mm/dd), since they just repeat data that's already in the filename (bejab_vp_20200101T235500Z_0x9.h5) and add a lot of verbose, error-prone path manipulation.
From that perspective, having almost all files in a single flat directory would actually be perfectly fine (since the filename already provides all the metadata). That would also circumvent the discussion about which subdirectory (radar or year) should be at the highest level.
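The point that the filename alone carries all the metadata can be illustrated with a small parser. A sketch (the regex is hypothetical, inferred from the example filenames in this thread: a 5-letter radar code, a timestamp, and an `0x…` qualifier):

```python
import re

# Pattern inferred from filenames like bejab_vp_20200101T235500Z_0x9.h5
VP_FILENAME = re.compile(
    r"(?P<radar>[a-z]{5})_vp_"
    r"(?P<timestamp>\d{8}T\d{6})Z_"
    r"(?P<qualifier>0x[0-9a-fA-F]+)\.h5$"
)

def parse_vp_filename(name):
    """Extract radar, timestamp and qualifier from a vp filename."""
    m = VP_FILENAME.search(name)
    return m.groupdict() if m else None

print(parse_vp_filename("bejab_vp_20200101T235500Z_0x9.h5"))
# → {'radar': 'bejab', 'timestamp': '20200101T235500', 'qualifier': '0x9'}
```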
On our S3 server @ UvA I've steered away from a flat structure. I feel the same about the duplication of information in path and filename. However, I did notice quite a performance hit when storing everything together in a single bucket and you want to, for example, provide an overview of unique radars, unique radar-years, etc.
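That performance difference comes from how S3 listing works: with a hierarchy, a delimiter-based listing (ListObjectsV2 with `Delimiter="/"`) returns only the "common prefixes" directly under a prefix (e.g. the unique radars), while a flat bucket forces you to page through and parse every key. A pure-Python mimic of that delimiter behaviour (hypothetical keys, following suggestion 5):

```python
def common_prefixes(keys, prefix, delimiter="/"):
    """Mimic S3 ListObjectsV2 with a Delimiter: return the
    'subdirectories' directly under `prefix`, without enumerating
    the deeper keys themselves."""
    found = set()
    for key in keys:
        if key.startswith(prefix):
            rest = key[len(prefix):]
            if delimiter in rest:
                found.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
    return sorted(found)

keys = [
    "baltrad/hdf5/bejab/2020/01/01/bejab_vp_20200101T000000Z_0x9.h5",
    "baltrad/hdf5/bejab/2020/01/01/bejab_vp_20200101T000500Z_0x9.h5",
    "baltrad/hdf5/bewid/2020/01/01/bewid_vp_20200101T000000Z_0x9.h5",
]
print(common_prefixes(keys, "baltrad/hdf5/"))
# → ['baltrad/hdf5/bejab/', 'baltrad/hdf5/bewid/']
```

In a flat bucket the same "unique radars" question requires fetching every key and parsing its filename, which is the hit described above.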
Thanks all for the input! There is consensus on the structure suggested in https://github.com/enram/data-repository/issues/65#issuecomment-1103856289:
source/format/radar/yyyy/mm/dd/ for hdf5
source/format/radar/yyyy/ for csv data products
Update: consensus for the structure suggested in https://github.com/enram/data-repository/issues/65#issuecomment-1103856289
Current
The vp data in the ENRAM data repository are currently organized as:
Design principles
Proposal
1. radar/yyyy
radar/yyyy structure (and filename convention for vpts).
yyyy/mm/dd/ (files for all radars).
The BALTRAD PVOL archive uses yyyy/mm/dd/HH/MM/ (files for all radars).
Although organizing by year first has some benefits, the fact that there is no radar directory makes it hard for tools to find data for a specific radar, which is almost always part of the query.
2. yyyy/radar
A valid alternative is switching the radar and year levels: