enram / data-repository

Data quality assessment
https://enram.github.io/data-repository/
MIT License

Store data in a file repository #2

Closed: peterdesmet closed this issue 8 years ago

peterdesmet commented 8 years ago

Do we really need to build this? And if so, how do we keep it simple?

It depends: storing the data on our own/LifeWatch servers gives us more control. Another option is to depend on the BALTRAD infrastructure for hosting, backing up and giving access to our data. @leijnse, what is advisable?

A: A data repository is required.

How do we organize our data?

To limit the scope, we more or less decided not to have a database with the raw data, but rather to have researchers download the data as files. That means the data should ideally be organized so that you can easily download logical packages. E.g. by having the file format (e.g. hdf5) high up in the hierarchy, users avoid having to download "duplicate" data (in different formats) when selecting a directory to download.

@leijnse @hvangasteren, will users be more interested in downloading a time period for all radars or a radar for all time periods?

Suggested structure: /fileformat/yyyy/mm/dd/radar_id.hdf5 where each file contains 24h of data.

adokter commented 8 years ago

The BALTRAD infrastructure is built for real-time applications, not for long-term storage. So we need our own data repository. KNMI can host longer-term data, so the choice should be KNMI or our own data storage.

adokter commented 8 years ago

The default directory tree will be radar/yyyy/mm/dd/hh/radar_yyyymmddhhmm.h5, where each file contains one profile for one radar at one time instant. BALTRAD provides data at 15-minute intervals, so there are 4 files per directory.
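As a rough illustration of that layout, here is a minimal Python sketch that builds the path for one profile and lists the files expected in one hour directory. The function names and the radar code `bewid` are made up for the example, and the 15-minute interval is taken from the comment above; none of this is the project's actual code.

```python
from datetime import datetime, timedelta


def profile_path(radar: str, timestamp: datetime) -> str:
    """Build radar/yyyy/mm/dd/hh/radar_yyyymmddhhmm.h5 for one profile."""
    return (
        f"{radar}/{timestamp:%Y/%m/%d/%H}/"
        f"{radar}_{timestamp:%Y%m%d%H%M}.h5"
    )


def hourly_paths(radar: str, hour: datetime, interval_minutes: int = 15) -> list:
    """List the paths expected in one hour directory (assumed 15 min interval)."""
    # Truncate to the start of the hour so all files share one directory.
    start = hour.replace(minute=0, second=0, microsecond=0)
    steps = 60 // interval_minutes
    return [
        profile_path(radar, start + timedelta(minutes=i * interval_minutes))
        for i in range(steps)
    ]


# "bewid" is a hypothetical radar code used only for illustration.
for path in hourly_paths("bewid", datetime(2016, 3, 1, 14)):
    print(path)
```

With hour as the lowest directory level, each directory holds 4 files and each day holds 4 * 24 = 96 files per radar.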

peterdesmet commented 8 years ago

@adokter: if the h5 file contains data from one radar for a 15 minute interval, the /dd directory will contain 4 * 24 (96) files, no?

adokter commented 8 years ago

@peterdesmet: sorry, typo. The hour is also a subfolder.

bartaelterman commented 8 years ago

We'll upload everything to an Amazon S3 bucket.

@peterdesmet do you want me to include code to create the bucket as well? We're currently running this in our sandbox environment, meaning the bucket can be gone at the end of the day.

peterdesmet commented 8 years ago

@bartaelterman code to create the bucket: that would be useful.
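For reference, a minimal sketch of what such bucket-creation code could look like with boto3. The helper names, the bucket name `enram-profiles-sandbox` and the region `eu-west-1` are assumptions for illustration, not the code that was actually written. The helpers take an S3 client as an argument, so they do not import boto3 themselves.

```python
def create_bucket(client, bucket_name: str, region: str = "eu-west-1") -> None:
    """Create an S3 bucket using a boto3 S3 client.

    The region default is an assumption. S3 rejects an explicit
    LocationConstraint of us-east-1, so it is omitted in that case.
    """
    if region == "us-east-1":
        client.create_bucket(Bucket=bucket_name)
    else:
        client.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={"LocationConstraint": region},
        )


def upload_profile(client, bucket_name: str, local_path: str, key: str) -> None:
    """Upload one local file to the bucket under the given key."""
    client.upload_file(local_path, bucket_name, key)


# Usage, assuming boto3 is installed and AWS credentials are configured
# (bucket name and file path are hypothetical):
#
#   import boto3
#   s3 = boto3.client("s3", region_name="eu-west-1")
#   create_bucket(s3, "enram-profiles-sandbox", region="eu-west-1")
#   upload_profile(s3, "enram-profiles-sandbox",
#                  "bewid_201603011445.h5",
#                  "bewid/2016/03/01/14/bewid_201603011445.h5")
```

Since the sandbox bucket can disappear at the end of the day, running the creation step before every upload run would be one way to keep the flow repeatable. Note that this sketch does not handle the error S3 raises when the bucket already exists.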

adokter commented 8 years ago

@peterdesmet @bartaelterman

A different service worth considering is this EU-funded Dropbox-like service designed specifically for the exchange and storage of research data:

https://eudat.eu/

They offer 20 GB of storage. Given that one bird profile is 5 kB (zipped), it can store 4M profiles. A network of 100 radars produces 4*24*365*100 = 3.5M profiles per year, so the free quota would fill up in a little over a year. That is a limitation, but we could inquire whether these limits can be raised.
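The estimate can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the constant names are made up, and decimal units (1 GB = 10^9 bytes, 1 kB = 10^3 bytes) are assumed.

```python
# Figures from the comment above; decimal units are an assumption.
STORAGE_BYTES = 20 * 10**9      # 20 GB quota
PROFILE_BYTES = 5 * 10**3       # 5 kB per zipped profile
PROFILES_PER_HOUR = 4           # one profile every 15 minutes
RADARS = 100

# How many profiles fit in the quota.
capacity = STORAGE_BYTES // PROFILE_BYTES

# How many profiles the network produces in a 365-day year.
profiles_per_year = PROFILES_PER_HOUR * 24 * 365 * RADARS

# How long the quota lasts at that rate.
years_until_full = capacity / profiles_per_year

print(f"capacity: {capacity:,} profiles")
print(f"production: {profiles_per_year:,} profiles per year")
print(f"quota lasts about {years_until_full:.2f} years")
```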

peterdesmet commented 8 years ago

Since we can easily set things up on S3, I would choose that for now. I would revisit EUDAT or any of the other possible repositories once we have a functional data flow. Closing this issue: we now know that a data repository is needed.