enram / data-repository

Data quality assessment
https://enram.github.io/data-repository/
MIT License

Move ENRAM repository to new bucket and directory structure #66

Closed peterdesmet closed 1 year ago

peterdesmet commented 2 years ago

See this post on how to flatten the file structure.

  1. Check that a radar year contains only unique file names:
aws s3 ls lw-enram/be/jab/2020/ --recursive | awk '{print $4}' | xargs -I {} basename {} | sort | uniq -d
# Should print nothing (uniq -d only detects adjacent duplicates, hence the sort)
  2. Move files:
aws s3 ls lw-enram/be/jab/2020/ --recursive | awk '{print $4}' | xargs -I {} sh -c 'aws s3 cp s3://lw-enram/{} s3://enram-vp/baltrad/hdf5/bejab/2020/$(basename {}) --dryrun'
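
The uniqueness check can also be done on a list of keys in Python, which avoids depending on the order of the listing. A minimal sketch, assuming the keys have already been collected (for example from the `aws s3 ls` output); `duplicate_basenames` and the sample keys are hypothetical:

```python
from collections import Counter
from posixpath import basename


def duplicate_basenames(keys):
    """Return the file names that occur under more than one key, sorted."""
    counts = Counter(basename(key) for key in keys)
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical keys, shaped like the 4th column of `aws s3 ls ... --recursive`
keys = [
    "be/jab/2020/02/05/00/bejab_vp_20200205T004000Z_0x9.h5",
    "be/jab/2020/02/05/01/bejab_vp_20200205T010000Z_0x9.h5",
    "be/jab/2020/02/06/00/bejab_vp_20200205T004000Z_0x9.h5",  # same name, other dir
]
print(duplicate_basenames(keys))  # an empty list means flattening is safe
```

If the list is not empty, flattening the directory structure would overwrite one of the files that share a name.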

More elaborate example using variables (that currently returns an error):

aws s3 ls lw-enram/be/jab/2020/ --recursive | awk '{print $4}' | xargs -I % sh -c 'year=$(echo % | cut -d'/' -f 3);file=$(basename %);aws s3 cp s3://lw-enram/% s3://lw-enram/baltrad/h5/$year/$file --dryrun'

flyway files:

baltrad files:

peterdesmet commented 2 years ago

@niconoe It would be good if I could define the country, radar and year as variables. My attempt only works for the source path, not the destination path:

country="be"
radar="jab"
year="2020"

aws s3 ls lw-enram/$country/$radar/$year/ 
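
One way to use the same three variables for both paths is to build them in Python instead of in the shell. A minimal sketch, assuming the destination layout from the copy command above (`baltrad/hdf5/<country><radar>/<year>/<file>`); the function names and the `source` default are hypothetical:

```python
def source_prefix(country, radar, year):
    """Prefix in the old layout, e.g. 'be/jab/2020/'."""
    return f"{country}/{radar}/{year}/"


def dest_key(country, radar, year, source_key, source="baltrad"):
    """Flattened destination key: only the file name of source_key is kept."""
    file_name = source_key.rsplit("/", 1)[-1]
    return f"{source}/hdf5/{country}{radar}/{year}/{file_name}"


country, radar, year = "be", "jab", "2020"
key = "be/jab/2020/02/05/00/bejab_vp_20200205T004000Z_0x9.h5"
print(source_prefix(country, radar, year))
print(dest_key(country, radar, year, key))
```
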
peterdesmet commented 2 years ago

Pseudo code for copying files:

source_bucket = "s3://lw-enram"
dest_bucket = "s3://aloft"

for path in source_bucket:
  # Example source path: "s3://lw-enram/be/jab/2020/02/05/00/bejab_vp_20200205T004000Z_0x9.h5"

  # Parse path
  radar = dir1 & dir2 # bejab
  year = dir3         # 2020
  month = dir4        # 02
  day = dir5          # 05
  file = basename     # bejab_vp_20200205T004000Z_0x9.h5
  file_ext = extension # h5

  # Set source
  if year == 2016:
    source = "ecog-04003"
  else:
    source = "baltrad"

  # Copy file
  if file_ext != "h5":
    skip
  elif file exists at destination:
    skip
  else:
    copy file to {dest_bucket}/{source}/hdf5/{radar}/{year}/{month}/{day}/{file}
    # Example dest path: "s3://aloft/baltrad/hdf5/bejab/2020/02/05/bejab_vp_20200205T004000Z_0x9.h5"
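
The pseudocode could translate to Python roughly as follows. This is a sketch under stated assumptions, not the actual implementation: the function names are hypothetical, every source key is assumed to have exactly seven segments (country/radar/year/month/day/hour/file), and `s3` is expected to be a boto3 S3 client passed in by the caller. It uses `copy_object`, which is limited to objects of up to 5 GB.

```python
from posixpath import splitext


def dest_key_for(source_key):
    """Map an old-layout key to the new layout, or return None to skip it."""
    parts = source_key.split("/")
    if len(parts) != 7:  # assumption: country/radar/year/month/day/hour/file
        return None
    country, radar, year, month, day, _hour, file_name = parts
    if splitext(file_name)[1] != ".h5":
        return None
    source = "ecog-04003" if year == "2016" else "baltrad"
    return f"{source}/hdf5/{country}{radar}/{year}/{month}/{day}/{file_name}"


def key_exists(s3, bucket, key):
    """True if `key` exists in `bucket` (exact match, not just a shared prefix)."""
    response = s3.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1)
    return any(obj["Key"] == key for obj in response.get("Contents", []))


def copy_all(s3, source_bucket, dest_bucket, prefix="", dry_run=True):
    """Copy .h5 objects under `prefix`; return the (source, dest) key pairs handled."""
    handled = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=source_bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            src = obj["Key"]
            dest = dest_key_for(src)
            if dest is None or key_exists(s3, dest_bucket, dest):
                continue  # not an .h5 file, or already at the destination
            if not dry_run:
                s3.copy_object(
                    Bucket=dest_bucket,
                    Key=dest,
                    CopySource={"Bucket": source_bucket, "Key": src},
                )
            handled.append((src, dest))
    return handled


print(dest_key_for("be/jab/2020/02/05/00/bejab_vp_20200205T004000Z_0x9.h5"))
```

With boto3 installed, a dry run for one radar year would be `copy_all(boto3.client("s3"), "lw-enram", "aloft", prefix="be/jab/2020/")`. Passing the client in as an argument keeps the path logic testable without AWS credentials.
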
niconoe commented 2 years ago

Update: implementation in progress (simple Python scripts, just requires the boto3 package).

peterdesmet commented 2 years ago

There is now a consensus on how the repo should be structured, see #65. @niconoe let me know when your code is ready, so we can start copying data.