HDFGroup / datacontainer

Data Container Study

chunking/compression options for Cortad and aggregated NCEP #28

Closed by jreadey 8 years ago

jreadey commented 8 years ago

Create different chunk layouts and compression filters for the cortad dataset (similar to what we did for ncep). Describe them in the repo's datasets folder.

hyoklee commented 8 years ago

How big is the Cortad file?

hyoklee commented 8 years ago

@ajelenak-thg I can't log in to the cortad instance.

jlee@griffin:~$ ssh -i osdc_keypair.pem 172.17.192.5
Permission denied (publickey).
hyoklee commented 8 years ago

@ajelenak-thg Never mind. I forgot to add ubuntu@172...

hyoklee commented 8 years ago

I see 84G for cortad on /dev/vdb. Is there an instance with 168+ G on /dev/vdb?

ghost commented 8 years ago

How big is the Cortad file?

Which one? There are 8 of them: s3cmd ls s3://hdfdata/cortad/.

hyoklee commented 8 years ago

One on 172.17.192.5.

hyoklee commented 8 years ago

Should I convert all 8 with different chunking and compression?

ghost commented 8 years ago

The files are already chunked (I can't remember the actual chunk size) and compressed (gzip, level 1). We should do one version with chunking and compression and one with the same chunking but no compression.

I don't know if we have enough space in the S3 store to do more than one compression filter.
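
For reference, a minimal h5py sketch of the no-compression variant, copying a dataset while keeping its existing chunk layout but dropping the gzip filter; the file and dataset names below are placeholders:

    # Copy a dataset with the same chunking but no compression filter.
    # File and dataset names are placeholders.
    import h5py

    with h5py.File("cortad_in.nc", "r") as src, h5py.File("cortad_nocomp.h5", "w") as dst:
        dset = src["SST"]                                  # placeholder dataset name
        out = dst.create_dataset("SST", shape=dset.shape, dtype=dset.dtype,
                                 chunks=dset.chunks)       # same chunks, no filters
        for i in range(dset.shape[0]):                     # copy one slice at a time
            out[i] = dset[i]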

hyoklee commented 8 years ago

@jreadey Please give me the specific compression filter you're interested in and the chunk shape (h5json/unidata). Also, which file, or all files?

hyoklee commented 8 years ago

Here is the list of files.

2015-11-10 02:59 39811379359   s3://hdfdata/cortad/cortadv5_FilledSST.nc
2015-11-06 00:24 1310219402   s3://hdfdata/cortad/cortadv5_HarmonicsClimatology.nc
2015-11-10 03:18 28181078892   s3://hdfdata/cortad/cortadv5_MedfillSST.nc
2015-11-11 01:36 89505625289   s3://hdfdata/cortad/cortadv5_SSTA.nc
2015-11-06 00:27 349075356   s3://hdfdata/cortad/cortadv5_SeaIceFraction.nc
2015-11-10 21:48 67047743647   s3://hdfdata/cortad/cortadv5_TSA.nc
2015-11-10 04:07 25322865025   s3://hdfdata/cortad/cortadv5_WeeklySST.nc
2015-11-06 00:31 1465067450   s3://hdfdata/cortad/cortadv5_WindSpeed.nc
jreadey commented 8 years ago

Let's just start with the largest file: SSTA.nc.

Use the same compressors as for NCEP.

For chunk shape let's go with three varieties: 1) 2D chunks with the time step as 1 unit; 2) 3D "squarish" chunks; 3) data-rods style: a 1-point time series.

hyoklee commented 8 years ago

I'm not sure about 3). Does the chunk size become 1x1x1? Please give me exact numbers for 1)-3).

jreadey commented 8 years ago

Why would it be 1x1x1? It would be 1x1xn where n is the length of the time dimension.

hyoklee commented 8 years ago

Original SSTA dataset properties:

   DATASET "SSTA" {
      DATATYPE  H5T_STD_I16LE
      DATASPACE  SIMPLE { ( 1617, 4320, 8640 ) / ( H5S_UNLIMITED, 4320, 8640 ) }
      STORAGE_LAYOUT {
         CHUNKED ( 101, 540, 540 )
         SIZE 47108409684 (2.562:1 COMPRESSION)
      }
      FILTERS {
         COMPRESSION DEFLATE { LEVEL 1 }
      }

So do you want to set the chunk sizes like below?

  1. (1,540,540)
  2. (26, 68, 135) // based on Aleksandar's example.
  3. (1617, 1,1)

With the above 3 chunk layouts and 3 compression methods (szip, blosc, mafisc), it will create 80G * 9 = 720G. Do we have enough space?

Also, should I remove compression as @ajelenak-thg suggested? That will require another 100G.
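
For context, the dtype and shape in the h5dump output above make the storage arithmetic easy to check; a quick sketch (the 2-byte item size comes from H5T_STD_I16LE):

    # Uncompressed size and compression ratio of SSTA, plus bytes per chunk
    # for the three candidate layouts.
    shape = (1617, 4320, 8640)
    itemsize = 2                                       # H5T_STD_I16LE
    raw_bytes = shape[0] * shape[1] * shape[2] * itemsize
    print(raw_bytes)                                   # ~120.7 GB uncompressed
    print(raw_bytes / 47108409684)                     # ~2.56, the reported ratio

    for chunks in [(1, 540, 540), (26, 68, 135), (1617, 1, 1)]:
        print(chunks, chunks[0] * chunks[1] * chunks[2] * itemsize)  # bytes per chunk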

jreadey commented 8 years ago

Since @ajelenak-thg is aggregating the ncep3 files, let's hold off on this for now.

hyoklee commented 8 years ago

@jreadey I'm about to repack the aggregated NCEP. The shape is 7850x720x1440. How do you want to set the chunk size?

According to your previous comment,

1) 2D chunk with time step as 1 unit 2) 3D "squarish" chunks 3) Data rods style - 1 point time series

I can set them as follows.

  1. (1, 720, 1440)
  2. (123, 23, 45) // @ajelenak-thg's chunking.h5py_chunk suggested shape
  3. (7850, 1, 1)

My question is whether you'd like the chunk shapes to match the ones in the previous experiments, like (45,180) for the lat/lon dimensions in s3://hdfdata/ncep3_chunk_45_180_blosc_2_2_4/.

ghost commented 8 years ago

  1. (123, 23, 45)

This one is very similar to the third case, i.e. the data rod. Did you try the Unidata formula? Another parameter that can be varied in these formulas is the chunk size in bytes. Both formulas use 1 MiB, which is equal to the default dataset chunk cache size, but what if we reduced the chunk size to, for example, 16 KiB?
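
For illustration, a simplified "squarish" heuristic with a tunable byte budget; this is not the actual Unidata or chunking module algorithm, and the 4-byte item size is an assumption:

    # Simplified proportional-scaling heuristic (NOT the Unidata formula):
    # shrink every dimension by the same factor until one chunk holds roughly
    # target_bytes of data.
    import math

    def squarish_chunks(shape, itemsize=4, target_bytes=1024 * 1024):
        target_elems = max(1, target_bytes // itemsize)
        scale = (target_elems / math.prod(shape)) ** (1.0 / len(shape))
        return tuple(max(1, min(d, int(d * scale))) for d in shape)

    print(squarish_chunks([7850, 720, 1440]))                          # ~1 MiB chunks
    print(squarish_chunks([7850, 720, 1440], target_bytes=16 * 1024))  # ~16 KiB chunks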

hyoklee commented 8 years ago

@ajelenak-thg, Unidata suggests (44,55,108).

>>> chunking.unidata_chunk([7850, 720, 1440])
(44, 55, 108)
jreadey commented 8 years ago

Which dimension is the time one?

hyoklee commented 8 years ago

7850.

jreadey commented 8 years ago

That should have been obvious to me since we have 7850 files in the original collection.

I'd suggest varying the layout in a way that keeps the size of each chunk about the same. The idea is to not introduce new variability from the chunk size itself, only from the layout shape.

How about this:

1) (1, 72, 144) 2) (25, 20, 20) 3) (7850, 1, 1)

The chunk sizes (in elements) would then be: 1) 10368 2) 10000 3) 7850
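
The per-chunk element counts are easy to confirm; assuming 4-byte values (an assumption about the NCEP dtype), each layout also lands at roughly the same number of bytes per chunk:

    # Elements (and bytes, assuming 4-byte values) per chunk for the three layouts.
    for chunks in [(1, 72, 144), (25, 20, 20), (7850, 1, 1)]:
        elems = chunks[0] * chunks[1] * chunks[2]
        print(chunks, elems, elems * 4)   # 10368 / 10000 / 7850 elements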

hyoklee commented 8 years ago

@jreadey It sounds good. I'll recreate the files with gzip level=9 first.

hyoklee commented 8 years ago

It took 9.5 hours to repack for chunk size 1x72x144. I'm transferring the first repacked file to s3://hdfdata/ncep3_concat/. It will take a while to transfer the 3013 blocks of 15 MB each.

hyoklee commented 8 years ago

It took 6.0 hours to repack for chunk size 25x20x20. I'll transfer it to S3 when I finish transferring the previous one.

hyoklee commented 8 years ago

It took 10.0 hours to repack for chunk size 7850x1x1. The 1x72x144 repacked file is now in S3.

I cannot transfer the other two repacked files because s3cmd gives an error.

 File "/usr/lib/python2.7/socket.py", line 553, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
gaierror: [Errno -2] Name or service not known
jreadey commented 8 years ago

Do we have the NCEP aggregation script checked in?

ghost commented 8 years ago

Yes, in the util directory.
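
The script itself is in the repo's util directory; purely as an illustration of the general approach (concatenating per-file 720x1440 grids along an unlimited time axis), here is a hypothetical sketch, with made-up file pattern, dataset name, and dtype:

    # Hypothetical sketch of aggregating many NCEP grids along time; the real
    # script lives in the repo's util directory.
    import glob
    import h5py

    files = sorted(glob.glob("ncep3/*.h5"))              # hypothetical inputs
    with h5py.File("ncep3_concat.h5", "w") as out:
        agg = out.create_dataset("data", shape=(0, 720, 1440),
                                 maxshape=(None, 720, 1440),
                                 dtype="f4", chunks=(1, 720, 1440))
        for i, name in enumerate(files):
            with h5py.File(name, "r") as src:
                agg.resize(i + 1, axis=0)
                agg[i] = src["data"][...]                # hypothetical dataset name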