Closed by jreadey 8 years ago.
How big is Cortad file?
@ajelenak-thg I can't log in to the cortad instance.
jlee@griffin:~$ ssh -i osdc_keypair.pem 172.17.192.5
Permission denied (publickey).
@ajelenak-thg Never mind. I forgot to add ubuntu@172...
I see 84G for cortad on /dev/vdb. Is there an instance where /dev/vdb has 168+ GB?
How big is Cortad file?
Which one? There are 8 of them: s3cmd ls s3://hdfdata/cortad/
One on 172.17.192.5.
Should I convert all 8 with different chunking and compression?
The files are already chunked (I can't remember the actual chunk size) and compressed (gzip, level 1). We should do one version with chunking and compression and one with the same chunking but no compression.
I don't know if we have enough space in the S3 store to do more than one compression filter.
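As a rough sketch, assuming the conversion is done with h5repack (file names below are placeholders), the no-compression variant could look like this; NONE strips all filters, and as far as I know the existing chunk layout is kept unless -l is given:
h5repack -f NONE cortad_input.nc cortad_nocompression.nc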
@jreadey Please give me the specific compression filter you're interested in and the chunk shape (h5json/unidata). Also, which file, or all of the files?
Here is the list of files:
2015-11-10 02:59 39811379359 s3://hdfdata/cortad/cortadv5_FilledSST.nc
2015-11-06 00:24 1310219402 s3://hdfdata/cortad/cortadv5_HarmonicsClimatology.nc
2015-11-10 03:18 28181078892 s3://hdfdata/cortad/cortadv5_MedfillSST.nc
2015-11-11 01:36 89505625289 s3://hdfdata/cortad/cortadv5_SSTA.nc
2015-11-06 00:27 349075356 s3://hdfdata/cortad/cortadv5_SeaIceFraction.nc
2015-11-10 21:48 67047743647 s3://hdfdata/cortad/cortadv5_TSA.nc
2015-11-10 04:07 25322865025 s3://hdfdata/cortad/cortadv5_WeeklySST.nc
2015-11-06 00:31 1465067450 s3://hdfdata/cortad/cortadv5_WindSpeed.nc
Let's just start with the largest file: SSTA.nc.
Use the same compressors as for NCEP.
For the chunk shape let's go with three varieties: 1) 2D chunks with the time step as 1 unit; 2) 3D "squarish" chunks; 3) data-rods style, a 1-point time series.
I'm not sure about 3). Does the chunk size become 1x1x1? Please give me exact numbers for 1)-3).
Why would it be 1x1x1? It would be 1x1xn where n is the length of the time dimension.
SSTA original properties:
DATASET "SSTA" {
DATATYPE H5T_STD_I16LE
DATASPACE SIMPLE { ( 1617, 4320, 8640 ) / ( H5S_UNLIMITED, 4320, 8640 ) }
STORAGE_LAYOUT {
CHUNKED ( 101, 540, 540 )
SIZE 47108409684 (2.562:1 COMPRESSION)
}
FILTERS {
COMPRESSION DEFLATE { LEVEL 1 }
}
So do you want to set the chunk sizes like below?
With the above 3 chunk layouts and 3 compression methods (szip, blosc, mafisc), it will create 80G * 9 = 720G. Do we have enough space?
Also, should I remove compression as @ajelenak-thg suggested? That will require another 100G.
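For reference, a sketch of how the three SSTA variants might be produced with h5repack. The chunk shapes here are only illustrative guesses at the three styles (per-timestep 2D, squarish 3D, data rods over the current 1617 time steps) and the output names are made up; the other compressors (szip, blosc, mafisc) would need their own -f options or filter plugins in place of GZIP:
h5repack -l SSTA:CHUNK=1x4320x8640 -f SSTA:GZIP=1 cortadv5_SSTA.nc SSTA_c1x4320x8640_gzip1.nc
h5repack -l SSTA:CHUNK=101x540x540 -f SSTA:GZIP=1 cortadv5_SSTA.nc SSTA_c101x540x540_gzip1.nc
h5repack -l SSTA:CHUNK=1617x1x1 -f SSTA:GZIP=1 cortadv5_SSTA.nc SSTA_c1617x1x1_gzip1.nc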
Since @ajelenak-thg is aggregating the ncep3 files, let's hold off on this for now.
@jreadey I'm about to repack the aggregated NCEP file. The shape is 7850x720x1440. How do you want to set the chunk size?
According to your previous comment:
1) 2D chunks with the time step as 1 unit; 2) 3D "squarish" chunks; 3) data-rods style, a 1-point time series.
I can set them as follows.
My question is whether you'd like the chunk shapes to match the ones in the previous experiments, e.g. (45, 180) for the lat/lon dimensions in s3://hdfdata/ncep3_chunk_45_180_blosc_2_2_4/.
- (123, 23, 45)
This one is very similar to the third case, i.e. the data rod. Did you try the Unidata formula? Another parameter that can be varied in these formulas is the chunk size in bytes. Both formulas use 1 MiB, which is equal to the default dataset chunk cache size, but what if we reduced the chunk size to, for example, 16 KiB?
@ajelenak-thg, Unidata suggests (44,55,108).
>>> chunking.unidata_chunk([7850, 720, 1440])
(44, 55, 108)
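For reference, here is a rough sketch of that style of heuristic. It is not necessarily what chunking.unidata_chunk actually does; it assumes 4-byte values and a 1 MiB target chunk, and with those defaults it happens to reproduce (44, 55, 108) for this shape. Dropping chunk_bytes to 16 KiB shows the effect of the smaller target @ajelenak-thg asked about.
def unidata_chunk_3d(shape, val_size=4, chunk_bytes=1024 * 1024):
    # Approximate a Unidata-style chunk shape for a 3D (time, lat, lon) variable:
    # balance 1-D time-series reads against 2-D spatial-slice reads while keeping
    # each chunk close to chunk_bytes.
    d0, d1, d2 = shape
    chunk_vals = chunk_bytes / float(val_size)      # target number of values per chunk
    num_chunks = (d0 * d1 * d2) / chunk_vals        # ideal total number of chunks
    axis_chunks = num_chunks ** 0.25                # chunks along each spatial axis
    c = [max(1, int(d0 // axis_chunks ** 2)),       # time axis gets axis_chunks**2 chunks
         max(1, int(d1 // axis_chunks)),
         max(1, int(d2 // axis_chunks))]
    grew = True
    while grew:                                     # greedily grow axes while the chunk still fits
        grew = False
        for i in range(3):
            trial = list(c)
            trial[i] += 1
            if trial[i] <= shape[i] and trial[0] * trial[1] * trial[2] <= chunk_vals:
                c, grew = trial, True
    return tuple(c)

print(unidata_chunk_3d([7850, 720, 1440]))                         # (44, 55, 108)
print(unidata_chunk_3d([7850, 720, 1440], chunk_bytes=16 * 1024))  # a much smaller chunk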
Which dimension is the time one?
7850.
That should have been obvious to me since we have 7850 files in the original collection.
I'd suggest varying the layout in a way that keeps the size of each chunk about the same. The idea is to not introduce new variability related to chunk size as opposed to layout shape.
How about this:
1) 1 x 72 x 144; 2) 25 x 20 x 20; 3) 7850 x 1 x 1
The chunk sizes (in number of elements) would then be: 1) 10368; 2) 10000; 3) 7850.
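Quick sanity check of those element counts (the byte figures assume 4-byte values, which is just an assumption here):
for c in [(1, 72, 144), (25, 20, 20), (7850, 1, 1)]:
    n = c[0] * c[1] * c[2]
    print("%s -> %d values, %d bytes at 4 bytes/value" % (c, n, n * 4))
# (1, 72, 144) -> 10368 values, 41472 bytes
# (25, 20, 20) -> 10000 values, 40000 bytes
# (7850, 1, 1) -> 7850 values, 31400 bytes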
@jreadey That sounds good. I'll recreate the files with gzip level=9 first.
It took 9.5 hours to repack for chunk size 1x72x144. I'm transferring the first repacked file to s3://hdfdata/ncep3_concat/. It will take a while to transfer the 3013 blocks of 15 MB each.
It took 6.0 hours to repack for chunk size 25x20x20. I'll transfer it to S3 when I finish transferring the previous one.
It took 10.0 hours to repack for chunk size 7850x1x1. The 1x72x144 repacked file is now in S3.
I cannot transfer the other two repacked files because s3cmd gives an error:
File "/usr/lib/python2.7/socket.py", line 553, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
gaierror: [Errno -2] Name or service not known
Do we have the NCEP aggregation script checked in?
Yes, in the util directory.
Create different chunk layouts and compression filters for the cortad dataset (similar to what we did for ncep). Describe them in the repo's datasets folder.