Can I bypass ipcluster? I think I can run the summary / series scripts directly on the issue28 instance, which already has the re-packed files on its local file system under /mnt/data.
The scripts won't run without ipcluster.
I think all I need to run is the following part in summary.py:
with h5py.File(file_path, 'r') as f:
    dset = f[h5path]
    # mask fill value
    if '_FillValue' in dset.attrs:
        arr = dset[...]
        fill = dset.attrs['_FillValue'][0]
        v = arr[arr != fill]
    else:
        v = dset[...]
    # file name GSSTF_NCEP.3.YYYY.MM.DD.he5
    return_values.append((file_name, len(v), numpy.min(v), numpy.max(v),
                          numpy.mean(v), numpy.median(v), numpy.std(v)))
Can I modify the above part of summary.py to calculate just min / max / std? I don't want to waste time copying 40GB+ files from S3.
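Something like the following is what I have in mind (just a sketch; file_path, h5path, and file_name are the same names used in the summary.py excerpt above):

import h5py
import numpy

# Trimmed-down per-file step that records only min / max / std.
# Sketch only -- file_path, h5path, and file_name follow the excerpt above.
return_values = []

with h5py.File(file_path, 'r') as f:
    dset = f[h5path]
    if '_FillValue' in dset.attrs:
        arr = dset[...]
        fill = dset.attrs['_FillValue'][0]
        v = arr[arr != fill]          # drop fill values before computing stats
    else:
        v = dset[...]
    return_values.append((file_name, numpy.min(v), numpy.max(v), numpy.std(v)))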
If the files are in the local s3 directory the s3downloader won't re-copy them.
I know. What I'm saying is that the issue28 instance I'm already running has the repacked files locally, so I'd like to compute the results directly on it rather than launch a new instance that doesn't.
I got a memory error from the summary code when I ran it against:
file_path = '/mnt/data/GSSTF_NCEP.3.concat.1x72x144.gzip9.h5'
h5path = '/HDFEOS/GRIDS/NCEP/Data Fields/Tair_2m'
Below is the error message:
ubuntu@issue28:~/datacontainer/filters$ python summary_local.py
start processing
Traceback (most recent call last):
File "summary_local.py", line 21, in <module>
v = arr[arr != fill]
MemoryError
I don't think the aggregated file can be summarized in the OSDC environment.
There's no new instance involved. The setup would be like this:
1) ssh to your existing instance
2) make sure the data files reside in the /mnt/s3 directory
3) run: $ ipcluster start -n 1  # this creates a process on the existing machine
4) run summary.py with the usual args
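For reference, the client side of steps 3 and 4 looks roughly like this (a sketch of the ipyparallel Client usage, not the actual summary.py dispatch code; the file path is a hypothetical example):

# Rough sketch of farming work out to the local one-engine ipcluster.
# Illustrative only -- summary.py's actual dispatch may differ.
from ipyparallel import Client

rc = Client()        # connects to the cluster started with: ipcluster start -n 1
view = rc[:]         # a DirectView over all engines (just one here)

def summarize(file_path):
    # placeholder worker; the real job would open the HDF5 file and compute stats
    import os
    return (file_path, os.path.getsize(file_path))

files = ['/mnt/s3/GSSTF_NCEP.3.2008.01.01.he5']   # hypothetical example path
results = view.map_sync(summarize, files)
print(results)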
Re: the error...
You are trying to read the entire dataset into memory and do a boolean selection on it. You'll need to read slices from the dataset (one slice per day) and do the calculation on each slice. That way we should get the same results as running summary.py over 7850 files (one file per day) vs. one file (with one slice per day).
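In h5py terms that means iterating over the first axis instead of reading everything at once. A sketch (dset and fill as in the earlier excerpt):

# Reduce one day at a time instead of loading the full 7850 x 72 x 144 array.
# Sketch only -- dset and fill follow the earlier summary.py excerpt.
for i in range(dset.shape[0]):      # one 72 x 144 grid per day
    arr = dset[i, :, :]             # only this slice is read into memory
    v = arr[arr != fill]            # apply the fill-value mask per slice
    # compute numpy.min(v), numpy.max(v), numpy.std(v), ... for this day here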
I already ran summary and it could not run successfully on 2 of the repacked files due to the memory error. I'm waiting on the 3rd chunk shape, but I think the result will be the same.
Do you want me to try splitting the dataset? That is, calculate min/max/std for Tair_2m[0][:][:], Tair_2m[1][:][:], ..., Tair_2m[7849][:][:] and see if the summary script works?
Our comments crossed, but yes, that's my suggestion.
So I need to rewrite part of the summary.py code. Correct?
Yes, a quick hack would be to check the shape. If the rank is 2, run the code as is. If the rank is 3, do the calculation per slice and return a list of results.
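A rough sketch of that hack (summarize_slice() is a hypothetical helper that applies the fill-value mask and returns the per-slice stats tuple; f, h5path, file_name, and return_values are as in the earlier excerpt):

# Quick-hack dispatch on dataset rank -- sketch only, not the final summary.py code.
# summarize_slice() is a hypothetical helper returning
# (count, min, max, mean, median, std) for one 2-D array.
dset = f[h5path]
rank = len(dset.shape)
if rank == 2:
    # original case: one 2-D grid per file
    results = [summarize_slice(dset[...])]
elif rank == 3:
    # concatenated case: one 2-D grid per day, reduced slice by slice
    results = [summarize_slice(dset[i, :, :]) for i in range(dset.shape[0])]
return_values.append((file_name, results))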
It took 1.15 hours for the 25x20x20 chunk shape. I'll keep posting results.
>>>>> runtime: 4130.991s
[('/mnt/data/GSSTF_NCEP.3.concat.25x20x20.gzip9.h5', 507925, -21.329285, 31.260193, 17.161457, 19.44809, 9.7533665), ('/mnt/data/GSSTF_NCEP.3.concat.25x20x20.gzip9.h5', 507997, -20.942642, 30.541962, 17.167328, 19.450409, 9.7760715), ...
It took 5.3 minutes for the 1x72x144 chunk shape.
>>>>> runtime: 317.642s
[('/mnt/data/GSSTF_NCEP.3.concat.1x72x144.gzip9.h5', 507925, -21.329285, 31.260193, 17.161457, 19.44809, 9.7533665), ('/mnt/data/GSSTF_NCEP.3.concat.1x72x144.gzip9.h5', 507997, -20.942642, 30.541962, 17.167328, 19.450409, 9.7760715),
That's a big difference. Is this summary.py?
Yes, summary.py modified to use subsetting. I'll post the code under filter/ later. The result is not a surprise to me.
Can you update the results.txt file with your latest?
The 7850,1,1 chunk shape test is still running. I'll update it as soon as it's done.
For the Summary task, one subset took 174.547s with the 7850x1x1 chunk shape. At roughly 3 minutes per subset, 3 minutes * 7850 = 23,550 minutes = 392.5 hours, so the full run would take about 16.35 days.
I tried to put the concat file to S3 again, but it failed near the end:
'GSSTF_NCEP.3.concat.h5' -> 's3://hdfdata/ncep3_concat/GSSTF_NCEP.3.concat.20151207.h5' [part 8181 of 8280, 15MB]
ERROR:
Upload of 'GSSTF_NCEP.3.concat.h5' part 8181 failed. Use
/home/ubuntu/s3cmd/s3cmd abortmp s3://hdfdata/ncep3_concat/GSSTF_NCEP.3.concat.20151207.h5 2~PAJDfNCu6wD_wNBBEORtV0nFxZ8Xr4I
to abort the upload, or
/home/ubuntu/s3cmd/s3cmd --upload-id 2~PAJDfNCu6wD_wNBBEORtV0nFxZ8Xr4I put ...
to continue the upload.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
An unexpected error has occurred.
Please try reproducing the error using
the latest s3cmd code from the git master
branch found at:
https://github.com/s3tools/s3cmd
and have a look at the known issues list:
https://github.com/s3tools/s3cmd/wiki/Common-known-issues-and-their-solutions
If the error persists, please report the
following lines (removing any private
info as necessary) to:
s3tools-bugs@lists.sourceforge.net
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Invoked as: /home/ubuntu/s3cmd/s3cmd -c /home/ubuntu/config/s3_griffin.cfg put GSSTF_NCEP.3.concat.h5 s3://hdfdata/ncep3_concat/GSSTF_NCEP.3.concat.20151207.h5
Problem: IOError: [Errno 2] No such file or directory: 'GSSTF_NCEP.3.concat.h5'
S3cmd: 1.6.0+
python: 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2]
environment LANG=en_US.UTF-8
Traceback (most recent call last):
File "/home/ubuntu/s3cmd/s3cmd", line 2813, in <module>
rc = main()
File "/home/ubuntu/s3cmd/s3cmd", line 2721, in main
rc = cmd_func(args)
File "/home/ubuntu/s3cmd/s3cmd", line 384, in cmd_object_put
response = s3.object_put(full_name, uri_final, extra_headers, extra_label = seq_label)
File "/home/ubuntu/s3cmd/S3/S3.py", line 600, in object_put
return self.send_file_multipart(file, headers, uri, size)
File "/home/ubuntu/s3cmd/S3/S3.py", line 1304, in send_file_multipart
upload.upload_all_parts()
File "/home/ubuntu/s3cmd/S3/MultiPart.py", line 113, in upload_all_parts
self.upload_part(seq, offset, current_chunk_size, labels, remote_status = remote_statuses.get(seq))
File "/home/ubuntu/s3cmd/S3/MultiPart.py", line 167, in upload_part
response = self.s3.send_file(request, self.file, labels, buffer, offset = offset, chunk_size = chunk_size)
File "/home/ubuntu/s3cmd/S3/S3.py", line 1155, in send_file
sha256_hash = checksum_sha256_file(filename, offset, size_total)
File "/home/ubuntu/s3cmd/S3/Crypto.py", line 163, in checksum_sha256_file
with open(deunicodise(filename),'rb') as f:
IOError: [Errno 2] No such file or directory: 'GSSTF_NCEP.3.concat.h5'
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
An unexpected error has occurred.
Please try reproducing the error using
the latest s3cmd code from the git master
branch found at:
https://github.com/s3tools/s3cmd
and have a look at the known issues list:
https://github.com/s3tools/s3cmd/wiki/Common-known-issues-and-their-solutions
If the error persists, please report the
above lines (removing any private
info as necessary) to:
s3tools-bugs@lists.sourceforge.net
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
I've already copied this file to S3 as GSSTF_NCEP.3_concat.h5, but if you want to experiment, try the --multipart-chunk-size-mb=512 option that Sean recommended.
The --multipart-chunk-size-mb=512 option worked.
Try s3://hdfdata/ncep3_concat/GSSTF_NCEP.3.concat.20151207.h5 now.
@jreadey It seems that you renamed the file /mnt/data/GSSTF_NCEP.3.concat.h5 to /mnt/data/GSSTF_NCEP.3_concat.h5 on the issue28 instance during my transfer yesterday. That is what caused the transfer error.
Sorry about that, I didn't know you had a copy in progress.
Run these scripts and capture benchmark times for the different chunk layouts. Just use one node for now. You can run an ipyparallel cluster locally by running: ipcluster start -n 1.
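A minimal way to capture those benchmark times is sketched below; run_summary() is a hypothetical stand-in for whatever entry point summary.py exposes, and the file names are the concat layouts used above.

# Sketch for capturing wall-clock benchmark times per chunk layout.
# run_summary() is a hypothetical stand-in for the summary.py entry point.
import time

layouts = ['/mnt/data/GSSTF_NCEP.3.concat.25x20x20.gzip9.h5',
           '/mnt/data/GSSTF_NCEP.3.concat.1x72x144.gzip9.h5']

for file_path in layouts:
    start = time.time()
    run_summary(file_path)          # hypothetical entry point
    print('>>>>> runtime: %.3fs (%s)' % (time.time() - start, file_path))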