Can I bypass ipcluster? I think I can run the summary / series scripts directly on the issue28 instance, which already has the re-packed files on its local file system under /mnt/data.
The scripts won't run without ipcluster.
I think all I need to run is the following part in summary.py:
with h5py.File(file_path, 'r') as f:
    dset = f[h5path]
    # mask fill value
    if '_FillValue' in dset.attrs:
        arr = dset[...]
        fill = dset.attrs['_FillValue'][0]
        v = arr[arr != fill]
    else:
        v = dset[...]
    # file name GSSTF_NCEP.3.YYYY.MM.DD.he5
    return_values.append((file_name, len(v), numpy.min(v), numpy.max(v),
                          numpy.mean(v), numpy.median(v), numpy.std(v)))
Can I modify the above part of summary.py to calculate just min / max / std? I don't want to waste time copying 40GB+ files from S3.
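Something like the following is what I have in mind (just a sketch; file_path, h5path, and file_name are the same names used in the summary.py excerpt above):

import h5py
import numpy

# Trimmed-down per-file step that records only min / max / std.
# Sketch only -- file_path, h5path, and file_name follow the excerpt above.
return_values = []

with h5py.File(file_path, 'r') as f:
    dset = f[h5path]
    if '_FillValue' in dset.attrs:
        arr = dset[...]
        fill = dset.attrs['_FillValue'][0]
        v = arr[arr != fill]          # drop fill values before computing stats
    else:
        v = dset[...]
    return_values.append((file_name, numpy.min(v), numpy.max(v), numpy.std(v)))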
If the files are in the local s3 directory the s3downloader won't re-copy them.
I know. What I'm saying is that the issue28 instance I'm already running has the repacked files locally, so I'd like to compute the results directly on it rather than launch a new instance that doesn't.
I got a memory error from the summary code when I ran it against:
file_path = '/mnt/data/GSSTF_NCEP.3.concat.1x72x144.gzip9.h5'
h5path = '/HDFEOS/GRIDS/NCEP/Data Fields/Tair_2m'
Below is the error message:
ubuntu@issue28:~/datacontainer/filters$ python summary_local.py
start processing
Traceback (most recent call last):
File "summary_local.py", line 21, in <module>
v = arr[arr != fill]
MemoryError
I don't think the aggregated file can be summarized in the OSDC environment.
There's no new instance involved. The setup would be like this:
1) ssh to your existing instance
2) make sure the data files reside in the /mnt/s3 directory
3) run: $ ipcluster start -n 1  # this creates a process on the existing machine
4) run summary.py with the usual args
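For reference, the client side of steps 3 and 4 looks roughly like this (a sketch of the ipyparallel Client usage, not the actual summary.py dispatch code; the file path is a hypothetical example):

# Rough sketch of farming work out to the local one-engine ipcluster.
# Illustrative only -- summary.py's actual dispatch may differ.
from ipyparallel import Client

rc = Client()        # connects to the cluster started with: ipcluster start -n 1
view = rc[:]         # a DirectView over all engines (just one here)

def summarize(file_path):
    # placeholder worker; the real job would open the HDF5 file and compute stats
    import os
    return (file_path, os.path.getsize(file_path))

files = ['/mnt/s3/GSSTF_NCEP.3.2008.01.01.he5']   # hypothetical example path
results = view.map_sync(summarize, files)
print(results)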
Re: the error...
You are trying to read the entire dataset into memory and do a boolean selection on it. You'll need to read slices from the dataset (one slice per day) and do the calculation on each slice. That way we should get the same results as running summary.py over 7850 files (one file per day) vs. one file (with one slice per day).
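In h5py terms that means iterating over the first axis instead of reading everything at once. A sketch (dset and fill as in the earlier excerpt):

# Reduce one day at a time instead of loading the full 7850 x 72 x 144 array.
# Sketch only -- dset and fill follow the earlier summary.py excerpt.
for i in range(dset.shape[0]):      # one 72 x 144 grid per day
    arr = dset[i, :, :]             # only this slice is read into memory
    v = arr[arr != fill]            # apply the fill-value mask per slice
    # compute numpy.min(v), numpy.max(v), numpy.std(v), ... for this day here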
I already ran summary and it could not run successfully on 2 of the repacked files due to the memory error. I'm waiting on the 3rd chunk shape, but I think the result will be the same.
Do you want me to try splitting the dataset? That is, calculate min/max/std for Tair_2m[0][:][:], Tair_2m[1][:][:], ..., Tair_2m[7849][:][:] and see if the summary script works?
Our comments crossed, but yes, that's my suggestion.
So I need to rewrite part of the summary.py code. Correct?
Yes, a quick hack would be to check the shape. If the rank is 2, run the code as is. If the rank is 3, do the calculation per slice and return a list of results.
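A rough sketch of that hack (summarize_slice() is a hypothetical helper that applies the fill-value mask and returns the per-slice stats tuple; f, h5path, file_name, and return_values are as in the earlier excerpt):

# Quick-hack dispatch on dataset rank -- sketch only, not the final summary.py code.
# summarize_slice() is a hypothetical helper returning
# (count, min, max, mean, median, std) for one 2-D array.
dset = f[h5path]
rank = len(dset.shape)
if rank == 2:
    # original case: one 2-D grid per file
    results = [summarize_slice(dset[...])]
elif rank == 3:
    # concatenated case: one 2-D grid per day, reduced slice by slice
    results = [summarize_slice(dset[i, :, :]) for i in range(dset.shape[0])]
return_values.append((file_name, results))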
It took 1.15 hours for the 25x20x20 chunk shape. I'll keep posting results.
>>>>> runtime: 4130.991s
[('/mnt/data/GSSTF_NCEP.3.concat.25x20x20.gzip9.h5', 507925, -21.329285, 31.260193, 17.161457, 19.44809, 9.7533665), ('/mnt/data/GSSTF_NCEP.3.concat.25x20x20.gzip9.h5', 507997, -20.942642, 30.541962, 17.167328, 19.450409, 9.7760715), ...
It took 5.3 minutes for the 1x72x144 chunk shape.
>>>>> runtime: 317.642s
[('/mnt/data/GSSTF_NCEP.3.concat.1x72x144.gzip9.h5', 507925, -21.329285, 31.260193, 17.161457, 19.44809, 9.7533665), ('/mnt/data/GSSTF_NCEP.3.concat.1x72x144.gzip9.h5', 507997, -20.942642, 30.541962, 17.167328, 19.450409, 9.7760715),
That's a big difference. Is this summary.py?
Yes, summary.py modified to use subsetting. I'll post the code under filter/ later. The result is not a surprise to me.
Can you update the results.txt file with your latest?
The 7850,1,1 chunk shape test is still running. I'll update it as soon as it's done.
For the Summary task, one subset took 174.547s with the 7850x1x1 chunk shape. At roughly 3 minutes per subset, 3 minutes * 7850 = 23,550 minutes = 392.5 hours, so the full run would take about 16.35 days.
I tried to put the concat file to S3 again, but it failed near the end:
'GSSTF_NCEP.3.concat.h5' -> 's3://hdfdata/ncep3_concat/GSSTF_NCEP.3.concat.20151207.h5' [part 8181 of 8280, 15MB]
ERROR:
Upload of 'GSSTF_NCEP.3.concat.h5' part 8181 failed. Use
/home/ubuntu/s3cmd/s3cmd abortmp s3://hdfdata/ncep3_concat/GSSTF_NCEP.3.concat.20151207.h5 2~PAJDfNCu6wD_wNBBEORtV0nFxZ8Xr4I
to abort the upload, or
/home/ubuntu/s3cmd/s3cmd --upload-id 2~PAJDfNCu6wD_wNBBEORtV0nFxZ8Xr4I put ...
to continue the upload.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
An unexpected error has occurred.
Please try reproducing the error using
the latest s3cmd code from the git master
branch found at:
https://github.com/s3tools/s3cmd
and have a look at the known issues list:
https://github.com/s3tools/s3cmd/wiki/Common-known-issues-and-their-solutions
If the error persists, please report the
following lines (removing any private
info as necessary) to:
s3tools-bugs@lists.sourceforge.net
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Invoked as: /home/ubuntu/s3cmd/s3cmd -c /home/ubuntu/config/s3_griffin.cfg put GSSTF_NCEP.3.concat.h5 s3://hdfdata/ncep3_concat/GSSTF_NCEP.3.concat.20151207.h5
Problem: IOError: [Errno 2] No such file or directory: 'GSSTF_NCEP.3.concat.h5'
S3cmd: 1.6.0+
python: 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2]
environment LANG=en_US.UTF-8
Traceback (most recent call last):
File "/home/ubuntu/s3cmd/s3cmd", line 2813, in <module>
rc = main()
File "/home/ubuntu/s3cmd/s3cmd", line 2721, in main
rc = cmd_func(args)
File "/home/ubuntu/s3cmd/s3cmd", line 384, in cmd_object_put
response = s3.object_put(full_name, uri_final, extra_headers, extra_label = seq_label)
File "/home/ubuntu/s3cmd/S3/S3.py", line 600, in object_put
return self.send_file_multipart(file, headers, uri, size)
File "/home/ubuntu/s3cmd/S3/S3.py", line 1304, in send_file_multipart
upload.upload_all_parts()
File "/home/ubuntu/s3cmd/S3/MultiPart.py", line 113, in upload_all_parts
self.upload_part(seq, offset, current_chunk_size, labels, remote_status = remote_statuses.get(seq))
File "/home/ubuntu/s3cmd/S3/MultiPart.py", line 167, in upload_part
response = self.s3.send_file(request, self.file, labels, buffer, offset = offset, chunk_size = chunk_size)
File "/home/ubuntu/s3cmd/S3/S3.py", line 1155, in send_file
sha256_hash = checksum_sha256_file(filename, offset, size_total)
File "/home/ubuntu/s3cmd/S3/Crypto.py", line 163, in checksum_sha256_file
with open(deunicodise(filename),'rb') as f:
IOError: [Errno 2] No such file or directory: 'GSSTF_NCEP.3.concat.h5'
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
An unexpected error has occurred.
Please try reproducing the error using
the latest s3cmd code from the git master
branch found at:
https://github.com/s3tools/s3cmd
and have a look at the known issues list:
https://github.com/s3tools/s3cmd/wiki/Common-known-issues-and-their-solutions
If the error persists, please report the
above lines (removing any private
info as necessary) to:
s3tools-bugs@lists.sourceforge.net
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
I've already copied this file to S3 as GSSTF_NCEP.3_concat.h5, but if you want to experiment, try the --multipart-chunk-size-mb=512 option that Sean recommended.
The --multipart-chunk-size-mb=512 option worked.
Try s3://hdfdata/ncep3_concat/GSSTF_NCEP.3.concat.20151207.h5 now.
@jreadey It seems that you renamed the file /mnt/data/GSSTF_NCEP.3.concat.h5 to /mnt/data/GSSTF_NCEP.3_concat.h5 on the issue28 instance during my transfer yesterday. That is what caused the transfer error.
Sorry about that, I didn't know you had a copy in progress.
Run these scripts and capture benchmark times for the different chunk layouts. Just use one node for now. You can run an ipyparallel cluster locally by running: ipcluster start -n 1.
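A minimal way to capture those benchmark times is sketched below; run_summary() is a hypothetical stand-in for whatever entry point summary.py exposes, and the file names are the concat layouts used above.

# Sketch for capturing wall-clock benchmark times per chunk layout.
# run_summary() is a hypothetical stand-in for the summary.py entry point.
import time

layouts = ['/mnt/data/GSSTF_NCEP.3.concat.25x20x20.gzip9.h5',
           '/mnt/data/GSSTF_NCEP.3.concat.1x72x144.gzip9.h5']

for file_path in layouts:
    start = time.time()
    run_summary(file_path)          # hypothetical entry point
    print('>>>>> runtime: %.3fs (%s)' % (time.time() - start, file_path))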