ESIPFed / esiphub-dev

Development JupyterHub on AWS targeting pangeo environment for National Water Model exploration
MIT License

Data for CDI workshop #17

Closed: rsignell-usgs closed this issue 5 years ago

rsignell-usgs commented 6 years ago

@dbuscombe-usgs has about 20 GB of data that he would like students in the CDI workshop to be able to access. He has uploaded it to Google Drive and shared it with my USGS Google account.

I'm currently copying it to AWS S3 using this rclone command:

rclone sync gdrive-usgs:imageclass aws:cdi-workshop --checksum --fast-list --transfers 16 &
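(For reference: --checksum compares checksums instead of modification times, --fast-list uses recursive listing to cut down on API calls, and --transfers 16 runs 16 transfers in parallel.)
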
dbuscombe-usgs commented 6 years ago

FYI, the final data set will be in the region of 30 GB.

rsignell-usgs commented 6 years ago

The above rclone command syncs Google Drive with S3, so that will be no problem.

Here are two ways to read image data from s3:

https://gist.github.com/rsignell-usgs/c4e7650d5f94c00d6a7b7cd67acf2ab9
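Not the gist verbatim, but the two options look roughly like this (using one of the image paths from the dataset as an example):

# Option 1: boto3 -- requires AWS credentials to be configured
import boto3

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='cdi-workshop',
                    Key='imrecog_data/NWPU-RESISC45/test/airplane/airplane_700.jpg')
data = obj['Body'].read()

# Option 2: s3fs -- anonymous access, works as long as the objects are public
import s3fs

fs = s3fs.S3FileSystem(anon=True)
with fs.open('cdi-workshop/imrecog_data/NWPU-RESISC45/test/airplane/airplane_700.jpg', 'rb') as f:
    data = f.read()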

Will one of these work?

If not, another approach might be to try https://github.com/s3fs-fuse/s3fs-fuse

dbuscombe-usgs commented 6 years ago

The boto3 approach didn't work because it needs a credentials file. Option 2 with s3fs worked - thanks!

rsignell-usgs commented 6 years ago

Cool. So I'm trying to make sure we have all the data from Google Drive copied to S3.

I think you have 564277 files on Google Drive:

$ rclone ls gdrive-usgs:imageclass | wc -l
564277

but right now there are only 133022 files on S3:

$ rclone ls aws:cdi-workshop | wc -l
133022

I think this means my rclone command is not syncing incrementally as it should, but is restarting the transfer every time I rerun it. It's running now, but when it quits, I'll do some more in-depth sleuthing to figure out what is going on.

rsignell-usgs commented 6 years ago

Ah, I'm getting this error:

[1]+  Running                 rclone sync gdrive-usgs:imageclass aws:cdi-workshop --checksum --fast-list --transfers 16 &
(IOOS3) [rsignell@sand ~]$ 2018/06/29 16:36:26 ERROR : semseg_data/ccr/all/tile_96/road/PO2DOF.jpg: Failed to copy: failed to open source object: open file failed: googleapi: Error 403: The download quota for this file has been exceeded., downloadQuotaExceeded

Now that the process has died, I'm checking how much has transferred:

(IOOS3) [rsignell@sand ~]$ rclone size aws:cdi-workshop
Total objects: 175522
Total size: 27.063 GBytes (29058733208 Bytes)

I'll try again tomorrow to see whether it picks up where it left off and adds the rest. But the best solution might be for @dbuscombe-usgs to push the data to S3 from his own machine instead of going through Google Drive, since Google Drive seems to impose annoying download quotas.

dbuscombe-usgs commented 6 years ago

I could push from my own machine - just tell me how. That might be better in the longer term anyway, in case I need to add even more files.
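
I'm guessing it would be something like this with the AWS CLI, given write credentials for the bucket (the local path is just a placeholder):

aws s3 sync /path/to/imageclass s3://cdi-workshop/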

rsignell-usgs commented 6 years ago

I ran the sync command again; after it died again on the quota limit, it's at:

(IOOS3) [rsignell@sand ~]$ rclone size aws:cdi-workshop
Total objects: 514165
Total size: 29.329 GBytes (31492012782 Bytes)

Closing in! I'll run the sync again tomorrow.

rsignell-usgs commented 6 years ago

I fired off the rclone sync command and it finished with no errors this time. Here's the size:

(IOOS3) [rsignell@sand ~]$ rclone size aws:cdi-workshop                         
Total objects: 564328
Total size: 29.459 GBytes (31631116199 Bytes)

@dbuscombe-usgs, does that look complete? Did you add 564328 - 564277 = 51 new files in the last two days?

dbuscombe-usgs commented 6 years ago

Yes, 51 new files.

Total size should be 33.8 GB. I've looked through it, and nothing major seems to be missing. Is there some way I can do a diff between aws:cdi-workshop and gdrive-usgs:imageclass to see what's missing?

rsignell-usgs commented 6 years ago

This is what rclone is telling me is on Google Drive:

(IOOS3) [rsignell@sand ~]$ rclone size gdrive-usgs:imageclass
Total objects: 564328
Total size: 29.459 GBytes (31631116199 Bytes)

It matches. How many files do you have? That might be a better check.
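
For a file-level diff, rclone's check subcommand should also do the trick:

rclone check gdrive-usgs:imageclass aws:cdi-workshop

It reports any files that are missing from, or differ between, the two remotes.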

dbuscombe-usgs commented 6 years ago

Apologies, yes, you are correct: 29.459 GB. I was counting another folder as well.

rsignell-usgs commented 5 years ago

The pangeo.esipfed.org cluster is now running in us-east-1 instead of us-west-2.
@dbuscombe-usgs , okay if I move the data to us-east-1?

dbuscombe-usgs commented 5 years ago

sorry for late reply. Yes!

btw, I'm currently unable to start a server at http://pangeo.esipfed.org/hub/user/dbuscombe-usgs/

rsignell-usgs commented 5 years ago

@dbuscombe-usgs I was having some problems with the cluster/JupyterHub, but I think they are fixed. Try again?

dbuscombe-usgs commented 5 years ago

@rsignell-usgs Looks like it's working now. Thanks

rsignell-usgs commented 5 years ago

@dbuscombe-usgs , I've copied the data to us-east-1 because http://pangeo.esipfed.org is now running on us-east-1 and this way we won't be moving data across regions (which should make things cheaper and faster).

The new S3 location on us-east-1 is esipfed/cdi-workshop. The data still exists on us-west-2 in cdi-workshop, but it would be best to point the notebooks at the new location for the reasons above.

Make sense?

dbuscombe-usgs commented 5 years ago

Makes sense, but I still don't know how to actually make sure I'm not pulling data from us-west-2. For example, I use the s3fs utility like so:

import s3fs
from imageio import imread   # imread as used elsewhere in this thread
from skimage import color    # assuming 'color' here is skimage.color (for rgb2gray)

fs = s3fs.S3FileSystem(anon=True)
with fs.open('cdi-workshop/imrecog_data/NWPU-RESISC45/test/airplane/airplane_700.jpg', 'rb') as f:
    image = color.rgb2gray(imread(f, 'jpg'))

How would I modify that?

rsignell-usgs commented 5 years ago

Just prepend esipfed/ to the path, like this:

import s3fs
fs = s3fs.S3FileSystem(anon=True)
with fs.open('esipfed/cdi-workshop/imrecog_data/NWPU-RESISC45/test/airplane/airplane_700.jpg', 'rb') as f:
    image = color.rgb2gray(imread(f, 'jpg'))

Can you do a global replace?
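
Something along these lines ought to do it from a shell, assuming the notebooks live under the current directory (note it's one-shot: running it twice would double up the prefix):

grep -rl "cdi-workshop/" --include='*.ipynb' . | xargs sed -i "s|'cdi-workshop/|'esipfed/cdi-workshop/|g"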

dbuscombe-usgs commented 5 years ago

A little cryptic, but OK. I just noticed that not all the data is accessible this way.

fs.ls('cdi-workshop')

returns

['cdi-workshop/fully_conv_semseg',
 'cdi-workshop/imrecog_data',
 'cdi-workshop/semseg_data']

and

fs.ls('esipfed/cdi-workshop')

returns

['esipfed/cdi-workshop/fully_conv_semseg', 'esipfed/cdi-workshop/imrecog_data']
dbuscombe-usgs commented 5 years ago

I'm also now getting 'access denied' errors

from imageio import imread

with fs.open('esipfed/cdi-workshop/imrecog_data/EuroSAT/Highway/Highway_1.jpg', 'rb') as f:
    im = imread(f, 'jpg')
---------------------------------------------------------------------------
ClientError                               Traceback (most recent call last)
<ipython-input-13-b6fa61a395e9> in <module>()
      2 
      3 with fs.open('esipfed/cdi-workshop/imrecog_data/EuroSAT/Highway/Highway_1.jpg', 'rb') as f:
----> 4     im = imread(f, 'jpg')

/opt/conda/lib/python3.6/site-packages/imageio/core/functions.py in imread(uri, format, **kwargs)
    204 
    205     # Get reader and read first
--> 206     reader = read(uri, format, 'i', **kwargs)
    207     with reader:
    208         return reader.get_data(0)

/opt/conda/lib/python3.6/site-packages/imageio/core/functions.py in get_reader(uri, format, mode, **kwargs)
    127 
    128     # Return its reader object
--> 129     return format.get_reader(request)
    130 
    131 

/opt/conda/lib/python3.6/site-packages/imageio/core/format.py in get_reader(self, request)
    166             raise RuntimeError('Format %s cannot read in mode %r' % 
    167                                (self.name, select_mode))
--> 168         return self.Reader(self, request)
    169 
    170     def get_writer(self, request):

/opt/conda/lib/python3.6/site-packages/imageio/core/format.py in __init__(self, format, request)
    215             self._request = request
    216             # Open the reader/writer
--> 217             self._open(**self.request.kwargs.copy())
    218 
    219         @property

/opt/conda/lib/python3.6/site-packages/imageio/plugins/pillow.py in _open(self, pilmode, as_gray, exifrotate)
    396         def _open(self, pilmode=None, as_gray=False, exifrotate=True):
    397             return PillowFormat.Reader._open(self,
--> 398                                              pilmode=pilmode, as_gray=as_gray)
    399 
    400         def _get_file(self):

/opt/conda/lib/python3.6/site-packages/imageio/plugins/pillow.py in _open(self, pilmode, as_gray)
    120                                    self.format.name)
    121             self._fp = self._get_file()
--> 122             self._im = factory(self._fp, '')
    123             if hasattr(Image, '_decompression_bomb_check'):
    124                 Image._decompression_bomb_check(self._im.size)

/opt/conda/lib/python3.6/site-packages/PIL/JpegImagePlugin.py in jpeg_factory(fp, filename)
    778 # Factory for making JPEG and MPO instances
    779 def jpeg_factory(fp=None, filename=None):
--> 780     im = JpegImageFile(fp, filename)
    781     try:
    782         mpheader = im._getmp()

/opt/conda/lib/python3.6/site-packages/PIL/ImageFile.py in __init__(self, fp, filename)
    100 
    101         try:
--> 102             self._open()
    103         except (IndexError,  # end of data
    104                 TypeError,  # end of data (ord)

/opt/conda/lib/python3.6/site-packages/PIL/JpegImagePlugin.py in _open(self)
    305     def _open(self):
    306 
--> 307         s = self.fp.read(1)
    308 
    309         if i8(s) != 255:

/opt/conda/lib/python3.6/site-packages/s3fs/core.py in read(self, length)
   1309         if self.closed:
   1310             raise ValueError('I/O operation on closed file.')
-> 1311         self._fetch(self.loc, self.loc + length)
   1312         out = self.cache[self.loc - self.start:
   1313                          self.loc - self.start + length]

/opt/conda/lib/python3.6/site-packages/s3fs/core.py in _fetch(self, start, end)
   1273             self.cache = _fetch_range(self.s3.s3, self.bucket, self.key,
   1274                                       version_id, start, self.end,
-> 1275                                       req_kw=self.s3.req_kw)
   1276         if start < self.start:
   1277             if not self.fill_cache and end + self.blocksize < self.start:

/opt/conda/lib/python3.6/site-packages/s3fs/core.py in _fetch_range(client, bucket, key, version_id, start, end, max_attempts, req_kw)
   1493             resp = client.get_object(Bucket=bucket, Key=key,
   1494                                      Range='bytes=%i-%i' % (start, end - 1),
-> 1495                                      **kwargs)
   1496             return resp['Body'].read()
   1497         except S3_RETRYABLE_ERRORS as e:

/opt/conda/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    312                     "%s() only accepts keyword arguments." % py_operation_name)
    313             # The "self" in this scope is referring to the BaseClient.
--> 314             return self._make_api_call(operation_name, kwargs)
    315 
    316         _api_call.__name__ = str(py_operation_name)

/opt/conda/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    610             error_code = parsed_response.get("Error", {}).get("Code")
    611             error_class = self.exceptions.from_code(error_code)
--> 612             raise error_class(parsed_response, operation_name)
    613         else:
    614             return parsed_response

ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied

However, without the esipfed/ prefix it works just fine:

from imageio import imread

with fs.open('cdi-workshop/imrecog_data/EuroSAT/Highway/Highway_1.jpg', 'rb') as f:
    im = imread(f, 'jpg')
rsignell-usgs commented 5 years ago

I'm trying the sync command again:

aws s3 sync s3://cdi-workshop s3://esipfed/cdi-workshop --source-region us-west-2 --region us-east-1 --acl "public-read"

Some files/permissions must not have transferred correctly.
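
(The --source-region/--region pair handles the cross-region copy, and --acl public-read should mark each copied object as world-readable on the us-east-1 side this time around.)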

dbuscombe-usgs commented 5 years ago

Thanks!

rsignell-usgs commented 5 years ago

Getting closer...

[ec2-user@ip-172-31-29-161 ~]$ rclone size s3-west:cdi-workshop
Total objects: 564328
Total size: 29.459 GBytes (31631116199 Bytes)
[ec2-user@ip-172-31-29-161 ~]$ rclone size s3-east:esipfed/cdi-workshop
Total objects: 441391
Total size: 26.153 GBytes (28081266362 Bytes)

Still running....

rsignell-usgs commented 5 years ago

@dbuscombe-usgs, should be good to go! Please try again!

[ec2-user@ip-172-31-29-161 ~]$  rclone size s3-west:cdi-workshop
Total objects: 564328
Total size: 29.459 GBytes (31631116199 Bytes)
[ec2-user@ip-172-31-29-161 ~]$  rclone size s3-east:esipfed/cdi-workshop
Total objects: 564328
Total size: 29.459 GBytes (31631116199 Bytes)
dbuscombe-usgs commented 5 years ago

Great - thanks!

dbuscombe-usgs commented 5 years ago

Looks like all the images are now there, but I'm still getting permissions problems. Everything in the subfolders of /imrecog_data/EuroSAT seems to be off-limits.

rsignell-usgs commented 5 years ago

@dbuscombe-usgs I set the esipfed bucket policy to public read:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AddPerm",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::esipfed/*"
        }
    ]
}

I think that should make everything in the entire bucket readable. Can you please try again?
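
If it still misbehaves, a quick way to sanity-check anonymous access outside of Python is to request one of the objects over plain HTTPS, e.g.

curl -I https://esipfed.s3.amazonaws.com/cdi-workshop/imrecog_data/EuroSAT/Highway/Highway_1.jpg

A 200 response means the bucket policy is doing its job; a 403 means the object still isn't publicly readable.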

dbuscombe-usgs commented 5 years ago

Thanks @rsignell-usgs, I can now read everything