FYI, the final data set will be in the region of 30 GB.
The above rclone command syncs Google Drive with S3, so that will be no problem.
Here are two ways to read image data from s3:
https://gist.github.com/rsignell-usgs/c4e7650d5f94c00d6a7b7cd67acf2ab9
Will one of these work?
If not, another approach might be to try https://github.com/s3fs-fuse/s3fs-fuse
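(For reference, s3fs-fuse mounts a bucket as a local filesystem; a rough sketch, assuming a publicly readable bucket and an empty mount point at /mnt/cdi-workshop, both names illustrative:)
$ mkdir -p /mnt/cdi-workshop
$ s3fs cdi-workshop /mnt/cdi-workshop -o public_bucket=1
After that the images could be read with ordinary file paths.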
The boto3 approach didn't work because it needs a credentials file. Option 2 with s3fs worked - thanks!
Cool. So I'm trying to make sure we have all the data from Google Drive copied to S3.
I think you have 564277 files on Google Drive:
$ rclone ls gdrive-usgs:imageclass | wc -l
564277
but right now there are only 133022 files on S3:
$ rclone ls aws:cdi-workshop | wc -l
133022
I think this means my rclone command is not syncing incrementally as it should, but restarting the transfer from scratch every time I rerun it. It's running now, but when it quits, I'll do a little more in-depth sleuthing to figure out what is going on.
Ah, I'm getting this error:
[1]+ Running rclone sync gdrive-usgs:imageclass aws:cdi-workshop --checksum --fast-list --transfers 16 &
(IOOS3) [rsignell@sand ~]$ 2018/06/29 16:36:26 ERROR : semseg_data/ccr/all/tile_96/road/PO2DOF.jpg: Failed to copy: failed to open source object: open file failed: googleapi: Error 403: The download quota for this file has been exceeded., downloadQuotaExceeded
So now that the process has died, checking how much has transferred:
(IOOS3) [rsignell@sand ~]$ rclone size aws:cdi-workshop
Total objects: 175522
Total size: 27.063 GBytes (29058733208 Bytes)
I'll try again tomorrow to see if it picks up where it left off. But the best solution might be for @dbuscombe-usgs to push the data to S3 from his own machine instead of going through Google Drive, since Google Drive seems to be imposing annoying download quotas.
I could push from my own machine - just tell me how. That might be better in the longer term anyway, in case I need to add even more files.
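(For what it's worth, pushing from a local machine would presumably just mean configuring an rclone S3 remote and copying the local folder up; a rough sketch, where the remote name aws and the local folder ./imageclass are illustrative:)
$ rclone config    # interactively add an "aws" remote of type s3 with your credentials
$ rclone copy ./imageclass aws:cdi-workshop --transfers 16 --checksum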
I ran the sync command again; after it died again on the quota limit, the bucket is at:
(IOOS3) [rsignell@sand ~]$ rclone size aws:cdi-workshop
Total objects: 514165
Total size: 29.329 GBytes (31492012782 Bytes)
Closing in! I'll run the sync again tomorrow.
I fired off the rclone sync command and it finished with no errors this time. Here's the size:
(IOOS3) [rsignell@sand ~]$ rclone size aws:cdi-workshop
Total objects: 564328
Total size: 29.459 GBytes (31631116199 Bytes)
@dbuscombe-usgs, does that look complete?
Did you add 564328 - 564277 = 51 new files in the last two days?
Yes, 51 new files.
Total size should be 33.8 GB. I've looked through and there doesn't seem to be anything major missing. Is there some way I can do a diff between aws:cdi-workshop and gdrive-usgs:imageclass to see what's missing?
This is what rclone is telling me is on Google Drive:
(IOOS3) [rsignell@sand ~]$ rclone size gdrive-usgs:imageclass
Total objects: 564328
Total size: 29.459 GBytes (31631116199 Bytes)
It matches. How many files do you have? That might be a better check.
Apologies, yes you are correct. 29.459 GB. I was counting another folder as well.
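(For a direct diff, rclone check can compare two remotes and report files that are missing or differ; a sketch, assuming the same remote names used above:)
$ rclone check gdrive-usgs:imageclass aws:cdi-workshop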
The pangeo.esipfed.org cluster is now running in us-east-1 instead of us-west-2.
@dbuscombe-usgs, okay if I move the data to us-east-1?
Sorry for the late reply. Yes!
btw, I'm currently unable to start a server at http://pangeo.esipfed.org/hub/user/dbuscombe-usgs/
@dbuscombe-usgs I was having some problems with the cluster/Jupyterhub, but I think they are fixed. Try again?
@rsignell-usgs Looks like it's working now. Thanks
@dbuscombe-usgs, I've copied the data to us-east-1, because http://pangeo.esipfed.org is now running in us-east-1, and this way we won't be moving data across regions (which should make things cheaper and faster). The new S3 location in us-east-1 is esipfed/cdi-workshop. The data still exists in us-west-2 under cdi-workshop, but it would be best to change the notebooks to the former for the reasons above.
Make sense?
Makes sense, but I still don't know how to actually make sure I'm not pulling data from us-west-2. For example, I make use of the s3fs utility like so:
from imageio import imread
from skimage import color  # for rgb2gray
import s3fs

fs = s3fs.S3FileSystem(anon=True)  # anonymous (public) read access
with fs.open('cdi-workshop/imrecog_data/NWPU-RESISC45/test/airplane/airplane_700.jpg', 'rb') as f:
    image = color.rgb2gray(imread(f, 'jpg'))
How would I modify that?
Just add esipfed/ to the bucket name like this:
import s3fs
fs = s3fs.S3FileSystem(anon=True)
with fs.open('esipfed/cdi-workshop/imrecog_data/NWPU-RESISC45/test/airplane/airplane_700.jpg', 'rb') as f:
    image = color.rgb2gray(imread(f, 'jpg'))
Can you do a global replace?
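(A global replace across the notebooks could presumably be done with GNU sed; a rough sketch, where the pattern and the *.ipynb glob are illustrative, and .bak backups are kept:)
$ sed -i.bak "s|'cdi-workshop/|'esipfed/cdi-workshop/|g" *.ipynb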
A little cryptic, but ok. I just noticed that not all the data is accessible this way.
fs.ls('cdi-workshop')
returns
['cdi-workshop/fully_conv_semseg',
'cdi-workshop/imrecog_data',
'cdi-workshop/semseg_data']
and
fs.ls('esipfed/cdi-workshop')
returns
['esipfed/cdi-workshop/fully_conv_semseg', 'esipfed/cdi-workshop/imrecog_data']
I'm also now getting 'access denied' errors:
from imageio import imread
with fs.open('esipfed/cdi-workshop/imrecog_data/EuroSAT/Highway/Highway_1.jpg', 'rb') as f:
    im = imread(f, 'jpg')
---------------------------------------------------------------------------
ClientError Traceback (most recent call last)
<ipython-input-13-b6fa61a395e9> in <module>()
2
3 with fs.open('esipfed/cdi-workshop/imrecog_data/EuroSAT/Highway/Highway_1.jpg', 'rb') as f:
----> 4 im = imread(f, 'jpg')
/opt/conda/lib/python3.6/site-packages/imageio/core/functions.py in imread(uri, format, **kwargs)
204
205 # Get reader and read first
--> 206 reader = read(uri, format, 'i', **kwargs)
207 with reader:
208 return reader.get_data(0)
/opt/conda/lib/python3.6/site-packages/imageio/core/functions.py in get_reader(uri, format, mode, **kwargs)
127
128 # Return its reader object
--> 129 return format.get_reader(request)
130
131
/opt/conda/lib/python3.6/site-packages/imageio/core/format.py in get_reader(self, request)
166 raise RuntimeError('Format %s cannot read in mode %r' %
167 (self.name, select_mode))
--> 168 return self.Reader(self, request)
169
170 def get_writer(self, request):
/opt/conda/lib/python3.6/site-packages/imageio/core/format.py in __init__(self, format, request)
215 self._request = request
216 # Open the reader/writer
--> 217 self._open(**self.request.kwargs.copy())
218
219 @property
/opt/conda/lib/python3.6/site-packages/imageio/plugins/pillow.py in _open(self, pilmode, as_gray, exifrotate)
396 def _open(self, pilmode=None, as_gray=False, exifrotate=True):
397 return PillowFormat.Reader._open(self,
--> 398 pilmode=pilmode, as_gray=as_gray)
399
400 def _get_file(self):
/opt/conda/lib/python3.6/site-packages/imageio/plugins/pillow.py in _open(self, pilmode, as_gray)
120 self.format.name)
121 self._fp = self._get_file()
--> 122 self._im = factory(self._fp, '')
123 if hasattr(Image, '_decompression_bomb_check'):
124 Image._decompression_bomb_check(self._im.size)
/opt/conda/lib/python3.6/site-packages/PIL/JpegImagePlugin.py in jpeg_factory(fp, filename)
778 # Factory for making JPEG and MPO instances
779 def jpeg_factory(fp=None, filename=None):
--> 780 im = JpegImageFile(fp, filename)
781 try:
782 mpheader = im._getmp()
/opt/conda/lib/python3.6/site-packages/PIL/ImageFile.py in __init__(self, fp, filename)
100
101 try:
--> 102 self._open()
103 except (IndexError, # end of data
104 TypeError, # end of data (ord)
/opt/conda/lib/python3.6/site-packages/PIL/JpegImagePlugin.py in _open(self)
305 def _open(self):
306
--> 307 s = self.fp.read(1)
308
309 if i8(s) != 255:
/opt/conda/lib/python3.6/site-packages/s3fs/core.py in read(self, length)
1309 if self.closed:
1310 raise ValueError('I/O operation on closed file.')
-> 1311 self._fetch(self.loc, self.loc + length)
1312 out = self.cache[self.loc - self.start:
1313 self.loc - self.start + length]
/opt/conda/lib/python3.6/site-packages/s3fs/core.py in _fetch(self, start, end)
1273 self.cache = _fetch_range(self.s3.s3, self.bucket, self.key,
1274 version_id, start, self.end,
-> 1275 req_kw=self.s3.req_kw)
1276 if start < self.start:
1277 if not self.fill_cache and end + self.blocksize < self.start:
/opt/conda/lib/python3.6/site-packages/s3fs/core.py in _fetch_range(client, bucket, key, version_id, start, end, max_attempts, req_kw)
1493 resp = client.get_object(Bucket=bucket, Key=key,
1494 Range='bytes=%i-%i' % (start, end - 1),
-> 1495 **kwargs)
1496 return resp['Body'].read()
1497 except S3_RETRYABLE_ERRORS as e:
/opt/conda/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
312 "%s() only accepts keyword arguments." % py_operation_name)
313 # The "self" in this scope is referring to the BaseClient.
--> 314 return self._make_api_call(operation_name, kwargs)
315
316 _api_call.__name__ = str(py_operation_name)
/opt/conda/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
610 error_code = parsed_response.get("Error", {}).get("Code")
611 error_class = self.exceptions.from_code(error_code)
--> 612 raise error_class(parsed_response, operation_name)
613 else:
614 return parsed_response
ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied
However, without the esipfed prefix it works just fine:
from imageio import imread
with fs.open('cdi-workshop/imrecog_data/EuroSAT/Highway/Highway_1.jpg', 'rb') as f:
    im = imread(f, 'jpg')
I'm trying the sync command again:
aws s3 sync s3://cdi-workshop s3://esipfed/cdi-workshop --source-region us-west-2 --region us-east-1 --acl "public-read"
Some files/permissions must not have transferred correctly.
Thanks!
Getting closer...
[ec2-user@ip-172-31-29-161 ~]$ rclone size s3-west:cdi-workshop
Total objects: 564328
Total size: 29.459 GBytes (31631116199 Bytes)
[ec2-user@ip-172-31-29-161 ~]$ rclone size s3-east:esipfed/cdi-workshop
Total objects: 441391
Total size: 26.153 GBytes (28081266362 Bytes)
Still running....
@dbuscombe-usgs, should be good to go! Please try again!
[ec2-user@ip-172-31-29-161 ~]$ rclone size s3-west:cdi-workshop
Total objects: 564328
Total size: 29.459 GBytes (31631116199 Bytes)
[ec2-user@ip-172-31-29-161 ~]$ rclone size s3-east:esipfed/cdi-workshop
Total objects: 564328
Total size: 29.459 GBytes (31631116199 Bytes)
Great - thanks!
Looks like all the images are now there, but I'm still getting permissions problems. Looks like everything in subfolders of /imrecog_data/EuroSAT is off-limits.
@dbuscombe-usgs I set the esipfed bucket policy to public read:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AddPerm",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::esipfed/*"
    }
  ]
}
I think that should make everything in the entire bucket readable. Can you please try again?
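(For reference, a bucket policy like this can be applied from the AWS CLI; a minimal sketch, assuming the JSON above is saved locally as policy.json:)
$ aws s3api put-bucket-policy --bucket esipfed --policy file://policy.json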
Thanks @rsignell-usgs, I can now read everything
@dbuscombe-usgs has about 20 GB of data that he would like students in the CDI workshop to be able to access. He has uploaded it to Google Drive and shared it with my USGS Google account.
I'm currently copying it to AWS S3 using this rclone command:
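(Presumably the same command that appears in the error log above:)
$ rclone sync gdrive-usgs:imageclass aws:cdi-workshop --checksum --fast-list --transfers 16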