cloudyr / aws.s3

Amazon Simple Storage Service (S3) API Client
https://cloud.r-project.org/package=aws.s3

Listing prefixes in bucket with get_bucket #283

Open JanLauGe opened 5 years ago

JanLauGe commented 5 years ago

First of all, thank you so much for making and maintaining this package; it has made my work with AWS so much better!

This is a question about package functionality.

I am working with an S3 bucket containing a very large number of small files. Currently I am using the aws-cli on the command line, but I would like to switch to cloudyr/aws.s3. My problem is that I rely on the "folder" structure of S3 to identify the files I am interested in. I am aware that these are not really folders but rather prefix strings; however, utilising them works well with the aws-cli. Can I achieve something similar with cloudyr/aws.s3?

An attempt at an example using a public bucket:

$ aws s3 ls landsat-pds --no-sign-request
#                           PRE L8/
#                           PRE c1/
#                           PRE landsat-pds_stats/
#                           PRE runs/
#                           PRE tarq/
#                           PRE tarq_archive/
#                           PRE tarq_corrupt/
#                           PRE test/
# 2017-05-17 14:42:27      23767 index.html
# 2016-08-19 18:12:04        105 robots.txt
# 2019-01-09 11:00:43        160 run_info.json
# 2019-01-09 00:47:00         38 run_info_dev.json
# 2019-01-08 22:21:46     311836 run_list.txt
# 2018-08-29 01:45:15   45603307 scene_list.gz

$ aws s3 ls landsat-pds --no-sign-request | grep tarq_archive/
#                            PRE tarq_archive/

In reality the response from AWS is much longer, and my regex identifies the folder name signifying the most recent date, e.g. 20190108. You can see how this requires the response to list the PRE items, as above, without recursively listing the objects within each PRE. I then use aws.s3::s3sync to download all files in that folder. I can't sync all folders; that would take hours. Roughly, I'm after something like the sketch below.
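
To make that concrete, here is a rough sketch of the intended workflow (the prefix-listing step is the missing piece, so get_common_prefixes() below is hypothetical, and the s3sync() call is only approximate):

library(aws.s3)

# Hypothetical: list the first-level "folders" (common prefixes) of the bucket
prefixes <- get_common_prefixes(bucket = "my-bucket")  # e.g. "20190107/", "20190108/"

# Pick the prefix naming the most recent date (lexicographic max works for YYYYMMDD)
latest <- max(prefixes)

# Download only that "folder"
s3sync(path = "local_dir", bucket = "my-bucket", prefix = latest, direction = "download")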

My problem is that I don't know how to get the PRE elements with aws.s3. For example, with the calls below I only get non-PRE items, or none at all if my criteria are too strict:

aws.s3::get_bucket_df(bucket = "landsat-pds")
# 2017-05-17 14:42:27      23767 index.html
# 2016-08-19 18:12:04        105 robots.txt
# 2019-01-09 11:00:43        160 run_info.json
# 2019-01-09 00:47:00         38 run_info_dev.json
# 2019-01-08 22:21:46     311836 run_list.txt
# 2018-08-29 01:45:15   45603307 scene_list.gz

aws.s3::get_bucket_df(
    bucket = "landsat-pds",
    prefix = "tarq_archive",
    delimiter = "/")
# [1] Key               LastModified      ETag           Size              Owner_ID          Owner_DisplayName StorageClass      Bucket           
# <0 rows> (or 0-length row.names)

Any recommendation on how to do this with aws.s3?

Thanks a lot in advance!

vortexing commented 5 years ago

I believe I have a similar need. I'd actually like to get the total size of all the files within each PRE (so instead of a data frame of Key, LastModified, etc., it would be Prefix, Size, etc.).

I need get_bucket_df to give me just the first layer of prefixes in the bucket, or the prefixes that start with a given string (so, with delimiter = "/" again, the prefix "tarq_archive" would give me a list of all the prefixes inside the "tarq_archive" "folder", etc.). Something like the sketch below. @JanLauGe is that similar to your need?
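
For illustration, roughly this (a sketch only; it assumes the first-level prefixes are already known, which is exactly the missing piece, and that max = Inf pages through large buckets as in recent package versions):

library(aws.s3)

# Assume the first-level prefixes were obtained somehow (the missing piece)
prefixes <- c("tarq/", "tarq_archive/", "runs/")

# Total size in bytes of all objects under each prefix
total_bytes <- vapply(prefixes, function(p) {
  objs <- get_bucket_df(bucket = "landsat-pds", prefix = p, max = Inf)
  sum(as.numeric(objs$Size))
}, numeric(1))

data.frame(Prefix = prefixes, Size = total_bytes)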

JanLauGe commented 5 years ago

Yeah I think it's a similar application that could be tackled by the same functionality. Any progress on how to do this with aws.s3?

ines-ana commented 4 years ago

Hi! I am struggling with the same problem.

I want to list only the immediate objects/prefixes inside a specific PRE, but instead I get the list of all existing paths/objects inside the specified PRE, recursively.

I am using the function aws.s3::get_bucket_df, and the delimiter parameter doesn't seem to work as intended (or maybe I don't understand how it is supposed to work).

Taking an example from the AWS documentation on how to use prefixes and delimiters, let's say my S3 bucket had the following paths inside:

Europe/France/Nouvelle-Aquitaine/Bordeaux
North America/Canada/Quebec/Montreal
North America/USA/Washington/Bellevue
North America/USA/Washington/Seattle

If I want to list all the states in the USA, I should run:

get_bucket_df(
    bucket = "bucketName",
    prefix = "North America/USA/",
    delimiter = "/")

but the delimiter doesn't work and the function returns only the path "North America/USA/".

In fact, the delimiter seems to exclude every path that contains the string given in the delimiter parameter, because if I run:

get_bucket_df(
    bucket = "bucketName",
    prefix = "North America/USA/",
    delimiter = "/Bellevue")

I expect to get "Washington" as a response, but instead it returns:

# North America/USA/
# North America/USA/Washington/
# North America/USA/Washington/Seattle

Not sure if I am using the function incorrectly or if the delimiter just doesn't work the way I expect. If this is an issue with get_bucket, do you have an idea of how and when it could be fixed?

Thank you in advance!

cole-johanson commented 3 years ago

The issue seems to be in the as.data.frame wrapper, which truncates the CommonPrefixes attribute.

Workaround:

as.data.frame(attributes(get_bucket(bucket, prefix = prefix, delimiter = delimiter))$CommonPrefixes)
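
Applied to the landsat-pds bucket from the original post, that looks roughly like the following (a sketch; the exact columns of the CommonPrefixes attribute may vary by package version):

library(aws.s3)

# get_bucket() keeps the CommonPrefixes element of the S3 response as an
# attribute; get_bucket_df() drops it, which is why no PRE rows appear
b <- get_bucket(bucket = "landsat-pds", delimiter = "/")
as.data.frame(attributes(b)$CommonPrefixes)
# Should list the first-level "folders": L8/, c1/, landsat-pds_stats/,
# runs/, tarq/, tarq_archive/, tarq_corrupt/, test/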