apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++] GCS: report common prefixes as directories #32403

Open asfimport opened 2 years ago

asfimport commented 2 years ago

I was confused by the behavior differences between S3 and GCS, until I realized that GCS only reports special directory markers as "directories" and not the common prefixes. As a result, a directory can look empty in GCS when it in fact contains many folders (see the example below).

We currently use the ListObjects method, but perhaps it would be more appropriate to use ListObjectsAndPrefixes. Since objects and common prefixes are returned in the same API call, it shouldn't add much overhead.
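To make the idea of reporting common prefixes as directories concrete, here is a minimal sketch in plain C++. It makes no Arrow or GCS calls: `Listing` and `ListDirectChildren` are hypothetical names invented for illustration, and the sketch assumes "/" is the delimiter. It takes a flat list of object names (like the recursive listing in the example below) and derives the direct children of a directory, treating everything up to the next delimiter as an implied subdirectory.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// One level of a listing: objects directly under a directory, plus the
// distinct common prefixes that act as implied subdirectories.
struct Listing {
  std::vector<std::string> files;
  std::set<std::string> directories;
};

// Hypothetical helper (not part of Arrow or google-cloud-cpp): split a flat
// list of object names into the direct children of `dir`, using '/' as the
// delimiter.
Listing ListDirectChildren(const std::vector<std::string>& object_names,
                           const std::string& dir) {
  const std::string prefix = dir.empty() ? "" : dir + "/";
  Listing out;
  for (const std::string& name : object_names) {
    // Skip objects outside `dir`, and a marker object named exactly `dir/`.
    if (name.size() <= prefix.size() ||
        name.compare(0, prefix.size(), prefix) != 0) {
      continue;
    }
    const std::size_t slash = name.find('/', prefix.size());
    if (slash == std::string::npos) {
      out.files.push_back(name);  // a direct child object
    } else {
      // Everything up to the next delimiter is an implied directory.
      out.directories.insert(name.substr(0, slash));
    }
  }
  return out;
}
```

With the object names from the example, calling `ListDirectChildren(names, "nyc-taxi")` would yield no files and one directory entry per `year=...` prefix, even though no directory marker objects exist.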


library(arrow)

bucket <- gs_bucket("voltrondata-labs-datasets", retry_limit_seconds = 3, anonymous = TRUE)
s3_bucket <- s3_bucket("voltrondata-labs-datasets", endpoint_override = "https://storage.googleapis.com")

# We did not create directory markers when uploading the data
# https://github.com/apache/arrow/pull/11842#discussion_r764204767

# The directory appears empty to GCSFileSystem...
bucket$ls("nyc-taxi")
#> character(0)

# ... but S3FileSystem knows otherwise!
s3_bucket$ls("nyc-taxi")
#>  [1] "nyc-taxi/year=2009" "nyc-taxi/year=2010" "nyc-taxi/year=2011"
#>  [4] "nyc-taxi/year=2012" "nyc-taxi/year=2013" "nyc-taxi/year=2014"
#>  [7] "nyc-taxi/year=2015" "nyc-taxi/year=2016" "nyc-taxi/year=2017"
#> [10] "nyc-taxi/year=2018" "nyc-taxi/year=2019" "nyc-taxi/year=2020"
#> [13] "nyc-taxi/year=2021" "nyc-taxi/year=2022"

# Using GCS API, we only get files!
bucket$ls("nyc-taxi", recursive = TRUE)
#>   [1] "nyc-taxi/year=2009/month=1/part-0.parquet" 
#>   [2] "nyc-taxi/year=2009/month=10/part-0.parquet"
#> ...
#> [157] "nyc-taxi/year=2022/month=1/part-0.parquet" 
#> [158] "nyc-taxi/year=2022/month=2/part-0.parquet"

# Using S3 API, we can get directories!
s3_bucket$ls("nyc-taxi", recursive = TRUE)
#>   [1] "nyc-taxi/year=2009"                        
#>   [2] "nyc-taxi/year=2009/month=1"                
#>   [3] "nyc-taxi/year=2009/month=1/part-0.parquet" 
#>   [4] "nyc-taxi/year=2009/month=10"               
#>   [5] "nyc-taxi/year=2009/month=10/part-0.parquet"
#>   [6] "nyc-taxi/year=2009/month=11"               
#> ...
#> [329] "nyc-taxi/year=2022/month=2"                
#> [330] "nyc-taxi/year=2022/month=2/part-0.parquet"

Reporter: Will Jones / @wjones127

Note: This issue was originally created as ARROW-17097. Please see the migration documentation for further details.

asfimport commented 2 years ago

Will Jones / @wjones127: @coryan You may have had a good reason for implementing this differently in https://github.com/apache/arrow/pull/11842, but I thought I might ask :)

asfimport commented 2 years ago

Carlos O'Ryan / @coryan: The details are a bit fuzzy at the moment. At a high level, all approaches to simulating folders over GCS will fail, but they will fail in different ways. You can make something like ListObjects() return directory markers for common prefixes, but then calling GetFileInfo() on those markers will either fail or need to be very expensive. In hindsight, I should have written a design doc outlining the tradeoffs and the decisions, but when I started the project I did not realize that the API (and its tests) would involve so many.
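One way to read the cost concern is sketched below in plain C++, with an in-memory set standing in for a bucket. This is an illustration under assumptions, not the Arrow implementation: `FileType` and `GetFileTypeSketch` are hypothetical names. The point is that an exact-name lookup cannot confirm an implied directory; the path is a directory only if some object name starts with `path + "/"`, which against a real bucket would presumably mean an extra list request on top of the metadata lookup.

```cpp
#include <set>
#include <string>

enum class FileType { NotFound, File, Directory };

// Hypothetical stand-in for a bucket: a sorted set of object names.
// Classifies `path` the way a filesystem layer over an object store might.
FileType GetFileTypeSketch(const std::set<std::string>& objects,
                           const std::string& path) {
  // Step 1: exact-name lookup (a single metadata request on a real store).
  if (objects.count(path) > 0) return FileType::File;

  // Step 2: no object has that exact name, so search for any object whose
  // name starts with "path/". On a real store this is an additional list
  // request; here it is a lower_bound on the sorted set.
  const std::string prefix = path + "/";
  auto it = objects.lower_bound(prefix);
  if (it != objects.end() && it->compare(0, prefix.size(), prefix) == 0) {
    return FileType::Directory;
  }
  return FileType::NotFound;
}
```

Skipping the second step keeps the lookup cheap but makes it report "not found" for a prefix that a listing would show as a directory, which is the inconsistency described above.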

 

drauschenbach commented 11 months ago

I wanted to leave this breadcrumb somewhere, but I'm not sure where. I noticed a discrepancy between "directories" created via Arrow and directories created via the GCS cloud console. One uses a trailing slash while the other does not.

In my C++ code, I have to defensively call GetFileInfo() twice, once with and once without a trailing slash.
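A rough sketch of that defensive double lookup, in plain C++ so it stands alone: `FindDirectoryMarker` is a hypothetical helper, and the `lookup` callback is a stand-in for whatever existence check is actually used (for example, one built on GetFileInfo()). It tries the path without a trailing slash first, then with one, and returns whichever spelling exists.

```cpp
#include <functional>
#include <initializer_list>
#include <optional>
#include <string>

// Hypothetical helper: `lookup` should return true when an entry exists at
// exactly the given path. Returns the spelling of `path` that was found, or
// std::nullopt if neither spelling exists.
std::optional<std::string> FindDirectoryMarker(
    const std::string& path,
    const std::function<bool(const std::string&)>& lookup) {
  // Normalize to the form without trailing slashes.
  std::string bare = path;
  while (!bare.empty() && bare.back() == '/') bare.pop_back();
  if (bare.empty()) return std::nullopt;

  // Probe both spellings: without and then with a trailing slash.
  for (const std::string& candidate : {bare, bare + "/"}) {
    if (lookup(candidate)) return candidate;
  }
  return std::nullopt;
}
```

The cost is up to two lookups per directory, which is why a consistent marker convention would be preferable to probing.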

wjones127 commented 10 months ago

"You can make something like ListObjects() return directory markers for common prefixes, but then trying to call GetFileInfo() on those markers will fail."

I've encountered this again, and the tradeoff where ListObjects() works as expected and GetFileInfo() is the surprising one makes more sense to me. I think people expect common prefixes to work on object stores without special markers, but would understand that directories aren't "real" on object stores.