Open asfimport opened 2 years ago
Will Jones / @wjones127: @coryan You may have had a good reason for implementing this differently in https://github.com/apache/arrow/pull/11842, but I thought I might ask :)
Carlos O'Ryan / @coryan:
The details are a bit fuzzy at the moment. At a high level, all approaches to simulating folders over GCS will fail, but will fail in different ways. You can make something like ListObjects() return directory markers for common prefixes, but then trying to call GetFileInfo() on those markers will fail. Or it will need to be very expensive. In hindsight, I should have written a design doc outlining the tradeoffs and the decisions, but when I started the project I did not realize that the API (and tests) would involve so many.
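To make the cost concrete, here is a minimal sketch of why looking up a common prefix is either a miss or an extra request. It does not use the Arrow or GCS APIs: the bucket is modelled as a sorted set of keys, and `ObjectKeys`, `ObjectExists`, and `HasObjectsUnder` are hypothetical names made up for illustration.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical flat key space standing in for a bucket (not a real API).
using ObjectKeys = std::set<std::string>;

// One exact-name request: succeeds only if an object with this key exists.
// A common prefix such as "data/part-0" has no object of its own, so it misses.
bool ObjectExists(const ObjectKeys& keys, const std::string& key) {
  return keys.count(key) > 0;
}

// Fallback: is there any object whose key starts with "path/"?
// Against a real object store this would be an additional list request
// on top of the exact-name lookup.
bool HasObjectsUnder(const ObjectKeys& keys, const std::string& path) {
  assert(!path.empty());
  const std::string prefix = path.back() == '/' ? path : path + "/";
  // Keys are sorted, so the first key >= prefix is the only candidate to check.
  auto it = keys.lower_bound(prefix);
  return it != keys.end() && it->compare(0, prefix.size(), prefix) == 0;
}
```

With a bucket holding only `data/part-0/file.parquet`, `ObjectExists(keys, "data/part-0")` is false while `HasObjectsUnder(keys, "data/part-0")` is true, so a caller that wants the second answer pays for a listing on every miss.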
I wanted to leave this breadcrumb somewhere, but I'm not sure where. I noticed a discrepancy between "directories" created via Arrow and directories created via the GCS cloud console. One uses a trailing slash while the other does not.
In my C++ code, I have to defensively call GetFileInfo() twice, once with and once without a trailing slash.
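The workaround described here might look roughly like the sketch below. It is only an illustration under stated assumptions: the bucket is a plain map rather than a real filesystem, the boolean "is a directory marker" flag stands in for however a marker is actually recognised, and `Bucket`, `EntryType`, `LookupExact`, and `LookupEitherForm` are hypothetical names, not Arrow APIs.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical bucket model: key -> whether the object is a directory marker.
using Bucket = std::map<std::string, bool>;

enum class EntryType { kNotFound, kFile, kDirectory };

// A single exact-name lookup, standing in for one GetFileInfo()-style call.
EntryType LookupExact(const Bucket& bucket, const std::string& key) {
  auto it = bucket.find(key);
  if (it == bucket.end()) return EntryType::kNotFound;
  return it->second ? EntryType::kDirectory : EntryType::kFile;
}

// Try the path as given, then retry with the trailing slash toggled,
// so markers written in either convention are found.
EntryType LookupEitherForm(const Bucket& bucket, const std::string& path) {
  assert(!path.empty());
  const EntryType found = LookupExact(bucket, path);
  if (found != EntryType::kNotFound) return found;
  const std::string alternate =
      path.back() == '/' ? path.substr(0, path.size() - 1) : path + "/";
  return LookupExact(bucket, alternate);
}
```

For a bucket containing a marker `data/a` and a marker `data/b/`, `LookupEitherForm` reports both `data/a` and `data/b` as directories, whereas a single exact lookup of `data/b` finds nothing. The price is a second request whenever the first form misses.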
You can make something like ListObjects() return directory markers for common prefixes, but then trying to call GetFileInfo() on those markers will fail.
I've encountered this again, and I think the tradeoff of making ListObjects() work as expected but GetFileInfo() being surprising makes more sense to me. I think people expect common prefixes to work on object stores without special markers, but would understand that directories aren't "real" on object stores.
I got confused by the behavior differences between S3 and GCS, only to realize that GCS reports only special directory markers as "directories" and not the common prefixes. This can have the effect of making a directory look empty in GCS when it in fact has many folders (see example below).
We currently use the ListObjects method, but perhaps it would be more appropriate to use the ListObjectsAndPrefixes method. Since objects and prefixes are returned in the same API call, it shouldn't add much overhead.
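As a rough illustration of what reporting common prefixes as directories means, the sketch below emulates a delimiter-based listing over a flat set of keys. It is not the GCS client or the Arrow implementation; `Listing` and `ListChildren` are hypothetical names, and the key layout is invented for the example.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Result of listing one level: plain objects plus the common prefixes
// ("directories") implied by deeper keys.
struct Listing {
  std::vector<std::string> files;     // objects directly under the prefix
  std::set<std::string> directories;  // common prefixes, each ending in '/'
};

// Emulates a listing with delimiter '/' over a sorted, flat key space.
// `prefix` must be empty or end in '/'.
Listing ListChildren(const std::set<std::string>& keys,
                     const std::string& prefix) {
  assert(prefix.empty() || prefix.back() == '/');
  Listing out;
  for (auto it = keys.lower_bound(prefix); it != keys.end(); ++it) {
    const std::string& key = *it;
    if (key.compare(0, prefix.size(), prefix) != 0) break;  // past the prefix
    if (key.size() == prefix.size()) continue;  // the prefix's own marker
    const auto slash = key.find('/', prefix.size());
    if (slash == std::string::npos) {
      out.files.push_back(key);  // direct child object
    } else {
      out.directories.insert(key.substr(0, slash + 1));  // common prefix
    }
  }
  return out;
}
```

Given the keys `data/part-0/file.parquet`, `data/part-1/file.parquet`, and `data/readme.txt`, listing `data/` yields one file and the two directories `data/part-0/` and `data/part-1/`, even though no marker object exists for either. A listing that only recognises marker objects would report no directories for the same bucket.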
Reporter: Will Jones / @wjones127
Note: This issue was originally created as ARROW-17097. Please see the migration documentation for further details.