Open H-Plus-Time opened 3 months ago
An additional ClientOptions, exposed on the object store itself and the Wasm wrapper, is needed here.
Actually determining at runtime whether a HEAD is required (only in situations where the size of the file actually matters (cases where it doesn't matter: formats like parquet essentially describe themselves once you have the last 8 bytes))... that can be somewhat tricky if all you have is a range and a size that might be the size of the file or just the range.
This is one area where this repo will have to diverge from upstream (probably permanently) - each of the stores authored here will need an extra method (get_range_conditional_head, maybe get_ranges_conditional_head) that does:
Since the
meta.size
field is not an Option type, it must be set to something
With respect to Parquet specifically, I think the ParquetObjectReader
shouldn't require an ObjectMeta
in its constructor (or it should be optional). This is something I want to fix in https://github.com/apache/arrow-rs/issues/5979#issuecomment-2234204999 but I haven't gotten around to it yet. Happy to collaborate on that if you want.
Yeah, I was thinking along the same lines, that the main consumer (parquet-wasm, geoarrow-wasm) will likely be (transitively) refactored at the earliest by ~September (I agree, it really does seem to only be relevant for the FileMetadata extraction), thus removing most of the motivation for this.
Potentially worthwhile still implementing (I'm sure there's a bunch of other formats (probably zip with its' EOCD) that behave similarly), though actually benefiting from it does require controlling the CORS configuration of whatever you're requesting.
Well arrow 52.2.0 was just released, so we have about ~1 month to get this into parquet
, or else it will sit till December
Overall this is fairly straightforward - replicate the set of fixes, additional structs (ContentRange) and some of the error-reporting states.
There is, however, one important deviation when running inside a browser: 'Content-Range' is not a CORS-safelisted property :grimacing: : https://developer.mozilla.org/en-US/docs/Glossary/CORS-safelisted_response_header .
The upstream behaviour treats this as an error (rightly, given the HTTP spec mandates support for the
Range
header -> support for theContent-Range
header, and CORS is inapplicable outside the browser) - in the browser case, this is probably a no-go, given the paucity of sites that include it inAccess-Control-Expose-Headers
.Since the
meta.size
field is not an Option type, it must be set to something - the only problem with that is the natural fallback value (what object store used to do, namely use the Content-Length field) is likely to be wildly inaccurate.