Right now none of the truly public data access methods seems to be functional, which defeats most of the purpose of publishing the data catalog. Several issues have come up:
Reading from https://storage.googleapis.com is catastrophically slow: both `read_parquet()` and `intake` try to read the entire monolithic parquet file into memory rather than filtering it first.
Reading partitioned data from https://storage.googleapis.com with `read_parquet()` doesn't work easily, since you have to list every file explicitly. No wildcards or directories allowed.
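One workaround is to enumerate the per-partition object URLs from known partition values, since `https://` offers no directory listing. A sketch, where the base URL, the hive-style `year=/state=` layout, and the `part-0.parquet` filename are all assumptions for illustration:

```python
# Hypothetical base URL -- substitute the real public bucket path.
BASE = "https://storage.googleapis.com/catalyst.coop/intake/test/epacems"

def partition_urls(years, states):
    """Enumerate explicit object URLs for every (year, state) partition,
    since wildcards and directory listings aren't available over https."""
    return [
        f"{BASE}/year={y}/state={s}/part-0.parquet"
        for y in years
        for s in states
    ]
```

The resulting list could then be fed to `read_parquet()` one file at a time, which is clunky but at least functional.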
Reading partitioned data from https://storage.googleapis.com with `intake` using a `*.parquet` wildcard fails with a `403 Forbidden` error, because the public user doesn't have permission to list all of the objects in the bucket.
Unauthenticated users cannot access data via `gcs://catalyst.coop/intake/test` even though everything in the (pseudo) directory is publicly readable. Do we need to create a separate `gcs://intake.catalyst.coop` bucket that is entirely public, rather than using object-level ACLs? Or is it simply not possible to provide public access over `gcs://`?
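As a stopgap while the `gcs://` question is unresolved: any object that is individually world-readable is also reachable at a mechanical `https://storage.googleapis.com/<bucket>/<key>` URL, so we could translate catalog paths for unauthenticated users. A small sketch of that mapping:

```python
def gcs_to_https(uri):
    """Map a gcs:// (or gs://) object URI to its public https endpoint.

    This only helps for objects that are individually world-readable;
    it does not provide listing, so it can't replace wildcard access.
    """
    for scheme in ("gcs://", "gs://"):
        if uri.startswith(scheme):
            return "https://storage.googleapis.com/" + uri[len(scheme):]
    raise ValueError(f"not a GCS URI: {uri}")
```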
`intake_parquet` complains when you give it a directory (filled with parquet files) as its `urlpath`, even though `pandas` and `dask` are happy to read from a directory. E.g. `{{ env(INTAKE_PATH) }}/epacems/` does not work; the error says that all paths have to end with `.parq` or `.parquet`.
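A possible workaround is to expand the directory into the explicit file list that the `.parq`/`.parquet` suffix check will accept before handing it over. A local sketch of that expansion (over GCS the same idea would need an fsspec-style glob instead of the stdlib one):

```python
import glob
import os

def parquet_paths(directory):
    """Expand a directory into a sorted, explicit list of .parquet files,
    since intake_parquet rejects bare directory urlpaths."""
    paths = sorted(glob.glob(os.path.join(directory, "*.parquet")))
    if not paths:
        raise FileNotFoundError(f"no .parquet files under {directory}")
    return paths
```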
Ideally we would provide public access both via `gcs://` (which offers much more "filesystem"-like access) and over `https://` (which has much better support in generic download tools for the less cloud-literate).
Need to understand the intended usage patterns for publicly cloud-accessible data, and how to make the public resource as functional and convenient as possible.
May also need to better understand how to limit the risk of a bajillion downloads racking up data egress fees, which might mean going requester-pays.