njelmert opened this issue 1 year ago
Thanks for raising an issue @njelmert!
I don't think `to_parquet`/`regenerate_dataset` were designed to handle remote storage yet, but I could be wrong. Either way, we should have the necessary tools in cudf/fsspec to do this now, so I'll try to figure out what needs to change.
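To illustrate what that path could look like, here is a minimal sketch of letting fsspec open the remote object and handing cudf a file-like sink instead of a raw `gcs://` string. The bucket and file names are placeholders, and it assumes `cudf.DataFrame.to_parquet` accepts an open binary file object and that `gcsfs` is installed; it is not the library's current implementation.

```python
import cudf
import fsspec

df = cudf.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# fsspec resolves the gcs:// protocol (via gcsfs) and returns a writable
# file-like object; cudf then writes into that sink rather than trying to
# open the path itself.
with fsspec.open("gcs://my-bucket/example/part.0.parquet", mode="wb") as f:
    df.to_parquet(f)
```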
Thank you, @rjzamora! I hadn't considered that this might not be a current capability, so that's good to know. For now I have a workaround that temporarily dumps the parquet files to local disk and copies them to remote storage with `gsutil`. Looking forward to the adaptation.
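A rough sketch of that workaround, with placeholder paths and bucket name, and assuming the dataset is a `merlin.io.Dataset` as in the report:

```python
import subprocess

import merlin.io

# Placeholder dataset construction; the real input files would go here.
dataset = merlin.io.Dataset("/local/input/*.parquet", engine="parquet")

# Local writes work fine; it is the direct gcs:// write that fails.
local_out = "/tmp/nvt_output"
dataset.to_parquet(local_out)

# Push the local parquet directory to the bucket (requires gsutil and
# appropriate scopes on the VM).
subprocess.run(
    ["gsutil", "-m", "cp", "-r", local_out, "gs://my-bucket/nvt_output"],
    check=True,
)
```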
**Describe the bug**
Using the `regenerate_dataset()` method on a `merlin.io.dataset.Dataset` results in the error `RuntimeError: cuDF failure at: /workspace/.conda-bld/work/cpp/src/io/utilities/data_sink.cpp:37: Cannot open output file` when using the `gcs` protocol. Full traceback is:
**Steps/Code to reproduce bug**
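The original reproduction snippet is not shown above; a minimal sketch of the failing call, with placeholder input paths and bucket name, would be roughly:

```python
import merlin.io

# Placeholder local input; any parquet-backed Dataset should do.
dataset = merlin.io.Dataset("/local/input/*.parquet", engine="parquet")

# Writing directly to a gcs:// path raises:
#   RuntimeError: cuDF failure at .../data_sink.cpp:37: Cannot open output file
dataset.regenerate_dataset("gcs://my-bucket/regenerated/")
```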
**Expected behavior**
I expect to see parquet files in the specified GCS bucket, such as:
**Environment details (please complete the following information):**
- OS: Ubuntu 18
- `--scopes cloud-platform,storage-full,storage-rw`
- Versions of relevant libraries:
**Additional context**
Looking at the source code here, I was trying to see if this was a permissions issue with the fsspec `GCSFileSystem`, but my credentials, access, and tokens appear fine in the step:

I can also see that a token is generated and that it recognizes the project ID. I have no problem writing to this bucket using other frameworks (e.g. PySpark, pandas, pyarrow, etc.).
I dug around and this issue stuck out to me (GCSFileSystem hanging when called from multiple processes), but I am not sure if this is the right direction.
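For reference, one way to double-check the credential side independently of cuDF, roughly matching the check described above (bucket and project names are placeholders, and `token="cloud"` assumes the VM's metadata-server credentials):

```python
import gcsfs

fs = gcsfs.GCSFileSystem(project="my-project", token="cloud")

# If auth is fine, listing the bucket and writing a small object both succeed,
# which points the failure at the cuDF output path handling rather than GCS access.
print(fs.ls("my-bucket"))
with fs.open("gcs://my-bucket/_write_test.txt", "wb") as f:
    f.write(b"ok")
```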