Closed: edasmalchi closed this issue 2 years ago
Here's the solution I used in a notebook: you have to put the file into the GCS bucket yourself, rather than just exporting the normal pandas way.
Export to a local path first, then put that local file into the GCS bucket. We could probably wrap this in a function that writes locally, uploads, then erases the local file.
from calitp.storage import get_fs
fs = get_fs()
# Export to GCS (but save locally first)
FILE_NAME = "bus_stop_times_by_tract.parquet"
final_df.to_parquet(f"./{FILE_NAME}")
fs.put(f"./{FILE_NAME}", f"{utils.GCS_FILE_PATH}{FILE_NAME}")
Oh cool, that's not too much extra work. Thanks!
I think it would be great to have this as a function.
Something like this seems to work ok
import os
from calitp.storage import get_fs
fs = get_fs()
def geoparquet_gcs_export(gdf, name):
    '''
    Save a geodataframe as parquet locally, then move it to the GCS bucket
    and delete the local file.
    '''
    # GCS_FILE_PATH is assumed to be defined elsewhere (e.g. utils.GCS_FILE_PATH)
    gdf.to_parquet(f"./{name}.parquet")
    fs.put(f"./{name}.parquet", f"{GCS_FILE_PATH}{name}.parquet")
    os.remove(f"./{name}.parquet")
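For illustration, the write-locally / upload / clean-up pattern above can be sketched with stdlib stand-ins, so it runs without GCS credentials. The `export_via_local_file` function, the directory standing in for the bucket, and the byte payload are all hypothetical placeholders; in the real function, `shutil.copy` is replaced by the gcsfs `fs.put` call.

```python
import os
import shutil
import tempfile

def export_via_local_file(data: bytes, name: str, bucket_dir: str) -> str:
    """Write `data` to a local file, 'upload' it by copying into
    `bucket_dir` (standing in for fs.put to a GCS bucket), then
    delete the local copy. Returns the destination path."""
    local_path = f"./{name}.parquet"
    with open(local_path, "wb") as f:
        f.write(data)
    dest = os.path.join(bucket_dir, f"{name}.parquet")
    shutil.copy(local_path, dest)  # stands in for fs.put(local, gcs_path)
    os.remove(local_path)          # clean up the local temp file
    return dest

bucket = tempfile.mkdtemp()  # stand-in for the GCS bucket
dest = export_via_local_file(b"PAR1...", "demo", bucket)
print(os.path.exists(dest), os.path.exists("./demo.parquet"))  # True False
```

The key point the sketch demonstrates is that the local file is only a temporary staging area and is removed once the upload succeeds.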
Starting a shared_utils folder that can be installed with the data-analyses repo. See your function here: https://github.com/cal-itp/data-analyses/blob/2100397cef54e838f3b912e27d5a2539ddaba124/shared_utils/utils.py
Awesome, thanks!
User stories
A user story is implemented as well as it is communicated. If the context and the goals are made clear, it will be easier for everyone to implement it, test it, refer to it.
Summary
As an analyst, I want to be able to easily save intermediate or final spatial datasets to a GCS bucket, so that I can work more efficiently and more easily share draft results.
The work being done with intake looks great for ingesting one-off datasets, but I think it would be helpful to have an easy output capability, too.
Acceptance Criteria
If I have a geopandas geodataframe, I should easily be able to save it to a GCS bucket, ideally as a .parquet but alternatively as a .geojson or other format.
Also, here are a few points that need to be addressed:
While a regular pandas dataframe saves just fine to a GCS bucket using a GCS path + gcsfs, geopandas' parquet implementation doesn't currently seem to allow this.
Trying to save a geodataframe to a GCS bucket as a .geojson seems to hang; the notebook shows busy, but nothing happens for at least several minutes.
Notes
Tester [@edasmalchi]
Sprint Ready Checklist