cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0

User Story: Saving spatial data to GCS Buckets #698

Closed · edasmalchi closed 2 years ago

edasmalchi commented 2 years ago

User stories

A user story is implemented as well as it is communicated. If the context and the goals are made clear, it will be easier for everyone to implement, test, and refer to it.


Summary

As an analyst, I want to be able to easily save intermediate or final spatial datasets to a GCS bucket, so that I can work more efficiently and more easily share draft results.

The work being done with intake looks great for ingesting one-off datasets, but I think it would be helpful to have an easy output capability, too.

Acceptance Criteria

If I have a geopandas GeoDataFrame, I should easily be able to save it to a GCS bucket, ideally as a .parquet but alternatively as a .geojson or another format.

Also, here are a few points that need to be addressed (a repro sketch follows the list):

  1. While a regular pandas DataFrame saves just fine to a GCS bucket using a GCS path + gcsfs, it doesn't seem like geopandas' parquet implementation currently allows this:

    ArrowInvalid: Unrecognized filesystem type in URI: gs://calitp-analytics-data/data-analyses/high_quality_transit_areas/bus_hqta.parquet
  2. Trying to save a GeoDataFrame to a GCS bucket as a .geojson seems to hang; the notebook shows busy, but nothing happens for at least several minutes.
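
For reference, a minimal repro of point 1 might look like this (a sketch; the toy GeoDataFrame is made up, but the gs:// path matches the error above):

import geopandas as gpd
from shapely.geometry import Point

# hypothetical two-row GeoDataFrame, just enough to exercise the write path
gdf = gpd.GeoDataFrame(
    {"stop_id": ["a", "b"]},
    geometry=[Point(-118.24, 34.05), Point(-121.49, 38.58)],
    crs="EPSG:4326",
)

# a plain pandas DataFrame saves to a gs:// path via gcsfs, but geopandas
# hands the URI straight to pyarrow, which raises ArrowInvalid
gdf.to_parquet("gs://calitp-analytics-data/data-analyses/high_quality_transit_areas/bus_hqta.parquet")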

Notes

Tester [@edasmalchi]

  1. I'm happy to test/help figure this out

Sprint Ready Checklist

    • [ ] Acceptance criteria defined
    • [ ] Team understands acceptance criteria
    • [ ] Team has defined solution / steps to satisfy acceptance criteria
    • [ ] Acceptance criteria is verifiable / testable
    • [ ] External / 3rd Party dependencies identified
tiffanychu90 commented 2 years ago

Here's the solution I used in a notebook: you have to put the file into the GCS bucket rather than just exporting it the normal pandas way.

You have to export to a local path first, then put that local file into the GCS bucket. We could probably make this into a function that writes locally, uploads, then erases the local file.

from calitp.storage import get_fs
fs = get_fs()

# Export to GCS (but save locally first)
FILE_NAME = "bus_stop_times_by_tract.parquet"
final_df.to_parquet(f"./{FILE_NAME}")

# utils.GCS_FILE_PATH is a project-level constant holding the bucket path,
# e.g. "gs://calitp-analytics-data/data-analyses/<project>/"
fs.put(f"./{FILE_NAME}", f"{utils.GCS_FILE_PATH}{FILE_NAME}")
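
It might also be possible to skip the local copy entirely by handing to_parquet an open GCS file handle instead of a URI, since pyarrow accepts file-like objects (a sketch, untested here):

# open a writable handle on the bucket and let geopandas write into it
with fs.open(f"{utils.GCS_FILE_PATH}{FILE_NAME}", "wb") as f:
    final_df.to_parquet(f)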
edasmalchi commented 2 years ago

Oh cool, that's not too much extra work. Thanks!

I think it would be great to have this as a function.

edasmalchi commented 2 years ago

Something like this seems to work OK:

import os
from calitp.storage import get_fs
fs = get_fs()

def geoparquet_gcs_export(gdf, name):
    '''
    Save geodataframe as parquet locally, then move to GCS bucket and delete local file.

    GCS_FILE_PATH is assumed to be defined elsewhere in the notebook,
    e.g. "gs://calitp-analytics-data/data-analyses/<project>/"
    '''
    gdf.to_parquet(f"./{name}.parquet")
    fs.put(f"./{name}.parquet", f"{GCS_FILE_PATH}{name}.parquet")
    os.remove(f"./{name}.parquet")
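
Usage would look something like this (bus_hqta_gdf is a stand-in for any GeoDataFrame):

# writes ./bus_hqta.parquet, uploads it under GCS_FILE_PATH, then removes the local copy
geoparquet_gcs_export(bus_hqta_gdf, "bus_hqta")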
tiffanychu90 commented 2 years ago

Starting a shared_utils folder that can be installed within the data-analyses repo. See your function here: https://github.com/cal-itp/data-analyses/blob/2100397cef54e838f3b912e27d5a2539ddaba124/shared_utils/utils.py
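
Assuming the function keeps the signature above, calling it from a notebook in data-analyses might look like this (the import path is an assumption based on the linked file):

from shared_utils import utils

# hypothetical call; GCS_FILE_PATH must be defined where the function can see it
utils.geoparquet_gcs_export(bus_hqta_gdf, "bus_hqta")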

edasmalchi commented 2 years ago

Awesome, thanks!