CityOfLosAngeles / aqueduct

A shared pipeline for building ETLs and batch jobs that we run at the City of LA for Data Science Projects. Built on Apache Airflow & Civis Platform
Apache License 2.0

exporting data to Power BI, ESRI, etc. #350

Open hunterowens opened 3 years ago

hunterowens commented 3 years ago

Currently, the ITA team and the Planning PMU team have been working on a project regarding planning entitlements.

This project consists of 3-4 source datasets, some static (Census, etc.) and a couple dynamic (PCTS, the Planning Case Tracking System).

We have a number of cleaning scripts (see src) and a custom package for subsetting this data (see laplan). We'd like to expose this to end users in the planning performance management section (Power BI, Excel, GDrive) and the GIS unit (Arc) to start.

Here are the options I see, but I'm curious whether Civis has worked with this before.

  1. Save Static Files to S3, SFTP, or Similar (a rough sketch of this option is below)
  2. Expose Database Connection
  3. Expose Python API
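For what it's worth, option 1 could be as small as something like the following (a minimal sketch; the bucket and file names are placeholders, and it assumes pandas plus s3fs/pyarrow are installed and AWS credentials are configured):

```python
import pandas as pd

# Hypothetical cleaned extract produced by one of the cleaning scripts in src
df = pd.read_csv("processed/census_subset.csv")

# Write a static Parquet snapshot to S3; pandas hands the s3:// URL off to s3fs.
# Parquet keeps dtypes intact for whoever picks the file up downstream.
df.to_parquet("s3://city-planning-entitlements/static/census_subset.parquet")
```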

Ref

hunterowens commented 3 years ago

cc @tiffanychu90 @trantom

hunterowens commented 3 years ago

existing planning PowerBI dashboards

tiffanychu90 commented 3 years ago

Specifically, this notebook is one that City Planning would probably want as a PowerBI report.

Cell 4 is where we're at a bit of an impasse:

  1. It loads several processed/cleaned tables. --> We're more familiar with how to schedule these processed/cleaned tables in Civis.
  2. It joins these tables and aggregates. --> We need help with this step, because we're relying on Python functions to do it, and the resulting table is what backs the interactive map (a rough sketch of this step follows after this list).
  3. It puts the results on a map. Users can then filter by year or categories of interest. --> PowerBI report
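To make the impasse concrete, the join/aggregate step in Cell 4 is roughly the following (a sketch only; the table and column names are placeholders, not the actual laplan tables):

```python
import pandas as pd

# Step 1: load the processed/cleaned tables (the part we already know how to
# schedule in Civis).
cases = pd.read_parquet("pcts_cases.parquet")       # dynamic (PCTS)
tracts = pd.read_parquet("census_tracts.parquet")   # static (Census)

# Step 2: join and aggregate with Python functions; this result is the table
# that backs the interactive map.
joined = cases.merge(tracts, on="tract_id", how="left")
aggregated = (
    joined.groupby(["year", "case_category"])
          .agg(num_cases=("case_id", "count"))
          .reset_index()
)

# Step 3 would put `aggregated` on a map / behind the PowerBI report.
```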

ian-r-rose commented 3 years ago

Took a look at Python/R integration into Power BI today. In theory, you can load a pandas/dplyr dataframe and hand it off to a dashboard element. In practice, it doesn't seem to work well at all: it does a bad job of finding a local environment, has ugly shims that break scripts, and I was more or less unable to get anything to work.
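For reference, the hand-off in a Power BI Python visual looks roughly like this (as far as I can tell from the docs; Power BI injects the selected fields as a pandas DataFrame named `dataset`, and the column names below are placeholders):

```python
import matplotlib.pyplot as plt

# `dataset` is supplied by Power BI -- it holds whatever fields were dragged
# into the visual.
counts = dataset.groupby("case_category")["num_cases"].sum()

counts.plot(kind="bar")
plt.tight_layout()
plt.show()  # Power BI captures the matplotlib figure as the visual
```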

There are connectors for web URLs and Redshift -- those may ultimately be better options.

trantom commented 3 years ago

I would say using a default connector would significantly reduce the overhead for PMU and for others who use PowerBI to access this data. The other default connectors can also support some of the DBMS that Planning has; I think it is a matter of getting a current environment description and agreement on who may be supporting it after the transfer, by working with David Terukina and Mony's teams on the systems side. Our GIS side has data on SQL Server, and the spatial data for the web GIS lives in an ArcGIS data store.

@ian-r-rose The Python integration seems to me better suited for ad-hoc reports than for live or dynamically updated ones. I think Python visualizations should be kept in their native environment, since trying to work in Python within PowerBI adds setup overhead we want to abstract away. Also, there are data input limits and no cross-filtering, which is a key feature of PowerBI reports. Microsoft spells out the limitations here.

hunterowens commented 3 years ago

Pipeline test from Civis -> Google Sheets -> Power BI worked
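Roughly what the Civis leg of that test looks like (a sketch; the database and table names are placeholders, and it assumes the civis Python client and API key are set up). A Civis export job then pushes that table on to the Google Sheet that PowerBI reads:

```python
import civis
import pandas as pd

df = pd.read_parquet("aggregated_entitlements.parquet")  # placeholder output table

# Load the dataframe into a scratch table on the cluster; the scheduled export
# overwrites the Google Sheet on each run.
fut = civis.io.dataframe_to_civis(
    df,
    database="City of Los Angeles - Postgres",  # placeholder database name
    table="scratch.entitlements_export",        # placeholder table name
    existing_table_rows="drop",                 # replace rather than append
)
fut.result()  # block until the import job finishes
```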

ian-r-rose commented 3 years ago

> Pipeline test from Civis -> Google Sheets -> Power BI worked

Very nice

trantom commented 3 years ago

I was also able to successfully run civis-df-to-drive-test.py locally (instead of on Platform), writing to the Scratch schema on the Postgres cluster and then using a Civis export to overwrite a Google Sheet. However, the main caveat is the 5,000,000-cell limit.

100,000 records x 22 columns was okay; 200,000 records x 22 columns gives an error, even though that's not 5,000,000 cells yet (200,000 x 22 = 4,400,000, so presumably the append on top of what's already in the sheet is what pushes the workbook over the limit).

Export output log:

Failed to sync some datasets. The following errors were encountered: Too many failures uploading to Google Drive, please rerun or check job settings, the most recent failure error was: Error appending to worksheet named: Sheet1: badRequest: This action would increase the number of cells in the workbook above the limit of 5000000 cells.