cagov / caldata-mdsa-caltrans-pems

CalData's MDSA project with Caltrans on Performance Measurement System (PeMS) data
https://cagov.github.io/caldata-mdsa-caltrans-pems/
MIT License

Test PowerBI connections to S3 data #308

Closed: jkarpen closed this issue 1 month ago

jkarpen commented 1 month ago

Data is being loaded into S3 in Parquet format. This data needs to be accessible to PowerBI. There is a known issue where PowerBI's Parquet connector does not work when connecting to S3 (details here).

The goal for this issue is to test the following alternative options for accessing this data from PowerBI:

Sample URL for Parquet file: https://caltrans-pems-dev-us-west-2-marts.s3.us-west-2.amazonaws.com/dbt_irose_performance/station_metrics_agg_monthly.parquet

Sample URL for Gzipped version: https://caltrans-pems-dev-us-west-2-marts.s3.us-west-2.amazonaws.com/dbt_irose_performance/station_metrics_agg_monthly.csv.gz

Example Python code Ian wrote to test:

import pandas
df = pandas.read_parquet("https://caltrans-pems-dev-us-west-2-marts.s3.us-west-2.amazonaws.com/dbt_irose_performance/station_metrics_agg_monthly.parquet")
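
For completeness, the gzipped CSV sample URL can be read the same way with pandas; a minimal sketch (standard pandas usage, not something tested as part of this issue):

import pandas

# pandas infers gzip compression from the .csv.gz extension on the URL
csv_url = "https://caltrans-pems-dev-us-west-2-marts.s3.us-west-2.amazonaws.com/dbt_irose_performance/station_metrics_agg_monthly.csv.gz"
df = pandas.read_csv(csv_url)
print(df.head())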

jkarpen commented 1 month ago

Example of the error message received when trying to connect PowerBI to the parquet file on S3 using the Parquet connector:


jkarpen commented 1 month ago

From an online article:

https://blog.crossjoin.co.uk/2021/03/07/parquet-files-in-power-bi-power-query-and-the-streamed-binary-values-error/

"The error is raised by Power Query but it’s not really a limitation of Power Query: it’s to do with how Power Query accesses the Parquet file. If the Parquet file is on a local file system you won’t have this problem, but if Power Query needs to access the Parquet file via an API then you probably will because most APIs don’t allow the kind of random file access that is necessary to read data from Parquet (ADLSgen2 being an exception). The workaround with Binary.Buffer is simply avoiding the API by downloading the entire file into local memory and accessing it from there, but then you will run into the limit on container size (see https://blog.crossjoin.co.uk/2019/04/21/power-bi-dataflow-container-size/); on a gateway the container size is calculated relative to the amount of memory on the machine it’s running on (see https://blog.crossjoin.co.uk/2022/02/13/speed-up-power-bi-refresh-by-increasing-the-amount-of-memory-on-your-on-premises-data-gateway-machine/) so your best bet is to increase the amount of RAM on the gateway PC."

jkarpen commented 1 month ago

Was able to successfully connect to the Parquet data on S3 using a Python connector. Note that this requires Python to be installed locally with the following packages: pandas, matplotlib, and pyarrow or fastparquet (pyarrow alone is sufficient).


jkarpen commented 1 month ago

@jkarpen will add the steps here for the Python connector, then close this issue. Will also add Jianfei's slide deck for the other options presented.

jkarpen commented 1 month ago

Steps to implement the Python connector option:

  1. Install Python on the local machine (or server for the PowerBI reporting service). The following packages must be installed: pandas, matplotlib, pyarrow
  2. In PowerBI, choose "Get Data", then "Get data to get started" to view more connector options
  3. Choose the Python connector option
  4. Paste this code:

import pandas
df = pandas.read_parquet("https://caltrans-pems-dev-us-west-2-marts.s3.us-west-2.amazonaws.com/dbt_irose_performance/station_metrics_agg_monthly.parquet")

  5. The resulting data frame has all fields showing as the text data type. Numeric fields need to be converted in PowerBI (a one-time transformation for a given file). This can be done in the Edit Query menu: select all the fields that need to be changed, right-click the header of one of them, and choose Change Type > Decimal Number (or the appropriate data type). An alternative that does the conversion in the Python script itself is sketched below.

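As an alternative to converting types in Edit Query, the conversion can be done in the Python script itself so PowerBI receives the columns already typed. This is only a sketch; the column names below are placeholders, not actual fields from station_metrics_agg_monthly:

import pandas

df = pandas.read_parquet("https://caltrans-pems-dev-us-west-2-marts.s3.us-west-2.amazonaws.com/dbt_irose_performance/station_metrics_agg_monthly.parquet")

# Placeholder column names; substitute the numeric fields from the file.
numeric_columns = ["example_metric_1", "example_metric_2"]
for column in numeric_columns:
    df[column] = pandas.to_numeric(df[column], errors="coerce")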

jkarpen commented 1 month ago

Power BI connect Amazon S3.pptx

Attaching the slide deck created by Jianfei Wu showing access using a custom M Query with the Binary.Buffer option for reading in the Parquet file. It also includes an example of a Python API connection. In this case the S3 bucket is publicly accessible, so the access key parameters are not needed, but they would be useful if Caltrans ever needs to use a non-public S3 bucket.
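
For reference, if a non-public bucket is ever needed, pandas can pass credentials through its storage_options argument (this requires the s3fs package; the sketch below is not code from the slide deck, and the bucket path and credential values are placeholders):

import pandas

df = pandas.read_parquet(
    "s3://example-private-bucket/path/to/file.parquet",  # placeholder path
    storage_options={
        "key": "YOUR_ACCESS_KEY_ID",         # placeholder credentials
        "secret": "YOUR_SECRET_ACCESS_KEY",
    },
)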

jkarpen commented 1 month ago

Closing this issue; we have confirmed there are multiple options for reading Parquet files into PowerBI.