Example of the error message received when trying to connect Power BI to the Parquet file on S3 using the Parquet connector:

From an online article:
"The error is raised by Power Query but it’s not really a limitation of Power Query: it’s to do with how Power Query accesses the Parquet file. If the Parquet file is on a local file system you won’t have this problem, but if Power Query needs to access the Parquet file via an API then you probably will because most APIs don’t allow the kind of random file access that is necessary to read data from Parquet (ADLSgen2 being an exception). The workaround with Binary.Buffer is simply avoiding the API by downloading the entire file into local memory and accessing it from there, but then you will run into the limit on container size (see https://blog.crossjoin.co.uk/2019/04/21/power-bi-dataflow-container-size/); on a gateway the container size is calculated relative to the amount of memory on the machine it’s running on (see https://blog.crossjoin.co.uk/2022/02/13/speed-up-power-bi-refresh-by-increasing-the-amount-of-memory-on-your-on-premises-data-gateway-machine/) so your best bet is to increase the amount of RAM on the gateway PC."
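The Binary.Buffer workaround described in the quote looks roughly like this in Power Query M (a sketch, not the exact query from the slide deck; the URL is the sample Parquet file from this issue):

```m
let
    // Download the whole file into memory first, then parse it as Parquet.
    // Binary.Buffer avoids the random-access reads that fail over plain HTTP.
    Source = Parquet.Document(
        Binary.Buffer(
            Web.Contents("https://caltrans-pems-dev-us-west-2-marts.s3.us-west-2.amazonaws.com/dbt_irose_performance/station_metrics_agg_monthly.parquet")
        )
    )
in
    Source
```

Note that this buffers the entire file in memory, which is exactly why the container-size limits mentioned above come into play.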
Was able to successfully connect to the Parquet data on S3 using a Python connector. Note that this requires Python to be installed locally with the following packages: `pandas`, `matplotlib`, and either `pyarrow` or `fastparquet` (`pyarrow` alone is sufficient).
@jkarpen will add the steps here for the Python connector, then close this issue. Will also add Jianfei's slide deck for the other options presented.
Steps to implement the Python connector option:

1. Ensure Python and the packages listed above are installed locally.
2. In Power BI Desktop, choose Get Data → Python script and paste the following script:

```python
import pandas

df = pandas.read_parquet("https://caltrans-pems-dev-us-west-2-marts.s3.us-west-2.amazonaws.com/dbt_irose_performance/station_metrics_agg_monthly.parquet")
```

3. In the Navigator, select the `df` table to load the data.
Power BI connect Amazon S3.pptx
Attaching the slide deck created by Jianfei Wu showing access using a custom M Query with the Binary.Buffer option for reading in the Parquet file. This also includes an example of a Python API connection. In this case the S3 bucket is publicly accessible so the access key parameters are not needed, but this would be useful if there is ever a need for Caltrans to utilize a non-public S3 bucket.
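If a non-public bucket is ever needed, the same pandas approach can pass AWS credentials explicitly via `storage_options` (a sketch, assuming the optional `s3fs` package is installed alongside pandas/pyarrow; the helper names and all bucket/key/credential values here are hypothetical placeholders):

```python
import pandas


def make_s3_url(bucket: str, key: str) -> str:
    """Build an s3:// URL in the form pandas/s3fs expects."""
    return f"s3://{bucket}/{key}"


def read_private_parquet(bucket: str, key: str, access_key: str, secret_key: str):
    """Read a Parquet object from a non-public bucket using explicit credentials.

    pandas forwards storage_options to s3fs, which handles the signed requests.
    """
    return pandas.read_parquet(
        make_s3_url(bucket, key),
        storage_options={"key": access_key, "secret": secret_key},
    )
```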
Closing this issue; we have confirmed there are multiple options for reading Parquet files into Power BI.
Data is being loaded into S3 in Parquet format. This data needs to be accessible to Power BI. There is a known issue where Power BI's Parquet connector does not work when connecting to S3 (details here).
The goal for this issue is to test the following alternative options to access this data from PowerBI:
Sample URL for Parquet file: https://caltrans-pems-dev-us-west-2-marts.s3.us-west-2.amazonaws.com/dbt_irose_performance/station_metrics_agg_monthly.parquet
Sample URL for Gzipped version: https://caltrans-pems-dev-us-west-2-marts.s3.us-west-2.amazonaws.com/dbt_irose_performance/station_metrics_agg_monthly.csv.gz
Example Python code Ian wrote to test:

```python
import pandas

df = pandas.read_parquet("https://caltrans-pems-dev-us-west-2-marts.s3.us-west-2.amazonaws.com/dbt_irose_performance/station_metrics_agg_monthly.parquet")
```
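The gzipped CSV alternative can be read the same way with `pandas.read_csv`, which infers gzip compression from the `.gz` extension (this also works with an `https://` URL). A minimal offline sketch of the pattern, with a local file standing in for the S3 URL:

```python
import tempfile
from pathlib import Path

import pandas

with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "station_metrics.csv.gz"
    # to_csv also infers gzip from the .gz suffix, so this writes a gzipped CSV.
    pandas.DataFrame({"station_id": [1, 2], "volume": [120, 98]}).to_csv(gz_path, index=False)
    df = pandas.read_csv(gz_path)  # compression inferred from the filename
    assert df["volume"].tolist() == [120, 98]
```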