digital-land / technical-documentation

Technical Documentation for the planning data service.
https://digital-land.github.io/technical-documentation/index.html

Spike: Datasette Parquet #135

Open Ben-Hodgkiss opened 2 hours ago

Ben-Hodgkiss commented 2 hours ago

Overview

We know that datasette is not a sustainable solution for interrogating pipeline artefacts and transformed data, at least for the application, because of how the data gets updated. One option is to leave the data in S3 as parquet files and use the datasette-parquet plugin to access the data. Specifically:

Can it connect and query directly from S3, or do the files need to be local? (If so, how hard is it to set up, given that httpfs is supported by DuckDB?)

Does the performance seem comparable?

Can any bugs be sorted out by creating our own version of the plugin?
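On the first question, DuckDB's httpfs extension can read `s3://` URLs once it is loaded and given S3 settings. The following is a minimal sketch of the statements a spike might run, not the plugin's own code. The endpoint, credentials, region and example URL are assumptions chosen to match a default LocalStack setup; none of them come from this issue.

```python
def sql_literal(value):
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def httpfs_setup_sql(endpoint="localhost:4566", key_id="test",
                     secret="test", region="us-east-1"):
    """Return DuckDB statements that load httpfs and point it at an
    S3-compatible endpoint. All defaults are assumed LocalStack values."""
    return [
        "INSTALL httpfs",
        "LOAD httpfs",
        f"SET s3_endpoint={sql_literal(endpoint)}",
        "SET s3_url_style='path'",   # path-style URLs suit a local endpoint
        "SET s3_use_ssl=false",      # assumed plain HTTP on the local endpoint
        f"SET s3_access_key_id={sql_literal(key_id)}",
        f"SET s3_secret_access_key={sql_literal(secret)}",
        f"SET s3_region={sql_literal(region)}",
    ]


def count_rows_sql(s3_url):
    """Build a query that counts rows in a parquet file read directly from S3."""
    return f"SELECT count(*) FROM read_parquet({sql_literal(s3_url)})"


def count_rows(s3_url):
    """Run the setup and the count against a live endpoint (needs duckdb installed)."""
    import duckdb  # third-party; imported lazily so the builders work without it

    con = duckdb.connect()
    for statement in httpfs_setup_sql():
        con.execute(statement)
    return con.execute(count_rows_sql(s3_url)).fetchone()[0]
```

With the local stack running, `count_rows("s3://some-bucket/issues.parquet")` would exercise the whole path; the bucket name here is hypothetical.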

Pull Request(PR):

Tech Approach

datasette-parquet does not appear to be updated frequently, but it is all open source, so can we take a copy of it to create our own open-source plugin that we can maintain and improve?

I have already created a plugin to play around with here https://github.com/digital-land/datasette-digital-land

You can create a branch of datasette-builder to play with. Docker Compose can be used to create LocalStack images with S3 buckets; see https://github.com/digital-land/collection-task/blob/main/docker-compose.yaml. Owen has some parquet files if needed.
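Once the local stack is up, a parquet file has to be placed in a bucket before anything can query it. A minimal sketch using boto3 follows. The endpoint URL, the dummy credentials and the region are assumptions based on LocalStack defaults, and the bucket, key and file path are whatever the spike chooses.

```python
def localstack_client_kwargs(endpoint_url="http://localhost:4566"):
    """Connection settings for an S3 client aimed at a local stack.
    All values are assumed LocalStack defaults, not taken from the compose file."""
    return {
        "endpoint_url": endpoint_url,
        "region_name": "us-east-1",
        "aws_access_key_id": "test",
        "aws_secret_access_key": "test",
    }


def upload_parquet(local_path, bucket, key):
    """Create the bucket and upload one parquet file, returning its s3:// URL."""
    import boto3  # third-party; imported lazily so the helper above has no dependencies

    s3 = boto3.client("s3", **localstack_client_kwargs())
    # us-east-1 needs no CreateBucketConfiguration; other regions would.
    s3.create_bucket(Bucket=bucket)
    s3.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"
```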

Acceptance Criteria/Tests

We need to see a locally running branch of datasette-builder that is served from an S3 bucket rather than local storage.

Is the local performance of this acceptable?
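One way to put numbers on this is to time the same query against local files and against the S3-backed view. The helper below is a sketch of such a harness; `run_query` is a hypothetical callable supplied by whoever runs the spike, and the repeat count is arbitrary.

```python
import statistics
import time


def time_query(run_query, repeats=5):
    """Call run_query several times and summarise the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }
```

Running it once per backend with an identical query gives two summaries to compare. The first S3 call may be slower than later ones, so the median is likely a fairer figure than the maximum.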

ssadhu-sl commented 2 hours ago

Major Changes made to datasette-parquet plugin:

Updated metadata.json with the S3 credentials and set "httpfs": true to enable direct access to the Parquet files stored in S3
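For illustration, a metadata.json of this shape could be generated as below. This is a sketch only. The `"httpfs": true` setting is the one named above. The `"directory"` key reflects my understanding of the upstream plugin's layout and is an assumption. The credential key names used on the spike branch are not given here, so they are left out rather than guessed.

```python
import json


def build_metadata(database_name, directory):
    """Build a metadata dict that enables httpfs for one datasette-parquet database.
    Assumption: plugin config is keyed by database name with a "directory" entry."""
    return {
        "plugins": {
            "datasette-parquet": {
                database_name: {
                    "directory": directory,
                    "httpfs": True,
                }
            }
        }
    }


def write_metadata(path, database_name, directory):
    """Write the metadata to disk as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(build_metadata(database_name, directory), handle, indent=2)
```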

Modified the plugin code (in __init__.py and related files) to ensure compatibility with the view-creation SQL and direct access to the S3-hosted Parquet files

To summarise, I have made a view which reads directly from a parquet file and is served by datasette. I tested this with an issues.parquet file stored in S3 in LocalStack.
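The view described here can be expressed as a single DuckDB statement over `read_parquet`. The sketch below builds that statement; it is an illustration of the approach, not the SQL used on the branch, and the view name and URL in the usage line are hypothetical.

```python
def create_view_sql(view_name, s3_url):
    """Build a DuckDB statement defining a view over a parquet file in S3."""
    identifier = '"' + view_name.replace('"', '""') + '"'  # quote the view name
    literal = "'" + s3_url.replace("'", "''") + "'"        # quote the URL
    return (
        f"CREATE OR REPLACE VIEW {identifier} AS "
        f"SELECT * FROM read_parquet({literal})"
    )
```

For example, `create_view_sql("issues", "s3://some-bucket/issues.parquet")` yields a statement that can be executed on a DuckDB connection once httpfs is loaded and the S3 settings are in place.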

ssadhu-sl commented 2 hours ago

Branches: https://github.com/digital-land/datasette-digital-land/tree/datasette_parquet_spike