AlexR2D2 / metabase_duckdb_driver

Metabase DuckDB Driver shipped as 3rd party plugin
Apache License 2.0

Open partitioned parquet files #16

Closed dvictori closed 8 months ago

dvictori commented 1 year ago

I'm trying out Metabase with DuckDB for processing and viewing a large parquet file. I'm running Metabase from a Docker container, following the instructions in this repo's README.

I was able to query several single parquet files using

select * from 'path/to/parquet_single_data/*.parquet'

However, I can't seem to find out how to do the same for a parquet dataset that was created with partitions. My parquet dataset is partitioned into years, with one subdirectory per year.

This is what I tried:

select * from read_parquet('path/to/parquet_data/*.parquet')

and

select * from read_parquet('path/to/parquet_data/*/*.parquet')

Neither worked.

Is it possible to open partitioned parquet files in duckdb?

Edit: Looking at the DuckDB documentation, one should use parquet_scan for this. But I'm getting: Cannot invoke "Object.getClass()" because "target" is null

https://duckdb.org/docs/archive/0.8.1/data/partitioning/hive_partitioning
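For intuition, the two glob shapes tried above behave like ordinary filesystem globbing over a hive-partitioned tree (e.g. parquet_data/year=2020/part-0.parquet). A minimal stdlib sketch with placeholder files (hypothetical layout, not the issue's actual data; note DuckDB resolves `**` itself, while Python's glob needs recursive=True):

```python
import glob
import os
import tempfile

# Build a toy hive-partitioned layout: <root>/year=YYYY/part-0.parquet
root = tempfile.mkdtemp()
for year in (2020, 2021):
    part_dir = os.path.join(root, f"year={year}")
    os.makedirs(part_dir)
    # empty placeholder standing in for a real parquet partition file
    open(os.path.join(part_dir, "part-0.parquet"), "w").close()

# '*/*.parquet' descends exactly one partition level
one_level = glob.glob(os.path.join(root, "*", "*.parquet"))

# '**/*.parquet' descends any number of levels
any_level = glob.glob(os.path.join(root, "**", "*.parquet"), recursive=True)

print(len(one_level), len(any_level))  # 2 2
```

Both patterns match this single-level layout, which is why the CLI query with `*/*` works; the difference only shows up with deeper partition nesting.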

dvictori commented 1 year ago

As a complement, I checked using the DuckDB CLI, and the following query works.

I'm using Metabase v1.46.2 and DuckDB v0.8.1.

select * from read_parquet('/app/dados/dados_parquet/sicor_operacao_basica.parquet/*/*')


K377U commented 1 year ago

I'm using DuckDB views to do this, and also setting the data type of each field instead of using star notation.

CREATE OR REPLACE VIEW {table_name} AS (
SELECT {fields} FROM read_parquet(['path/to/parquet_data/*/*.parquet'], HIVE_PARTITIONING=true)
)

With this you can add or remove data from the path and it will automatically show up in Metabase. I'm not sure how well the caching works when the data changes.

dvictori commented 1 year ago

I was having this issue when using a Duckdb database "in memory". But following what you did, I'm now creating a duckdb database on disk, inserting a bunch of views that use read_parquet and loading that file on Metabase. It's working just fine and I can use partitioned parquet files.

dvictori commented 1 year ago

I misread your comment. After some testing, I realized that when opening a partitioned parquet dataset inside Metabase, one must declare all fields contained in the files.

For instance, this works:

select REF_BACEN, NU_ORDEM from read_parquet('/dados/dados_parquet/sicor_operacao_basica.parquet/*/*')

but this does not

select * from read_parquet('/dados/dados_parquet/sicor_operacao_basica.parquet/*/*')

I'm reopening the issue because I think that select * from ... should work regardless of whether we have a single parquet file or a partitioned dataset.

In case it helps, I'm attaching the metabase log that I get when I execute the query.

metabase_partitioned_parquet_errorerr.log

Not sure if it's related, but I'm also seeing this warning when I re-scan the DuckDB fields in Metabase:

proagrodb_docker-metabase-1  | 2023-09-29 12:13:03,190 WARN query-processor.deprecated :: Atenção: driver :duckdb está usando Honey SQL 1. Este método foi descontinuado em 0.46.0 e será excluído em uma versão futura.

Sorry for the message in Portuguese. It says that the driver is using Honey SQL 1, which was deprecated in 0.46.0 and will be removed in a future version.

AlexR2D2 commented 1 year ago

but this does not

select * from read_parquet('/dados/dados_parquet/sicor_operacao_basica.parquet/*/*')

Hi! Did you try this directly in DuckDB? Does it work?

dvictori commented 1 year ago

but this does not

select * from read_parquet('/dados/dados_parquet/sicor_operacao_basica.parquet/*/*')

Hi! Did you try this directly in DuckDB? Does it work?

Yes, it works using the DuckDB CLI. I pasted the output a couple of comments above:

https://github.com/AlexR2D2/metabase_duckdb_driver/issues/16#issuecomment-1730194178

AlexR2D2 commented 10 months ago

Hi, could you try this again using the latest version of the Metabase plugin?

select * from read_parquet('/Users/alex/Documents/Dev/duckdb/cars_part/**/*.parquet', hive_partitioning=true);

or

select * from read_parquet('/Users/alex/Documents/Dev/duckdb/cars_part/**/*.parquet');

dvictori commented 8 months ago

Sorry for the long delay in answering. I'm installing Metabase and the DuckDB driver using the following Dockerfile. Will that give me the latest version of the Metabase plugin?

FROM openjdk:19-buster

ENV MB_PLUGINS_DIR=/home/plugins/

ADD https://downloads.metabase.com/v0.47.5/metabase.jar /home
ADD https://github.com/AlexR2D2/metabase_duckdb_driver/releases/download/0.2.3/duckdb.metabase-driver.jar /home/plugins/

RUN chmod 744 /home/plugins/duckdb.metabase-driver.jar

CMD ["java", "-jar", "/home/metabase.jar"]

dvictori commented 8 months ago

I just tested using Metabase installed as shown above and a partitioned parquet dataset. Everything worked fine!

The following syntaxes worked:

select * from read_parquet('/data/iris_part/**/*.parquet')
select * from read_parquet('/data/iris_part/**/*')

select * from read_parquet('/data/iris_part/*/*.parquet')
select * from read_parquet('/data/iris_part/*/*')

select * from read_parquet('/data/iris_part/**')