deephaven / deephaven-core

Deephaven Community Core

Add support to read parquet file metadata through deephaven #6126

Open malhotrashivam opened 2 months ago

malhotrashivam commented 2 months ago

This will help with remote debugging and with understanding the structure of a Parquet file. We could follow an API spec similar to DuckDB's: https://duckdb.org/docs/data/parquet/overview

malhotrashivam commented 2 months ago

One approach that @rcaudy suggested in the meantime:

If you have a raw source table in Groovy, you should be able to:

  1. Call .initialize() on it.
  2. Get its columnSourceManager field.
  3. Get the Table result of the CSM’s locationTable().
  4. Get the K-V metadata for each file by applying an update("KV = ((io.deephaven.parquet.table.location.ParquetTableLocation) _TableLocation).getParquetKey().getMetadata().getFileMetaData().getKeyValueMetaData()")

devinrsmith commented 2 months ago

It may be useful to write a little standalone utility to print out the FileMetaData as JSON; I've found this little script helpful:

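    // Assumes fileMetaData is the Thrift-generated org.apache.parquet.format.FileMetaData;
    // the Thrift classes may live under a shaded package (e.g. shaded.parquet.org.apache.thrift).
    try (final TMemoryBuffer buffer = new TMemoryBuffer(128)) {
        // Serialize the Thrift struct as human-readable JSON into the in-memory buffer
        fileMetaData.write(new TSimpleJSONProtocol(buffer));
        buffer.flush();
        System.out.println(buffer.toString(StandardCharsets.UTF_8));
    } catch (TException e) {
        // Report the failure rather than silently ignoring it
        System.err.println("Failed to serialize FileMetaData: " + e);
    }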
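
The snippet above assumes a FileMetaData object is already in hand. For a standalone utility, the first step is locating the footer in the file. Here is a minimal sketch of that step using only the JDK, assuming an unencrypted Parquet file whose tail is laid out as footer bytes, then a 4-byte little-endian footer length, then the "PAR1" magic. The class name ParquetFooterBytes and its methods are hypothetical and not part of Deephaven or parquet-java.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical helper (not a Deephaven or parquet-java API): extracts the raw footer bytes.
final class ParquetFooterBytes {
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    /** Returns the Thrift-encoded FileMetaData bytes from an unencrypted Parquet file. */
    static byte[] readFooter(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            final long size = ch.size();
            // Smallest possible file: leading magic (4) + footer length (4) + trailing magic (4)
            if (size < 12) {
                throw new IOException("Too small to be a Parquet file: " + size + " bytes");
            }
            // Tail layout: [footer bytes][4-byte little-endian footer length]["PAR1"]
            final ByteBuffer tail = readFully(ch, size - 8, 8).order(ByteOrder.LITTLE_ENDIAN);
            final int footerLength = tail.getInt();
            final byte[] magic = new byte[4];
            tail.get(magic);
            if (!Arrays.equals(magic, MAGIC)) {
                throw new IOException("Missing trailing PAR1 magic");
            }
            if (footerLength < 0 || footerLength > size - 12) {
                throw new IOException("Invalid footer length: " + footerLength);
            }
            return readFully(ch, size - 8 - footerLength, footerLength).array();
        }
    }

    // Reads exactly `length` bytes starting at `position`, looping over short reads.
    private static ByteBuffer readFully(FileChannel ch, long position, int length) throws IOException {
        final ByteBuffer buf = ByteBuffer.allocate(length);
        while (buf.hasRemaining()) {
            if (ch.read(buf, position + buf.position()) < 0) {
                throw new IOException("Unexpected end of file");
            }
        }
        buf.flip();
        return buf;
    }

    public static void main(String[] args) throws IOException {
        final byte[] footer = readFooter(Paths.get(args[0]));
        System.out.println("Footer (FileMetaData) length: " + footer.length + " bytes");
    }
}
```

The returned bytes are the FileMetaData struct in Thrift compact-protocol encoding, so they could then be decoded with the parquet-format Thrift classes and printed as JSON as in the snippet above.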