deephaven / deephaven-core

Empty parquet file leads to Barrage issue #6179

Closed devinrsmith closed 1 month ago

devinrsmith commented 1 month ago

An "empty" parquet file, created via pyarrow, seems to be leading to Barrage writing issues.

from deephaven import parquet

my_table = parquet.read("Empty1.parquet")

This manifests as a "waiting for viewport" message (seemingly with 2 rows) in the web UI.

A Flight DoGet looks correct:

$ ./java-client/flight-examples/build/install/java-client-flight-examples/bin/get-table --variable my_table
1 compiler directives added
Schema<Foo: Int(64, true), Bar: Utf8>(metadata: {deephaven:attribute_type.AddOnly=java.lang.Boolean, deephaven:attribute.AddOnly=true})
Table received: 0 rows

A Barrage DoExchange looks incorrect:

$ ./java-client/barrage-examples/build/install/java-client-barrage-examples/bin/subscribe-table --variable my_table
...
Subscription established
Table info: rows = 2, cols = 2
                 Foo|       Bar
--------------------+----------
(null)              |(null)    
(null)              |(null)    

It's possible that the Parquet Table implementation is incorrect in some way (and thus leads to the Barrage issue). There is special handling on the DH side around empty row groups; that may be what is causing this.

Here is the file, Empty1.parquet.txt (the .txt extension was added so it could be uploaded to GitHub). It was generated with the following snippet:

import pyarrow as pa
import pyarrow.parquet as pq

fields = [
    pa.field("Foo", pa.int64()),
    pa.field("Bar", pa.string()),
]

table = pa.table([[] for _ in fields], schema=pa.schema(fields))

pq.write_table(table, "Empty1.parquet", compression="none")

Here is a quick snippet of the data as viewed through DuckDB:

D SELECT * FROM parquet_metadata('Empty1.parquet');
┌─────────────────────┬──────────────┬────────────────────┬──────────────────────┬─────────────────┬───────────┬─────────────┬────────────┬────────────────┬────────────┬───────────┬───────────┬──────────────────┬──────────────────────┬─────────────────┬─────────────────┬──────────────┬────────────┬───────────────────┬──────────────────────┬──────────────────┬──────────────────────┬──────────────────────┬────────────────────┐
│      file_name      │ row_group_id │ row_group_num_rows │ row_group_num_colu…  │ row_group_bytes │ column_id │ file_offset │ num_values │ path_in_schema │    type    │ stats_min │ stats_max │ stats_null_count │ stats_distinct_count │ stats_min_value │ stats_max_value │ compression  │ encodings  │ index_page_offset │ dictionary_page_of…  │ data_page_offset │ total_compressed_s…  │ total_uncompressed…  │ key_value_metadata │
│       varchar       │    int64     │       int64        │        int64         │      int64      │   int64   │    int64    │   int64    │    varchar     │  varchar   │  varchar  │  varchar  │      int64       │        int64         │     varchar     │     varchar     │   varchar    │  varchar   │       int64       │        int64         │      int64       │        int64         │        int64         │  map(blob, blob)   │
├─────────────────────┼──────────────┼────────────────────┼──────────────────────┼─────────────────┼───────────┼─────────────┼────────────┼────────────────┼────────────┼───────────┼───────────┼──────────────────┼──────────────────────┼─────────────────┼─────────────────┼──────────────┼────────────┼───────────────────┼──────────────────────┼──────────────────┼──────────────────────┼──────────────────────┼────────────────────┤
│ /tmp/Empty1.parquet │            0 │                  0 │                    2 │              28 │         0 │          18 │          0 │ Foo            │ INT64      │           │           │                  │                      │                 │                 │ UNCOMPRESSED │ PLAIN, RLE │                   │                    4 │                0 │                   14 │                   14 │ {}                 │
│ /tmp/Empty1.parquet │            0 │                  0 │                    2 │              28 │         1 │          70 │          0 │ Bar            │ BYTE_ARRAY │           │           │                  │                      │                 │                 │ UNCOMPRESSED │ PLAIN, RLE │                   │                   56 │                0 │                   14 │                   14 │ {}                 │
└─────────────────────┴──────────────┴────────────────────┴──────────────────────┴─────────────────┴───────────┴─────────────┴────────────┴────────────────┴────────────┴───────────┴───────────┴──────────────────┴──────────────────────┴─────────────────┴─────────────────┴──────────────┴────────────┴───────────────────┴──────────────────────┴──────────────────┴──────────────────────┴──────────────────────┴────────────────────┘
D SELECT * FROM parquet_kv_metadata('Empty1.parquet');
┌─────────────────────┬──────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│      file_name      │     key      │                                                                                                                    value                                                                                                                     │
│       varchar       │     blob     │                                                                                                                     blob                                                                                                                     │
├─────────────────────┼──────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ /tmp/Empty1.parquet │ ARROW:schema │ /////6gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABAAAAABAAAANj///8AAAEFEAAAABgAAAAEAAAAAAAAAAMAAABCYXIABAAEAAQAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQIQAAAAHAAAAAQAAAAAAAAAAwAAAEZvbwAIAAwACAAHAAgAAAAAAAABQAAAAAAAAAA= │
└─────────────────────┴──────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
D SELECT * FROM parquet_schema('Empty1.parquet');
┌─────────────────────┬─────────┬────────────┬─────────────┬─────────────────┬──────────────┬────────────────┬───────┬───────────┬──────────┬──────────────┐
│      file_name      │  name   │    type    │ type_length │ repetition_type │ num_children │ converted_type │ scale │ precision │ field_id │ logical_type │
│       varchar       │ varchar │  varchar   │   varchar   │     varchar     │    int64     │    varchar     │ int64 │   int64   │  int64   │   varchar    │
├─────────────────────┼─────────┼────────────┼─────────────┼─────────────────┼──────────────┼────────────────┼───────┼───────────┼──────────┼──────────────┤
│ /tmp/Empty1.parquet │ schema  │            │             │ REQUIRED        │            2 │                │       │           │          │              │
│ /tmp/Empty1.parquet │ Foo     │ INT64      │             │ OPTIONAL        │              │                │       │           │          │              │
│ /tmp/Empty1.parquet │ Bar     │ BYTE_ARRAY │             │ OPTIONAL        │              │ UTF8           │       │           │          │ StringType() │
└─────────────────────┴─────────┴────────────┴─────────────┴─────────────────┴──────────────┴────────────────┴───────┴───────────┴──────────┴──────────────┘
D SELECT * FROM read_parquet('Empty1.parquet');
┌───────┬─────────┐
│  Foo  │   Bar   │
│ int64 │ varchar │
├───────┴─────────┤
│     0 rows      │
└─────────────────┘
nbauernfeind commented 1 month ago

The issue is with the parquet table itself.

print(my_table.j_table.getRowSet())
{0--1}

But it should be empty and displayed as {}.
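For illustration only, here is a minimal sketch of how an inverted range such as {0--1} could arise from a zero-row row group. This is not Deephaven's actual code: the function names and the contiguous key layout are assumptions made for the example. It computes an inclusive (first, last) range per row group, once without and once with a guard for empty groups.

```python
from typing import List, Tuple


def naive_ranges(row_group_sizes: List[int]) -> List[Tuple[int, int]]:
    """Inclusive (first, last) row-key range per row group, with no empty-group check.

    Hypothetical helper: assumes row groups are laid out back to back from key 0.
    """
    ranges = []
    first = 0
    for size in row_group_sizes:
        last = first + size - 1  # size == 0 yields last == first - 1 (inverted)
        ranges.append((first, last))
        first = last + 1
    return ranges


def guarded_ranges(row_group_sizes: List[int]) -> List[Tuple[int, int]]:
    """Same layout assumption, but zero-row groups contribute no range."""
    ranges = []
    first = 0
    for size in row_group_sizes:
        if size == 0:
            continue  # an empty row group adds no row keys
        ranges.append((first, first + size - 1))
        first += size
    return ranges


print(naive_ranges([0]))    # [(0, -1)]
print(guarded_ranges([0]))  # []
```

With a single row group of size 0, the unguarded version produces the pair (0, -1), which matches the shape of the row set printed above, while the guarded version produces no ranges at all.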

devinrsmith commented 1 month ago

At a minimum, we should be able to assert that all parquet tables are "flat" - I believe that would have caught this bad row set. In addition, we might ask if the rowset code itself should catch this sort of bad rowset.

rcaudy commented 1 month ago

> At a minimum, we should be able to assert that all parquet tables are "flat" - I believe that would have caught this bad row set. In addition, we might ask if the rowset code itself should catch this sort of bad rowset.

We actually can't assert that. The RowSet is an artifact of how our code responds to the arrangement of row groups and their sizes. We will not produce a flat RowSet if a Table is backed by more than one Parquet file or more than one row group.
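To make the distinction concrete, here is a rough sketch that separates the two checks discussed in this thread: "well-formed" (no inverted or overlapping ranges) versus "flat" (keys are exactly 0..size-1). The range representation, the function names, and the STRIDE constant are hypothetical, chosen only for the example; they are not Deephaven's RowSet API or its real address-space layout.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive (first, last) row keys


def is_well_formed(ranges: List[Range]) -> bool:
    """True if every range is non-negative, non-inverted, and strictly after the previous one."""
    prev_last = -1
    for first, last in ranges:
        if first < 0 or last < first or first <= prev_last:
            return False
        prev_last = last
    return True


def is_flat(ranges: List[Range]) -> bool:
    """True if the ranges are well-formed and cover exactly keys 0..size-1 with no gaps."""
    if not is_well_formed(ranges):
        return False
    expected_first = 0
    for first, last in ranges:
        if first != expected_first:
            return False
        expected_first = last + 1
    return True


# Hypothetical spacing between row groups; the real layout is not stated in this thread.
STRIDE = 1 << 20

bad_empty = [(0, -1)]                              # the shape seen in this issue
two_groups = [(0, 2), (STRIDE, STRIDE + 1)]        # two row groups, assumed spaced apart

print(is_well_formed(bad_empty), is_flat(bad_empty))    # False False
print(is_well_formed(two_groups), is_flat(two_groups))  # True False
```

Under these assumptions, a flatness assertion would reject valid multi-row-group tables, whereas a well-formedness check would flag only the inverted range.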

malhotrashivam commented 1 month ago

The linked parquet file has an empty row group, and we have a known issue that our engine does not support parquet files with empty row groups (#5530).

devinrsmith commented 1 month ago

This does manifest differently though, without an explicit error.

malhotrashivam commented 1 month ago

Yup, the error is different and kind of gets hidden, but the root cause is the same. I have added a fix for it in #6183.