daphne-eu / daphne

DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
Apache License 2.0

Read from parquet does not work #587

Open m-birke opened 1 year ago

m-birke commented 1 year ago

The Parquet file exists and can be read with https://parquet-viewer-online.com

Executing the DSL script prints a frame in which all 10 values are nan:

// read_from_parquet.daph
group_data = readFrame("/path/to/data.parquet");

print(group_data);

OUTPUT:

Frame(5x2, [Id:double, Vds:double])
nan nan
nan nan
nan nan
nan nan
nan nan

// /path/to/data.parquet.meta
{
  "numRows": 5,
  "numCols": 2,
  "schema": [
    {
      "label": "Id",
      "valueType": "f64"
    },
    {
      "label": "Vds",
      "valueType": "f64"
    }
  ]
}
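The metadata file above can also be generated rather than written by hand. The following is a minimal sketch, assuming the sidecar format is exactly the JSON shown above (`numRows`, `numCols`, and a `schema` list of `label`/`valueType` pairs). `daphne_meta_json` is a hypothetical helper name and not part of DAPHNE.

```python
import json


def daphne_meta_json(labels, value_types, num_rows):
    """Build the text of a .meta sidecar file in the layout shown above.

    Hypothetical helper: the field names are copied from the example
    metadata file in this issue, nothing else is assumed.
    """
    if len(labels) != len(value_types):
        raise ValueError("labels and value_types must have the same length")
    meta = {
        "numRows": num_rows,
        "numCols": len(labels),
        "schema": [
            {"label": label, "valueType": value_type}
            for label, value_type in zip(labels, value_types)
        ],
    }
    return json.dumps(meta, indent=2)


if __name__ == "__main__":
    # Reproduces the metadata for the 5x2 example frame (Id, Vds as f64).
    print(daphne_meta_json(["Id", "Vds"], ["f64", "f64"], 5))
```

The returned string would then be saved next to the data file, here as /path/to/data.parquet.meta.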

This is how I created the parquet file:

import pyarrow.parquet as pq
from pathlib import Path
import pyarrow.csv as csv

def main(path: str):

    p = Path(path)
    ro = csv.ReadOptions(column_names = ["Id", "Vds"])
    table = csv.read_csv(p, read_options=ro)    

    print("Arrow table from csv ----------------------------------------------------------------------------")
    print(f"Num cols in table: {table.num_columns}")
    print(f"Num rows in table: {table.num_rows}")
    print(table)
    print("Writing arrow table -----------------------------------------------------------------------------")
    destpth = p.with_suffix(".parquet")
    print(f"writing to {destpth.resolve()}")
    pq.write_table(table, destpth, use_dictionary=False)
    print("Reading back again from parquet file-------------------------------------------------------------")
    rtable=pq.read_table(p.with_suffix(".parquet"))
    print(rtable)
    print("pandas repr.:")
    print(rtable.to_pandas())

from a csv file which looks like this:

0.1184805,4.2727
0.026556,4.2356
-0.0653686,4.248
-0.0347271,4.248
-0.0040855,4.6257

Sample Parquet files are available here (the data in them is much more complex): https://github.com/kaysush/sample-parquet-files/tree/main

corepointer commented 1 year ago

Hi! I just looked into this. Apparently we don't support reading frames from Parquet files yet (see src/runtime/local/kernels/Read.h) :see_no_evil: When I changed readFrame() to readMatrix() and tried the suggested sample inputs, I noticed that we don't build Arrow with support for the Snappy codec. Regards, Mark

m-birke commented 1 year ago

I extended the Python script to specify the compression algorithm explicitly:

compression = "GZIP"  # also tried "SNAPPY" and "BROTLI"
pq.write_table(table, destpth, use_dictionary=False, compression=compression)

With readMatrix() and all three compression algorithms I get the same error:

[error]: Execution error: Could not read Parquet table

corepointer commented 1 year ago

The Snappy-compressed sample Parquet files from the link above can be read without error by DAPHNE if I compile Arrow with Snappy support. I'll incorporate all Arrow-supported compression formats in the next Docker image update.

m-birke commented 1 year ago

Hi @corepointer

With the new Docker image I am able to read parquet files. Tested for snappy, brotli and gzip.

Thank you!

Should we close this issue and open a new one requesting readFrame() support for Parquet, or keep this one open?

KR

corepointer commented 1 year ago

As I mentioned in the commit message of c1100d8, this is just a partial fix, as the Parquet reader seems to be quite limited at the moment. In my tests with the sample files from the link above, I noticed that I get a nice matrix if I use readMatrix() (because readFrame() is not supported here). That matrix contains mostly zeros, though, because the reader does not handle the required data types correctly (e.g., "345.0" becomes 0 because this floating-point value is stored as a string in the Parquet file). Furthermore, the inputs are converted to CSV in memory and then read from there. That CSV reader also had some parsing issues (it didn't tokenize well between the commas). So if it is working for you now (do all values from your input get read correctly?), you could close the issue. We can also deal with the shortcomings of the reader in another bug report with a more detailed error description.
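To illustrate the string-typed values and the CSV tokenizing mentioned above, here is a small stdlib-only Python sketch. It is not DAPHNE's reader and makes no claim about how that reader works internally; it only shows one way a quoted field such as "345.0" can be tokenized and converted explicitly, with unparseable fields mapped to nan instead of silently becoming 0. The name `parse_numeric_row` is hypothetical.

```python
import csv
import io
import math


def parse_numeric_row(line):
    """Tokenize one CSV line and convert every field to float.

    Hypothetical helper: fields that cannot be parsed as a number
    become nan so that a bad value is visible instead of turning into 0.
    """
    # csv.reader strips surrounding double quotes, so '"345.0"' yields '345.0'.
    fields = next(csv.reader(io.StringIO(line)), [])
    values = []
    for field in fields:
        try:
            values.append(float(field.strip()))
        except ValueError:
            values.append(math.nan)
    return values


if __name__ == "__main__":
    print(parse_numeric_row('"345.0",4.2727'))
```

Running a few rows of the converted data through such a check is one way to tell whether wrong values come from the stored data types or from the tokenizing step.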

m-birke commented 1 year ago

I just tried a larger file, and unfortunately it does not work properly: at the end of the matrix there are a lot of nans instead of the values.

m-birke commented 1 year ago

It is very strange: sometimes it works properly, sometimes not
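To narrow down where the nan values start in a larger file, the printed output can be scanned automatically. The following is a minimal Python sketch, assuming the output has the layout shown earlier in this issue: one header line followed by whitespace-separated values, one row per line. `rows_with_nan` is a hypothetical helper, not part of DAPHNE.

```python
import math


def rows_with_nan(printed):
    """Return the 0-based indices of data rows that contain at least one nan.

    Hypothetical helper: any line that does not consist purely of numbers
    (for example the "Frame(...)" header) is skipped and not counted as a row.
    """
    bad_rows = []
    row_index = 0
    for line in printed.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        try:
            values = [float(token) for token in tokens]
        except ValueError:
            continue  # header or other non-numeric line
        if any(math.isnan(value) for value in values):
            bad_rows.append(row_index)
        row_index += 1
    return bad_rows


if __name__ == "__main__":
    sample = "Frame(5x2, [Id:double, Vds:double])\n" + "nan nan\n" * 5
    print(rows_with_nan(sample))
```

Comparing the reported indices across several runs on the same file would show whether the failures always begin at the same row or vary from run to run.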