Open m-birke opened 1 year ago
Hi!
I just looked into this. Apparently we don't support reading frames from Parquet files yet (see src/runtime/local/kernels/Read.h) :see_no_evil:
When I changed readFrame() to readMatrix() and used the suggested sample inputs, I noticed that we don't build Arrow with support for the Snappy codec.
Regards, Mark
I changed the Python script to specify the compression algorithm explicitly, like this:
compression = "GZIP"  # or "SNAPPY" or "BROTLI"
pq.write_table(table, destpth, use_dictionary=False, compression=compression)
With readMatrix() and all three compression algorithms I get the same error:
[error]: Execution error: Could not read Parquet table
The Snappy-compressed sample parquet files from the link above can be read without error by DAPHNE if I compile Arrow with Snappy support. I'll incorporate all compression formats supported by Arrow in the next Docker image update.
Hi @corepointer
With the new Docker image I am able to read parquet files. Tested for snappy, brotli and gzip.
Thank you!
Should we close this issue and open a new one requesting readFrame() support for parquet, or keep this one open?
KR
As I mentioned in the commit message of c1100d8, this is just a partial fix as the parquet reader seems to be quite limited at the moment.
In my tests with the sample files from the link above I noticed that I get a nice matrix if I use readMatrix() (because readFrame() is not supported here). The nice matrix would contain mostly zeros, as the reader does not handle the required data types correctly (e.g., "345.0" would become 0 because this floating point value is stored as a string in the parquet file). Furthermore, the inputs are converted to csv in memory and then read from there. That csv reader also had some parsing issues (it didn't tokenize well between the commas).
So if it is working for you now (do all values from your input get read correctly?), you could close the issue. We can also deal with the shortcomings of the reader in another bug report with a more detailed error description.
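To make the data type problem described above concrete, here is a minimal Python sketch of the behaviour one would expect from the in-memory csv step: numeric values stored as strings (such as "345.0") are parsed as floats instead of silently becoming 0, and cells that cannot be parsed become nan. This is not DAPHNE's actual reader; the function names `to_double` and `parse_csv_matrix` and the sample text are hypothetical and only serve as an illustration.

```python
import csv
import io
import math


def to_double(cell: str) -> float:
    """Parse a string cell as a float; return nan if it is not numeric."""
    try:
        return float(cell.strip())
    except ValueError:
        # Non-numeric or empty cells are reported as nan rather than 0.
        return math.nan


def parse_csv_matrix(text: str) -> list[list[float]]:
    """Tokenize in-memory csv text and convert every cell to a float."""
    # skipinitialspace=True ignores blanks directly after a comma.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [[to_double(cell) for cell in row] for row in reader if row]


# Hypothetical sample: numbers stored as (partly quoted) strings.
sample = '"345.0", 12,abc\n7.5,"1e3",\n'
matrix = parse_csv_matrix(sample)
```

Returning nan for unparsable cells is a design choice made for this sketch: it keeps a failed conversion visible instead of letting it look like a legitimate zero.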
I just tried a larger file, and unfortunately it does not work properly: at the end of the matrix there are a lot of NaNs instead of the values.
It is very strange: sometimes it works properly, sometimes not
The parquet file exists and can be read with https://parquet-viewer-online.com
Executing the DSL script prints a frame in which all (10) values are nan
OUTPUT:
This is how I created the parquet file:
from a csv file which looks like this:
Sample parquet files can be acquired here (data is much more complex here): https://github.com/kaysush/sample-parquet-files/tree/main