exyi / pg2parquet

Export PostgreSQL table or query into Parquet file
Apache License 2.0

Support postgres NUMERIC type fields #10

Closed. mhkeller closed this issue 12 months ago

mhkeller commented 1 year ago

Forgive this question if it makes no sense, but I am using this package to export a sample dataset, which I then want to convert to the Apache Arrow format using the parquet2arrow library.

When I convert the file, I get an error saying that it can't find the Arrow schema, which is a little weird.

Error: Failed to get Arrow schema from Parquet

Caused by:
    Arrow: Unable to convert parquet BYTE_ARRAY logical type Some(DECIMAL(DecimalType { scale: 18, precision: 38 })) or converted type DECIMAL

The error comes from this line.

I've made a reproduction here: https://github.com/mhkeller/convert-parquet-repro

When I read the parquet file in Tad, it appears just fine, but I'm wondering if the arrow conversion needs additional information that could be included by this package when the file is written.

I saw this issue, but I'm not sure why such a simple data file would not yet be supported, so I have a feeling that isn't the main issue. The types in the iris file are fairly basic.

Thanks for the help.

exyi commented 1 year ago

I'd say the problem is that pg2parquet always writes decimals as a fixed_len_byte_array Parquet type, even if the data would fit into Int64 or Int32. That was easier to implement, since I can write all precisions as a single data type, and it was supported by the tools I tried. I'll add support for writing decimals as Int64 for better compatibility (it should also be more efficient). You'll then have to reduce the precision; the default is 128-bit AFAIK.
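If you want to double-check what a given file actually contains, a small pyarrow snippet along these lines (file and column names are just placeholders) should show the Parquet-level type of the column:

import pyarrow.parquet as pq

# Parquet (not Arrow) schema of a hypothetical pg2parquet export
schema = pq.ParquetFile("iris.parquet").schema
col = schema.column(schema.names.index("sepal_length"))

# Depending on how the writer encodes decimals, physical_type may be
# FIXED_LEN_BYTE_ARRAY, BYTE_ARRAY or INT64; the logical type carries
# the decimal precision and scale.
print(col.physical_type, col.logical_type)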

In the meantime, you can cast the numeric columns to float64 (I think it's called float8 in Postgres), for example: select value::float8 from my_table.

mhkeller commented 1 year ago

I think I get it. So the issue is that I have these values stored as the Postgres NUMERIC type. If I export them as real or double precision, the file converts properly. So what you're saying is that for Postgres decimal and numeric types, you'll change it so that they export to a Parquet int64 type, which should be easier to convert. I imagine you'll add this to the readme, but in what scenarios would the conversion lead to data loss?

exyi commented 12 months ago

I have released a new version (0.1.0-beta.7) which should fix the issue. If you set --decimal-precision to a value lower than 19, pg2parquet will serialize decimals as Int64, which DataFusion should support. For example, --decimal-precision=18 --decimal-scale=9 is a reasonable setting.
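As a quick back-of-the-envelope check on why 18 digits is the cutoff (my own Python, not pg2parquet code): the largest 18-digit unscaled value still fits in a signed 64-bit integer, while 19 digits may not.

max_int64 = 2**63 - 1               # 9_223_372_036_854_775_807
print(10**18 - 1 <= max_int64)      # True  -> precision 18 fits in Int64
print(10**19 - 1 <= max_int64)      # False -> precision 19 needs a wider type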

I have also added another option, --numeric-handling. It can be set to double to serialize all numerics as Float64, which is generally well supported. That might lead to some loss of precision; you can assume that Float64 has a precision of about 14 decimal digits. The main advantage is that the scaling is adaptive, so you never lose too much precision (while the decimal would round 0.0000000000000000000000000000000000000001 to zero, float will preserve the small value). If you are then loading the data into a JavaScript number, you are converting it to Float64 anyway and the loss of precision is irrelevant.
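A small illustration of that trade-off in plain Python (my own example, not pg2parquet output): a fixed-scale decimal rounds a tiny value away entirely, while float64 keeps its magnitude but cannot hold every digit of a long value.

from decimal import Decimal

tiny = Decimal("1e-40")
print(round(tiny, 9))    # 0E-9   -> rounded to scale 9, the value disappears
print(float(tiny))       # 1e-40  -> float64 preserves the magnitude

big = Decimal("12345678901234567.89")
print(float(big))        # 1.2345678901234568e+16 -> trailing digits rounded off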

The last option is to use --numeric-handling=string, which never loses precision, but it's harder to do arithmetic on the values.
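With the string option the values round-trip exactly; you just have to parse them before computing, for example with Python's Decimal (hypothetical values below):

from decimal import Decimal

values = ["1000.0001", "2.5", "0.000000001"]   # strings as read from the export
total = sum(Decimal(v) for v in values)
print(total)                                   # 1002.500100001, exact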

mhkeller commented 12 months ago

Thanks for your work on this! I'll give the new version a try!

exyi commented 11 months ago

Glad to help; let me know how it went. I'm now reconsidering which of these options should be the default. Float64 seems like the most universal choice, while String would be the safest in terms of loss of precision :thinking: I'd prioritize compatibility and ease of use - this tool is for exports, not backups (at least I hope I'm not responsible for some broken backup :sweat_smile:)

mhkeller commented 10 months ago

edit: Just tried this using --numeric-handling double and that works well.

[Screenshot: 2023-11-09 at 11:57 PM]

I also tried --decimal-precision=18 --decimal-scale=9, but I get very large numbers. Is this by design, and do I have to handle the conversion elsewhere?

Here's the same data where sepal_length is set to numeric.

[Screenshot: 2023-11-09 at 11:55 PM]

exyi commented 10 months ago

> I also tried --decimal-precision=18 --decimal-scale=9, but I get very large numbers. Is this by design, and do I have to handle the conversion elsewhere?

No, you should see a decimal number. Internally, though, it really is stored as an integer with the decimal point "shifted" by 9 digits. I tried it out, and it works for me (pg2parquet -> parquet2arrow -> pyarrow). I think it's most likely a bug or unsupported feature in the Arrow library you use - it loads the integer number without understanding the metadata saying it's actually a decimal.
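To make the "shifted decimal point" concrete, here is a tiny Python sketch (illustrative numbers, not values from your file): a decimal(18, 9) column stores unscaled integers, so a reader that ignores the scale metadata shows values about 10^9 times too large.

from decimal import Decimal

scale = 9
unscaled = 5_100_000_000                  # what a naive reader would display
value = Decimal(unscaled) / (10 ** scale)
print(value)                              # 5.1 -> the intended decimal value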

However, if you want, we can rule out the possibility of pg2parquet misbehaving in your environment. This file, testtable.zip, contains a parquet and an arrow file which are "correct" (I hope; they work in pyarrow at least). Could you try loading it into your arrow loader to see whether it indeed doesn't support the decimals?

> edit: Just tried this using --numeric-handling double and that works well.

Perfect. I guess I'll just make this the default; the decimals seem to be quite broken, and each time in a different way :facepalm:

mhkeller commented 10 months ago

When I load that arrow file I get this result so I think the issue is in the arrow viewer as you describe. Thanks for helping narrow it down!

I'll file an issue with that library once I get a clean reproduction for them.

[Screenshot: 2023-11-10 at 11:01 AM]

mhkeller commented 10 months ago

To confirm, what are the expected values? Using the apache-arrow node library, it logs out as this:

[Screenshot: 2023-11-10 at 11:13 AM]

Full script:

import { readFileSync } from 'fs';
import { tableFromIPC } from 'apache-arrow';

// Read the Arrow IPC file and deserialize it into an Arrow Table
const arrow = readFileSync('testfile-prec.arrow');
const table = tableFromIPC(arrow);

// Print every row of the table
console.table(table.toArray());

exyi commented 10 months ago

When I use pyarrow, I get:

In [1]: import pyarrow as pa

In [2]: f=pa.ipc.open_file("code/pg2parquet/cli/testfile-prec.arrow")

In [3]: all=f.read_all()

In [4]: all
Out[4]: 
pyarrow.Table
id: int32
normal: decimal128(18, 9)
high_precision: decimal128(18, 9)
----
id: [[1,2]]
normal: [[1000.000100000,null]]
high_precision: [[1.000000000,null]]

DuckDB sees the same values (it can only read parquet, not arrow AFAIK)

D select * from 'code/pg2parquet/cli/testfile-prec.parquet';
┌───────┬────────────────┬────────────────┐
│  id   │     normal     │ high_precision │
│ int32 │ decimal(18,9)  │ decimal(18,9)  │
├───────┼────────────────┼────────────────┤
│     1 │ 1000.000100000 │    1.000000000 │
│     2 │                │                │
└───────┴────────────────┴────────────────┘