mhkeller closed this issue 12 months ago
I'd say the problem is that pg2parquet always writes decimals as a `fixed_len_byte_array` parquet type, even if the data would fit into Int64 or Int32. That was easier to implement, as I can write all precisions as a single data type, and it was supported by the tools that I tried to use. I'll add support for writing decimals as Int64 for better compatibility (it should also be more efficient). You'll then have to reduce the precision; the default is 128 bit AFAIK.
In the meantime, you can convert the numeric columns to float64 (I think it's called `float8` in postgres), for example `select value::float8 from my_table`.
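For context on what the `float8` cast costs, here is a minimal Python sketch (the sample value is hypothetical) showing that float64 keeps only about 15-16 significant decimal digits, so a high-precision NUMERIC no longer round-trips exactly:

```python
from decimal import Decimal

# Simulate the NUMERIC -> float8 cast: float64 holds roughly 15-16
# significant decimal digits, so longer values get rounded.
exact = Decimal("1234567890.123456789")  # 19 significant digits (made-up value)
as_float = float(exact)                  # what `value::float8` would hand back

# The round trip through float64 no longer matches the original exactly
print(Decimal(str(as_float)) == exact)  # -> False
```

For the four-to-five digit values in a dataset like iris this rounding is invisible, which is why the cast works fine as a workaround.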
I think I get it. So the issue is that I have these values stored as the postgres `NUMERIC` type. If I export them as `real` or `double precision`, the file converts properly. So what you're saying is that for the postgres `decimal` and `numeric` types, you'll change it so that they export to a parquet Int64 type, which should be easier to convert. I imagine you'll add this to the readme, but in what scenarios would the conversion lead to data loss?
I have released a new version (`0.1.0-beta.7`) which should fix the issue. If you set `--decimal-precision` to a value lower than 19, pg2parquet will serialize it as Int64, which datafusion should support. For example, `--decimal-precision=18 --decimal-scale=9` is a reasonable setting.
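The precision-19 cutoff falls out of the Int64 range; a quick sanity check in plain Python (nothing pg2parquet-specific):

```python
# An 18-digit decimal has an unscaled value of at most 10**18 - 1,
# which fits in a signed 64-bit integer; 19 digits no longer do.
INT64_MAX = 2**63 - 1  # 9223372036854775807

print(10**18 - 1 <= INT64_MAX)  # -> True: 18 digits fit
print(10**19 - 1 <= INT64_MAX)  # -> False: 19 digits overflow
```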
I have also added another option, `--numeric-handling`. That can be set to `double` to serialize all numerics as Float64, which is generally well supported. That might lead to some loss of precision; you can assume that Float64 has a precision of about 14 decimal digits. The main advantage is that the scaling is adaptive, so you never lose too much precision (while the decimal would round 0.0000000000000000000000000000000000000001 to zero, the float will preserve the small value). If you are then loading the data into a JavaScript `number`, you are converting it to Float64 anyway and the loss of precision is irrelevant.
The last option is to use `--numeric-handling=string`, which never loses precision, but it's harder to do arithmetic on it.
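The trade-off between the three modes can be sketched with Python's `decimal` module (the tiny value mirrors the example above; the scale of 9 matches `--decimal-scale=9`):

```python
from decimal import Decimal, ROUND_HALF_UP

tiny = Decimal("1E-40")  # the kind of value a fixed scale cannot hold

# decimal(18, 9): quantizing to 9 fractional digits collapses it to zero
as_decimal = tiny.quantize(Decimal("1E-9"), rounding=ROUND_HALF_UP)
# double: the adaptive exponent keeps the magnitude, at ~15 digits of precision
as_float = float(tiny)
# string: exact, but arithmetic needs a parse first
as_string = str(tiny)

print(as_decimal == 0)             # -> True: the small value is lost
print(as_float != 0)               # -> True: magnitude preserved
print(Decimal(as_string) == tiny)  # -> True: exact round trip
```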
Thanks for your work on this! I'll give the new version a try!
Glad to help, let me know how it went. I'm now reconsidering which of these options should be the default. Float64 seems like the most universal choice, while String would be the safest in terms of precision loss :thinking: I'd prioritize compatibility and ease of use - this tool is for exports, not backups (at least I hope I'm not responsible for some broken backup :sweat_smile:)
edit: Just tried this using `--numeric-handling double` and that works well.

I also tried `--decimal-precision=18 --decimal-scale=9` but I get very large numbers. Is this by design, and do I have to handle the conversion elsewhere?
Here's the same data where `sepal_length` is set to numeric.
> I also tried `--decimal-precision=18 --decimal-scale=9` but I get very large numbers. Is this by design and I have to handle the conversion elsewhere?
No, you should see a decimal number. Although, internally it is really stored as an integer with "shifted" decimal point by 9 digits. I tried it out, and it works for me (pg2parquet -> parquet2arrow -> pyarrow). I think it's most likely a bug/unsupported feature in the Arrow library you use - it loads the integer number without understanding the metadata saying it's actually a decimal.
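The "shifted" decimal point can be sketched in a few lines of Python (the sample value matches the `normal` column in the test file; the scale of 9 is the one from `--decimal-scale=9`):

```python
from decimal import Decimal

# A decimal(18, 9) value is stored as an int64 holding value * 10**9.
# A reader that ignores the scale metadata shows that raw integer,
# which matches the "very large numbers" symptom.
scale = 9
logical = Decimal("1000.0001")           # the value the column should show
raw = int(logical.scaleb(scale))         # what a scale-unaware reader prints
recovered = Decimal(raw).scaleb(-scale)  # shift the decimal point back

print(raw)        # -> 1000000100000
print(recovered)  # -> 1000.000100000
```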
However, if you want to, we can rule out the possibility of pg2parquet misbehaving in your environment. This file: testtable.zip contains a parquet and an arrow file which are "correct" (I hope, it works in pyarrow at least). Could you try to load it into your arrow loader to see if it indeed doesn't support the decimals?
> edit: Just tried this using `--numeric-handling double` and that works well.
Perfect, I guess I'll just make this the default, the decimals seem to be quite broken, and each time in a different way :facepalm:
When I load that arrow file I get this result so I think the issue is in the arrow viewer as you describe. Thanks for helping narrow it down!
I'll file an issue with that library once I get a clean reproduction for them.
To confirm, what are the expected values? Using the apache-arrow node library it's logging out as this:
Full script:
```js
import { readFileSync } from 'fs';
import { tableFromIPC } from 'apache-arrow';

const arrow = readFileSync('testfile-prec.arrow');
const table = tableFromIPC(arrow);
console.table(table.toArray());
```
When I use pyarrow, I get:
```python
In [1]: import pyarrow as pa

In [2]: f = pa.ipc.open_file("code/pg2parquet/cli/testfile-prec.arrow")

In [3]: all = f.read_all()

In [4]: all
Out[4]:
pyarrow.Table
id: int32
normal: decimal128(18, 9)
high_precision: decimal128(18, 9)
----
id: [[1,2]]
normal: [[1000.000100000,null]]
high_precision: [[1.000000000,null]]
```
DuckDB sees the same values (it can only read parquet, not arrow AFAIK)
```
D select * from 'code/pg2parquet/cli/testfile-prec.parquet';
┌───────┬────────────────┬────────────────┐
│  id   │     normal     │ high_precision │
│ int32 │ decimal(18,9)  │ decimal(18,9)  │
├───────┼────────────────┼────────────────┤
│ 1     │ 1000.000100000 │ 1.000000000    │
│ 2     │                │                │
└───────┴────────────────┴────────────────┘
```
Forgive this question if it makes no sense, but I am using this package to export a sample dataset, and then I want to convert it to the Apache Arrow format using the parquet2arrow library.
When I convert the file, I get an error saying that it can't find the Arrow schema, which is a little weird.
The error comes from this line.
I've made a reproduction here: https://github.com/mhkeller/convert-parquet-repro
When I read the parquet file in Tad, it appears just fine, but I'm wondering if the arrow conversion needs additional information that could be included by this package when the file is written.
I saw this issue, but I'm not sure why such a simple data file would not yet be supported, so I have a feeling that isn't the main issue. The types in the iris file are fairly basic.
Thanks for the help.