apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Allow parquet::WriterProperties::created_by to be set via pyarrow.ParquetWriter for compatibility with older parquet-mr #29986

Open asfimport opened 2 years ago

asfimport commented 2 years ago

I have a couple of files and am using pyarrow (0.17) to save them as Parquet on disk (parquet version 1.4).

Columns: id: string, val: string

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# df is a pandas DataFrame with the columns above
table = pa.Table.from_pandas(df)
pq.write_table(table, "df.parquet", version="1.0", flavor="spark", write_statistics=True)
```

However, Hive and Spark does not recognize the parquet version:

```
org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-cpp version 1.5.1-SNAPSHOT using format: (.+) version ((.*) )?\(build ?(.*)\)
    at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
    at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
```

 

It seems related to this issue; quoting a comment from the Stack Overflow thread linked below:

> It appears you've encountered PARQUET-349, which was fixed in 2015 before Arrow was even started. The underlying C++ code does allow this created_by field to be customized (source), but the Python wrapper does not expose this (source).

  

EDIT: additional info

The current Python wrapper does NOT expose the created_by builder option (when writing Parquet to disk):

https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pxd#L361

 

But this is available in the C++ version:

https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/properties.h#L249

https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pxd#L320
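For reference, the created_by value is already readable from Python via the file metadata; only setting it is missing. A minimal sketch (the file name is from the example above):

```python
import pyarrow.parquet as pq

# Read back the footer metadata of a file written by pyarrow.
# created_by is exposed read-only; there is no writer-side option
# in pyarrow to customize it.
meta = pq.ParquetFile("df.parquet").metadata
print(meta.created_by)  # e.g. "parquet-cpp version 1.5.1-SNAPSHOT"
```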

 

This creates the issue shown in the stack trace above when the Hadoop Parquet reader reads such a pyarrow-written Parquet file.

Stack Overflow question: https://stackoverflow.com/questions/69658140/how-to-save-a-parquet-with-pandas-using-same-header-than-hadoop-spark-parquet?noredirect=1#comment123131862_69658140

 

Reporter: Kevin


Note: This issue was originally created as ARROW-14422. Please see the migration documentation for further details.

asfimport commented 2 years ago

Weston Pace / @westonpace: The Python change should be pretty straightforward (although it will add yet another keyword option to a rather long list).

@emkornfield do you know off the top of your head if there are any further gotchas that will likely be encountered trying to create files for a parquet-mr version this old? Is there a compatibility table anywhere with minimum version support?

asfimport commented 2 years ago

Kevin: Alternatively, is there a way to "overwrite" the parquet-mr Java JAR files in Hadoop from the client side (i.e., Hive, ...)?

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: In principle, there should be no need to expose this in Python, since you can't actually influence who created the file. Of course, if that value causes problems in other software, that could be a reason; but then we should maybe rather consider changing the value itself in C++. And that is something we actually already did recently, in the 4.0 release (ARROW-7830). So it might be that updating your pyarrow version also fixes the issue.

asfimport commented 2 years ago

Kevin: In fastparquet, there is a flag export="hive" which makes the output compatible with the parquet-mr reader. As a workaround I am using fastparquet, although keeping pyarrow would be better.
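For illustration, a minimal sketch of that workaround (note the actual fastparquet keyword is file_scheme rather than export, as corrected in the comments below; the DataFrame mirrors the columns above):

```python
import pandas as pd
from fastparquet import write

df = pd.DataFrame({"id": ["1", "2"], "val": ["a", "b"]})

# file_scheme="hive" writes a directory containing a _metadata file
# plus data files, a layout that older parquet-mr readers accept.
write("df_parquet_dir", df, file_scheme="hive")
```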

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: Upgrading your pyarrow version is not an option?

asfimport commented 2 years ago

Kevin: Sure, which version? (Because I cannot see relevant changes in recent ones.)

FYI, the fastparquet export works well with the file_scheme keyword: hive-style output is a directory with a single metadata file and several data files.

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche:

> which version?

I think at minimum pyarrow 4.0 (as mentioned above, that's the version in which we changed what is written to the created_by field).

> file_scheme keyword: hive-style output is a directory with a single metadata file and several data files.

That's unrelated to this issue. If you want to write a partitioned hive-style parquet dataset with pyarrow, you can use pq.write_to_dataset instead of pq.write_table (see the sketch below). The reason that fastparquet works for you is that it sets something else in the created_by field.
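For illustration, a minimal sketch of the pq.write_to_dataset approach (the partition column year is hypothetical):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": ["1", "2"], "val": ["a", "b"], "year": [2020, 2021]})

# Writes a hive-style directory layout with one subdirectory per
# partition value (year=2020/, year=2021/), each holding data files.
pq.write_to_dataset(table, root_path="dataset_dir", partition_cols=["year"])
```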

asfimport commented 2 years ago

Weston Pace / @westonpace: The current created_by output won't help. PARQUET-349 means that the parquet-mr reader will fail unless the created_by string contains the word "build".

I agree that adding the word "build" to the C++ created_by string would be another way to solve this issue. We could change "parquet-cpp-arrow version 6.0.0-SNAPSHOT" to "parquet-cpp-arrow build 6.0.0-SNAPSHOT" but I don't know how I feel about that either.

asfimport commented 2 years ago

Weston Pace / @westonpace: Actually, the regex is

```java
"(.+) version ((.*) )?\\(build ?(.*)\\)";
```

so we would need "parquet-cpp-arrow version 6.0.0 (build SNAPSHOT)".
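To make the failure mode concrete, a quick check of that regex against both strings (a sketch using Python's re module; the strings are taken from the comments above):

```python
import re

# The parquet-mr VersionParser regex quoted above.
pattern = re.compile(r"(.+) version ((.*) )?\(build ?(.*)\)")

# The current pyarrow-style string has no "(build ...)" part, so it fails.
print(pattern.match("parquet-cpp-arrow version 6.0.0-SNAPSHOT"))   # None

# The proposed form parses successfully.
print(pattern.match("parquet-cpp-arrow version 6.0.0 (build SNAPSHOT)"))  # match
```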

asfimport commented 2 years ago

Kevin: Will it work with Hive 0.13?

Maintaining some regression tests between the pyarrow export and parquet-mr may be useful.

file_scheme="hive" in fastparquet is very useful.

asfimport commented 2 years ago

Micah Kornfield / @emkornfield:

> Maintaining some regression tests between the pyarrow export and parquet-mr may be useful.

Agreed, there were some proposals, but it appears no one has had time to devote to this. I'm also not sure it would help in this case since, as Weston pointed out, the broken parquet-mr version is old and we would likely only test a few versions.

> I agree that adding the word "build" to the C++ created_by string would be another way to solve this issue. We could change "parquet-cpp-arrow version 6.0.0-SNAPSHOT" to "parquet-cpp-arrow build 6.0.0-SNAPSHOT" but I don't know how I feel about that either.

I'd be more in favor of adding a build string in C++ than exposing the flag in Python (or, at least, if we expose the flag in Python, we would need to validate it to see if it is parseable). In general, I think this is fairly low level, so I'd be hesitant to expose it in more places. Using the build field to hold the git SHA hash could be interesting.

asfimport commented 2 years ago

Kevin: To confirm on my side: this is a metadata issue when reading with Hive.

0) fastparquet metadata + fastparquet parquet files: Hive can read it.

1) fastparquet ("hive") metadata + pyarrow parquet files: Hive is able to read it correctly.

This is the fastparquet hive metadata writer: https://github.com/dask/fastparquet/blob/efd3fd19a9f0dcf91045c31ff4dbb7cc3ec504f2/fastparquet/writer.py#L940

Alternatively, it would make sense to add export="hive" in addition to "spark", OR to make the "spark" flavor compatible.

Thanks

asfimport commented 2 years ago

Micah Kornfield / @emkornfield: The fastparquet created_by string has "build" as part of it. I'd guess what is happening is that Hive only looks at the metadata file and doesn't try to parse the created_by version in the data files if the metadata file is present (it only parses the metadata file's value).
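One way to verify that guess from Python (a sketch; the directory and part-file names assume fastparquet's hive layout from the earlier workaround and its default naming):

```python
import pyarrow.parquet as pq

# The _metadata file carries fastparquet's created_by string (which
# contains "build"), while the data files may carry pyarrow's string.
print(pq.read_metadata("df_parquet_dir/_metadata").created_by)
print(pq.read_metadata("df_parquet_dir/part.0.parquet").created_by)
```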

asfimport commented 2 years ago

nero: Hi there, I face a related issue when I write a parquet file with PyArrow.

Old versions of Hive can only recognize the timestamp type when it is stored as INT96, so I use pq.write_to_dataset with the use_deprecated_int96_timestamps=True option to save the parquet file. But Hive SQL will skip the timestamp conversion when the parquet file's metadata is not created_by "parquet-mr":

hive/ParquetRecordReaderBase.java at f1ff99636a5546231336208a300a114bcf8c5944 · apache/hive (github.com)
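A minimal sketch of that INT96 option (file and column names hypothetical):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"ts": pd.to_datetime(["2021-01-01", "2021-06-01"])})
table = pa.Table.from_pandas(df)

# Store timestamps using the deprecated INT96 physical type, which is
# what older Hive versions expect.
pq.write_table(table, "ts.parquet", use_deprecated_int96_timestamps=True)
```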

 

So I have to save the timestamp columns with timezone info.

But when pyarrow.parquet reads from a directory which contains parquet files created by both PyArrow and parquet-mr, the Arrow Table will ignore the timezone info for the parquet-mr files.

Maybe PyArrow can expose the created_by option? Or handle timestamp types with timezone for files created by parquet-mr?

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: [~amznero] that seems like a different issue. Can you open a different JIRA for this?

asfimport commented 2 years ago

nero: @jorisvandenbossche I created a new JIRA to describe this issue:

[ARROW-15492] [Python] handle timestamp type in parquet file for compatibility with older HiveQL - ASF JIRA (apache.org)