apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Remove support for Hadoop <3.3 #2943

Open Fokko opened 4 months ago

Fokko commented 4 months ago

Describe the enhancement requested

Remove Hadoop 2 support

There is fallback logic for the case where Parquet needs to seek within a file.

Component(s)

No response

steveloughran commented 1 week ago

I'm going to propose this is done with a cull of the hadoop-2 profile, with other cleanup code done more incrementally.

Or would you want HadoopStreams cleaned up at the same time? It'd be a nice, tangible "this is why it is worthwhile" change.

Fokko commented 1 week ago

Hey @steveloughran! As a first PR, I'd love to remove the Hadoop 2 profile and the error-prone reflection. Next, we can do incremental cleanup. The discussion has been open on the dev-list for some time now; let me conclude it over there.
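To make the "error-prone reflection" point concrete, here is a minimal sketch of the general pattern: probe a stream class for a newer method by name and fall back to a plain byte-array read when it is missing. This is not parquet-java's actual code. `ReflectionFallbackSketch`, `BufferReadableStream` and `readFully` are hypothetical names, and it uses only JDK classes so it runs without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.lang.reflect.Method;
import java.nio.ByteBuffer;

class ReflectionFallbackSketch {

  /** Hypothetical stream with a ByteBuffer read, standing in for a "newer" API. */
  static class BufferReadableStream extends ByteArrayInputStream {
    BufferReadableStream(byte[] data) {
      super(data);
    }

    public int read(ByteBuffer target) {
      if (available() == 0) {
        return -1; // end of stream
      }
      int n = Math.min(target.remaining(), available());
      byte[] tmp = new byte[n];
      int read = read(tmp, 0, n);
      target.put(tmp, 0, read);
      return read;
    }
  }

  /** Looks the method up by name; returns null if this stream class lacks it. */
  static Method findBufferRead(Class<?> cls) {
    try {
      return cls.getMethod("read", ByteBuffer.class);
    } catch (NoSuchMethodException e) {
      return null;
    }
  }

  /** Fills target from in and reports which code path was taken. */
  static String readFully(InputStream in, ByteBuffer target) {
    Method bufferRead = findBufferRead(in.getClass());
    if (bufferRead != null) {
      try {
        while (target.hasRemaining()) {
          int n = (Integer) bufferRead.invoke(in, target);
          if (n < 0) {
            break;
          }
        }
        return "reflective";
      } catch (ReflectiveOperationException e) {
        // Failures only show up at runtime, which is what makes this fragile.
        throw new IllegalStateException(e);
      }
    }
    // Fallback path: copy through a temporary byte array.
    byte[] tmp = new byte[target.remaining()];
    int off = 0;
    try {
      while (off < tmp.length) {
        int n = in.read(tmp, off, tmp.length - off);
        if (n < 0) {
          break;
        }
        off += n;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    target.put(tmp, 0, off);
    return "fallback";
  }
}
```

With a single minimum Hadoop version, the lookup and the fallback branch could both collapse into one direct, compiler-checked method call.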

steveloughran commented 1 week ago

+1; will submit both. One thing to consider here is actually dropping the hadoop 3 version to 3.3.0 to guarantee all API/tests are against that version. Avoids any accidental use of newer classes/methods/constants etc.

Fokko commented 1 week ago

> One thing to consider here is actually dropping the hadoop 3 version to 3.3.0 to guarantee all API/tests are against that version. Avoids any accidental use of newer classes/methods/constants etc.

Yes, I was also thinking about that. I like that idea (or testing against both 3.3.x and 3.4.x).

steveloughran commented 1 week ago

I've actually been thinking about having a format-test module in Hadoop, which contains basic Parquet, Avro &c tests which we then run against object stores through the s3a, abfs and gcs connectors. That way we can identify regressions fast and test the development branches against live cloud infrastructure. There is also the option of a mini in-process HDFS cluster to test file R/W there... that can be done in parquet today.

steveloughran commented 1 week ago

w.r.t format testing, got some more thoughts there which would actually be

Someone still needs to provide keys for the target stores, so it can't be run in the public CI tests... recurrent PITA there.