apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Parquet without Hadoop dependencies #2473

Closed: asfimport closed this issue 7 months ago

asfimport commented 4 years ago

I have been trying for weeks to create a Parquet file from Avro and write it to S3 in Java. This has been incredibly frustrating and odd, as Spark can do it easily (I'm told).

I have assembled the correct jars through luck and diligence, but now I find out that I have to have Hadoop installed on my machine. I am currently developing on Windows, and it seems a DLL and an EXE can fix that up, but I am wondering about Linux, as the code will eventually run in Fargate on AWS.

Why do I need external dependencies and not pure java?

The real issue is how utterly complex all this is. I would like to create an Avro file, convert it to Parquet, and write it to S3, but I am trapped in "ParquetWriter" hell!

Why can't I get a normal OutputStream and write it wherever I want?

I have scoured the web for examples and there are a few, but we really need some documentation on this stuff. I understand that there may be reasons for all this, but I can't find them on the web anywhere. Any help? Can't we get a "SimpleParquet" jar that does this:

 

```java
ParquetWriter writer = AvroParquetWriter.builder(outputStream)
    .withSchema(avroSchema)
    .withConf(conf)
    .withCompressionCodec(CompressionCodecName.SNAPPY)
    .withWriteMode(Mode.OVERWRITE) // probably not good for prod (overwrites files)
    .build();
```

 

Environment: Amazon Fargate (linux), Windows development box.

We are writing Parquet to be read by the Snowflake and Athena databases.

Reporter: mark juchems
Assignee: Atour Mousavi Gourabi / @amousavigourabi


Note: This issue was originally created as PARQUET-1822. Please see the migration documentation for further details.

asfimport commented 4 years ago

Johannes Müller: We're having the exact same issue. We would like to write to Parquet for the convenience and the obvious benefits of the format, but it seems impossible to do without a lot of overhead, up to and including a full Hadoop installation.

asfimport commented 4 years ago

David Mollitor / @belugabehr: Parquet 2.0 anyone?

asfimport commented 4 years ago

Gabor Szadovszky / @gszadovszky: @belugabehr, based on the current community activity I wouldn't say parquet 2.0 is feasible any time soon. :(

asfimport commented 4 years ago

Ben Watson: I also had this problem and I hope I can help you. I maintain an IntelliJ plugin that lets people view Avro and Parquet files. I assumed that creating this plugin would be a trivially easy task, but I've been surprised at just how much effort it is and how many bugs people still keep finding.

I asked about this on Stack Overflow a while ago and got an answer that works. The solution I implemented does have some Hadoop dependencies, but the critical difference is that it uses org.apache.parquet.io.InputFile and does not require org.apache.hadoop.Path. This skips a lot of Hadoop libraries and helped me to avoid a lot of JAR Hell issues. This works on Windows without any additional setup or PATH changes.
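The read path ends up looking roughly like this (a minimal sketch, not the plugin's actual code; it uses org.apache.parquet.io.LocalInputFile, which newer parquet-java releases ship, and on older versions a small hand-rolled InputFile adapter does the same job):

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.io.InputFile;
import org.apache.parquet.io.LocalInputFile;

import java.nio.file.Paths;

public class InputFileRead {
    public static void main(String[] args) throws Exception {
        // InputFile instead of org.apache.hadoop.fs.Path: no Hadoop Path,
        // no winutils.exe setup on Windows.
        InputFile in = new LocalInputFile(Paths.get("data.parquet")); // placeholder file
        try (ParquetReader<GenericRecord> reader =
                AvroParquetReader.<GenericRecord>builder(in).build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                System.out.println(record);
            }
        }
    }
}
```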

Feel free to copy the relevant code from my repo:

asfimport commented 4 years ago

mark juchems: Ben,

Thanks for your work on this! We use your plugin all the time. It is quite indispensable.

All, 

Keep in mind that, at least for us, no Hadoop install was needed in Fargate on AWS. So we just used the jars below and it worked. We of course used the AWS jar to write to S3. We wanted an output stream but couldn't overcome that.

 

```xml
<properties>
  <java.version>11</java.version>
  <hadoop.version>3.2.1</hadoop.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.9.2</version>
  </dependency>
  <dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.11.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
</dependencies>
```

 

Here is our code:
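(The original snippet did not survive the JIRA migration; the following is a reconstruction of the usual AvroParquetWriter-to-S3 pattern with the dependencies above, not the verbatim code. The schema, bucket name, and record are placeholders.)

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetFileWriter.Mode;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class AvroToParquetS3 {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Example\","
            + "\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");

        // hadoop-aws supplies the s3a:// filesystem; credentials come from the
        // standard AWS provider chain (env vars, or the Fargate task role).
        Configuration conf = new Configuration();

        Path path = new Path("s3a://my-bucket/example.parquet"); // placeholder bucket/key

        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(path)
                .withSchema(schema)
                .withConf(conf)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .withWriteMode(Mode.OVERWRITE)
                .build()) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("id", 1L);
            writer.write(record);
        }
    }
}
```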


asfimport commented 2 years ago

Xinyu Zeng / @XinyuZeng: Hi community, just wondering whether there is any update to this issue?

asfimport commented 7 months ago

Gang Wu / @wgtmac: @amousavigourabi Is this issue complete? Should we resolve it?

asfimport commented 7 months ago

Gang Wu / @wgtmac: I marked this as resolved for the 1.14.0 release. Feel free to reopen if this issue is considered incomplete.

asfimport commented 6 months ago

Ryan Rupp / @ryanrupp: Is 1.14.0 at the point where Parquet can be used without the hadoop-client dependency? I was playing around with it and observed:

  1. The compiler complains about the new method overloads for builder and withConf: even though I'm using the new Parquet interface overloads, the compiler complains about the Hadoop classes not being available. I can "trick" it, though, by adding Hadoop as a provided-scope dependency. This is on Java 11, FWIW.
  2. Once past that, if you're not using encryption, ParquetWriter's null-encryption code path still goes down a path that hits Hadoop classes. (I temporarily worked around this by removing the encryption settings, since I don't use them in this case.)
  3. After that, I believe I hit PARQUET-2353, but I didn't dig into it too far.

There are some comments on the PR like this that sound like people are doing this to an extent, maybe with a dependency on hadoop-common (instead of hadoop-client). Does anyone have an example of minimal Hadoop dependencies being pulled in? The tests like TestReadWrite already have Hadoop on the classpath, for instance.

Thanks

asfimport commented 6 months ago

Gang Wu / @wgtmac: Hi @ryanrupp, I tried to reproduce the issue and can confirm that I had to add hadoop-client-api as the minimal Hadoop dependency: Maven Repository: org.apache.hadoop » hadoop-client-api (mvnrepository.com). Otherwise, my compiler complained about missing Hadoop-related class files.
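A minimal dependency set along those lines looks like this (a sketch; the versions are illustrative, not a tested combination):

```xml
<dependencies>
  <!-- reader/writer API plus the Avro binding -->
  <dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.14.0</version>
  </dependency>
  <!-- compile-time Hadoop surface only; as the next comments show,
       something may still be missing at runtime -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client-api</artifactId>
    <version>3.3.6</version>
  </dependency>
</dependencies>
```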

asfimport commented 5 months ago

Ryan Rupp / @ryanrupp: @wgtmac thanks, do you have any sample code by any chance? If I use hadoop-client-api as a dependency I can get further along (it compiles, etc.), but it then results in an exception downstream when trying to go from ParquetConfiguration to Configuration (Hadoop). It looks like hadoop-client-api does some shading, and it fails to classload a Woodstox class that it expects to be shaded but isn't:


```
java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/com/ctc/wstx/io/InputBootstrapper
    at org.apache.parquet.hadoop.util.ConfigurationUtil.createHadoopConfiguration(ConfigurationUtil.java:60)
    at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:929)
    // test code
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.com.ctc.wstx.io.InputBootstrapper
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 74 more
```
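For context, the call that triggers this is roughly the following (a reconstructed sketch of the test code, using the new hadoop-free 1.14 overloads; the schema and output path are placeholders):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.conf.PlainParquetConfiguration;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.io.LocalOutputFile;

import java.nio.file.Paths;

public class HadoopFreeWriteRepro {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Example\","
            + "\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");

        // builder(OutputFile) + withConf(ParquetConfiguration): no Hadoop types
        // appear in user code, but build() still materializes a Hadoop
        // Configuration internally (ConfigurationUtil.createHadoopConfiguration),
        // which is where the NoClassDefFoundError above comes from.
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new LocalOutputFile(Paths.get("out.parquet")))
                .withSchema(schema)
                .withConf(new PlainParquetConfiguration())
                .build()) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("id", 1L);
            writer.write(record);
        }
    }
}
```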

I'm guessing this is because hadoop-client-api is really just meant for API users, and at runtime "hadoop-client" with all the relevant transitive dependencies would be in place? I'm actually not familiar with when you would use hadoop-client-api vs hadoop-client.