Johannes Müller: We're having the exact same issue. We would like to write to Parquet for the convenience and the obvious benefits of the format, but it just seems impossible to do without a lot of overhead, including a Hadoop installation.
David Mollitor / @belugabehr: Parquet 2.0 anyone?
Gabor Szadovszky / @gszadovszky: @belugabehr, based on the current community activity I wouldn't say parquet 2.0 is feasible any time soon. :(
Ben Watson: I also had this problem and I hope I can help you. I maintain an IntelliJ plugin that lets people view Avro and Parquet files. I assumed that creating this plugin would be a trivially easy task, but I've been surprised at just how much effort it is and how many bugs people still keep finding.
I asked about this on Stack Overflow a while ago and got an answer that works. The solution I implemented does have some Hadoop dependencies, but the critical difference is that it uses org.apache.parquet.io.InputFile and does not require org.apache.hadoop.fs.Path.
This skips a lot of Hadoop libraries and helped me avoid a lot of JAR hell issues. It works on Windows without any additional setup or PATH changes.
Feel free to copy the relevant code from my repo:
Disclaimer: this does not produce valid JSON if logical types are used, as the date strings are not surrounded by quotes (see this open SO question).
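For reference, here is a minimal sketch of that style of read path. The file name is made up, and org.apache.parquet.io.LocalInputFile only ships with recent parquet-java releases; on older versions you can implement InputFile yourself, as the Stack Overflow answer does. Some Hadoop classes are still needed on the classpath, just not org.apache.hadoop.fs.Path in your own code.

import java.nio.file.Paths;

import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.io.LocalInputFile;

public class InputFileReadSketch {
    public static void main(String[] args) throws Exception {
        // Open the file through Parquet's own InputFile abstraction instead of a Hadoop Path.
        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(new LocalInputFile(Paths.get("data.parquet")))
                     .build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                System.out.println(record);
            }
        }
    }
}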
mark juchems: Ben,
Thanks for your work on this! We use your plugin all the time. It is quite indispensable.
All,
Keep in mind that, at least for us, no Hadoop install was needed in Fargate on AWS. So we just used these jars and it worked. We of course used the AWS jar to write to S3. We wanted to write to an OutputStream but couldn't find a way to do that.
Here is our code:
/**
 * This saves it to the "parquet" folder at the base of this project.
 */
public void saveIt(List<Map<String, Object>> theData, String fileName) throws Exception {
    Schema avroSchema = AvroSchemaBuilder.build("pojo", theData.get(0));
    System.out.println(avroSchema);

    Configuration conf = new Configuration();
    Path path = new Path("parquet/" + StringUtils.substringBeforeLast(fileName, ".") + ".parquet");

    ParquetWriter<GenericRecord> writer = AvroParquetWriter.<GenericRecord>builder(path)
            .withSchema(avroSchema)
            .withConf(conf)
            .withCompressionCodec(CompressionCodecName.SNAPPY)
            .withWriteMode(Mode.OVERWRITE)
            .build();

    for (Map<String, Object> row : theData) {
        StopWatch stopWatch = StopWatch.createStarted();
        final GenericRecord record = new GenericData.Record(avroSchema);
        row.forEach(record::put);
        writer.write(record);
    }
    writer.close();
}
Xinyu Zeng / @XinyuZeng: Hi community, just wondering whether there is any update to this issue?
Gang Wu / @wgtmac: @amousavigourabi Is this issue complete? Should we resolve it?
Gang Wu / @wgtmac: I marked this as resolved for the 1.14.0 release. Feel free to reopen if this issue is considered incomplete.
Ryan Rupp / @ryanrupp: Is 1.14.0 to the point that Parquet can be used without the hadoop-client dependency? I was playing around with it and observed:
- builder and withConf: even though I'm using the new Parquet interface overloads, the compiler will complain about the Hadoop classes not being available. I can "trick" this, though, by adding Hadoop as a provided-scope dependency. This is on Java 11 FWIW.
- After that I believe I hit PARQUET-2353, but didn't dig into it too far.
There are some comments on the PR, like this one, that sound like people are doing this to an extent, maybe with a dependency on hadoop-common (instead of hadoop-client). Does anyone have an example of minimal Hadoop dependencies being pulled in? The tests like TestReadWrite already have Hadoop on the classpath, for instance.
Thanks
Gang Wu / @wgtmac: Hi @ryanrupp, I tried to reproduce the issue and can confirm that I have to add hadoop-client-api as the minimal Hadoop dependency: Maven Repository: org.apache.hadoop » hadoop-client-api (mvnrepository.com). Otherwise, my compiler complained about missing Hadoop-related class files.
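To make the shape of that setup concrete, here is a hedged sketch of a write using the 1.14.0-era interfaces, compiled with only hadoop-client-api added. The class names org.apache.parquet.conf.PlainParquetConfiguration and org.apache.parquet.io.LocalOutputFile are from recent parquet-java releases, the file name is made up, and, as the follow-up below shows, some code paths still build a Hadoop Configuration at runtime, so additional Hadoop runtime jars may still be needed.

import java.nio.file.Paths;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.conf.PlainParquetConfiguration;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.io.LocalOutputFile;

public class HadoopFreeWriteSketch {
    public static void write(Schema avroSchema, List<GenericRecord> records) throws Exception {
        // Pass a ParquetConfiguration instead of a Hadoop Configuration, and a
        // java.nio.file.Path-backed OutputFile instead of an org.apache.hadoop.fs.Path.
        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new LocalOutputFile(Paths.get("out.parquet")))
                     .withSchema(avroSchema)
                     .withConf(new PlainParquetConfiguration())
                     .withCompressionCodec(CompressionCodecName.SNAPPY)
                     .build()) {
            for (GenericRecord record : records) {
                writer.write(record);
            }
        }
    }
}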
Ryan Rupp / @ryanrupp:
@wgtmac thanks, do you have any sample code by any chance? If I use hadoop-client-api as a dependency I can get further along (it compiles, etc.), but it then results in an exception downstream when trying to go from ParquetConfiguration -> Configuration (Hadoop). It looks like it's because hadoop-client-api does some shading, but it fails to classload a woodstox class that it expects to be shaded but isn't:
java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/com/ctc/wstx/io/InputBootstrapper
at org.apache.parquet.hadoop.util.ConfigurationUtil.createHadoopConfiguration(ConfigurationUtil.java:60)
at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:929)
// test code
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.com.ctc.wstx.io.InputBootstrapper
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 74 more
I'm guessing that's because this dependency is really just meant for API users, and at runtime hadoop-client with all relevant transitive dependencies would be in place? I'm actually not familiar with when you would use hadoop-client-api vs hadoop-client.
Description: I have been trying for weeks to create a Parquet file from Avro and write it to S3 in Java. This has been incredibly frustrating and odd, as Spark can do it easily (I'm told).
I have assembled the correct jars through luck and diligence, but now I find out that I have to have Hadoop installed on my machine. I am currently developing on Windows, and it seems a DLL and an EXE can fix that up, but I am wondering about Linux, as the code will eventually run in Fargate on AWS.
Why do I need external dependencies and not pure java?
The thing really is how utterly complex all this is. I would like to create an Avro file, convert it to Parquet, and write it to S3, but I am trapped in "ParquetWriter" hell!
Why can't I get a normal OutputStream and write it wherever I want?
I have scoured the web for examples and there are a few, but we really need some documentation on this stuff. I understand that there may be reasons for all this, but I can't find them anywhere on the web. Any help? Can't we get a "SimpleParquet" jar that does this:
ParquetWriter<GenericRecord> writer = AvroParquetWriter.<GenericRecord>builder(outputStream)
        .withSchema(avroSchema)
        .withConf(conf)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .withWriteMode(Mode.OVERWRITE) // probably not good for prod (overwrites files)
        .build();
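For what it is worth, something close to this is possible through Parquet's own org.apache.parquet.io.OutputFile abstraction: you can wrap an arbitrary OutputStream yourself and hand it to AvroParquetWriter. The adapter below is a hypothetical sketch (the class name and its naive position tracking are mine, not something shipped by parquet-java):

import java.io.IOException;
import java.io.OutputStream;

import org.apache.parquet.io.OutputFile;
import org.apache.parquet.io.PositionOutputStream;

// Hypothetical adapter: exposes a plain OutputStream as a Parquet OutputFile.
public class OutputStreamOutputFile implements OutputFile {
    private final OutputStream out;

    public OutputStreamOutputFile(OutputStream out) {
        this.out = out;
    }

    @Override
    public PositionOutputStream create(long blockSizeHint) throws IOException {
        return createOrOverwrite(blockSizeHint);
    }

    @Override
    public PositionOutputStream createOrOverwrite(long blockSizeHint) throws IOException {
        return new PositionOutputStream() {
            private long position = 0;

            @Override
            public long getPos() {
                return position;
            }

            @Override
            public void write(int b) throws IOException {
                out.write(b);
                position++;
            }

            @Override
            public void write(byte[] buffer, int offset, int length) throws IOException {
                out.write(buffer, offset, length);
                position += length;
            }

            @Override
            public void close() throws IOException {
                out.close();
            }
        };
    }

    @Override
    public boolean supportsBlockSize() {
        return false;
    }

    @Override
    public long defaultBlockSize() {
        return 0;
    }
}

With an adapter like that, the wished-for builder call works almost verbatim: AvroParquetWriter.<GenericRecord>builder(new OutputStreamOutputFile(outputStream)).withSchema(avroSchema)... and so on. Writing only needs a forward-only stream, because Parquet lays out the data pages first and the footer last; reading is the harder direction, since the footer must be read first, which is why InputFile has to support seeking.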
Environment: Amazon Fargate (Linux), Windows development box. We are writing Parquet to be read by the Snowflake and Athena databases.
Reporter: mark juchems
Assignee: Atour Mousavi Gourabi / @amousavigourabi
Note: This issue was originally created as PARQUET-1822. Please see the migration documentation for further details.