apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

OutOfMemoryError in job commit / ParquetMetadataConverter #1436

Open asfimport opened 9 years ago

asfimport commented 9 years ago

We're trying to write some 14B rows (about 3.6 TB) to Parquet files. When our ETL job finishes, it throws this exception, and the status is "died in job commit".

2015-05-14 09:24:28,158 FATAL [CommitterEvent Processor #4] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[CommitterEvent Processor #4,5,main] threw an Error. Shutting down now...
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.nio.ByteBuffer.wrap(ByteBuffer.java:373)
    at java.nio.ByteBuffer.wrap(ByteBuffer.java:396)
    at parquet.format.Statistics.setMin(Statistics.java:237)
    at parquet.format.converter.ParquetMetadataConverter.toParquetStatistics(ParquetMetadataConverter.java:243)
    at parquet.format.converter.ParquetMetadataConverter.addRowGroup(ParquetMetadataConverter.java:167)
    at parquet.format.converter.ParquetMetadataConverter.toParquetMetadata(ParquetMetadataConverter.java:79)
    at parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:405)
    at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:433)
    at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:423)
    at parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
    at parquet.hadoop.mapred.MapredParquetOutputCommitter.commitJob(MapredParquetOutputCommitter.java:43)
    at org.apache.hadoop.mapred.OutputCommitter.commitJob(OutputCommitter.java:259)
    at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobCommit(CommitterEventHandler.java:253)
    at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:216)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

This seems to have something to do with the _metadata file creation, as the Parquet files themselves are perfectly fine and usable. Also, I'm not sure how to alleviate this (e.g. by adding more heap space), since the crash happens outside the Map/Reduce tasks themselves and appears to be in the job/application controller.

Environment: CentOS, MapR, Scalding
Reporter: hy5446

Note: This issue was originally created as PARQUET-282. Please see the migration documentation for further details.

asfimport commented 9 years ago

Ryan Blue / @rdblue: The commit happens on a single node, so you must have too much metadata to summarize in memory. You can add more memory, or turn off the summary metadata by setting parquet.enable.summary-metadata to false in your job configuration.
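For reference, a minimal sketch of how that property could be set in a Hadoop job configuration; the class and job name here are illustrative, not from the original report:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DisableParquetSummary {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Skip writing the _metadata summary file during job commit,
        // which is where the OutOfMemoryError above is thrown.
        conf.setBoolean("parquet.enable.summary-metadata", false);
        Job job = Job.getInstance(conf, "etl-to-parquet"); // hypothetical job name
        // ... configure mapper, reducer and ParquetOutputFormat as usual ...
    }
}
```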

asfimport commented 9 years ago

hy5446: OK, thanks for the reply. Would you know what setting can be used to increase the memory?

asfimport commented 9 years ago

Tim / @tsdeng: You can set it with: -Dyarn.app.mapreduce.am.resource.mb=8192 -Dyarn.app.mapreduce.am.command-opts=-Xmx8000m
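The same settings could also be applied programmatically before submitting the job. A sketch assuming the 8 GB figures above (the values are only a starting point and may need tuning for your cluster):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RaiseAmMemory {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Give the MapReduce ApplicationMaster (which runs the job commit,
        // including the _metadata summary) a bigger container and larger heap,
        // mirroring the -D options above.
        conf.set("yarn.app.mapreduce.am.resource.mb", "8192");
        conf.set("yarn.app.mapreduce.am.command-opts", "-Xmx8000m");
        Job job = Job.getInstance(conf, "etl-to-parquet"); // hypothetical job name
        // ... configure input/output formats and submit as usual ...
    }
}
```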

asfimport commented 5 years ago

Qinghui Xu / @qinghui-xu: This doesn't look like a problem in parquet-mr itself; should we close it?