Open asfimport opened 9 years ago
Ryan Blue / @rdblue:
The commit happens on a single node, and it looks like you have too much metadata to summarize in memory. You can add more memory, or turn off the summary metadata by setting parquet.enable.summary-metadata to false in your job configuration.
hy5446: OK, thanks for the reply. Would you know what setting can be used to increase the memory?
Tim / @tsdeng: You can set it with: -Dyarn.app.mapreduce.am.resource.mb=8192 -Dyarn.app.mapreduce.am.command-opts=-Xmx8000m
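For anyone assembling these options programmatically, here is a minimal Java sketch that renders the two suggestions in this thread as -D flags. The class and method names (CommitFlags, amMemory, toFlags) are made up for illustration and are not part of Hadoop or parquet-mr; only the three property names come from the comments above. The check that the heap is smaller than the container is my own assumption, based on the JVM needing some memory beyond its heap, and it does not say how much headroom is enough.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Hadoop or parquet-mr.
class CommitFlags {
    // Property names quoted in this thread.
    static final String SUMMARY_KEY = "parquet.enable.summary-metadata";
    static final String AM_MB_KEY = "yarn.app.mapreduce.am.resource.mb";
    static final String AM_OPTS_KEY = "yarn.app.mapreduce.am.command-opts";

    // Builds the application master memory settings.
    // Assumption: the heap (-Xmx) should be smaller than the container size.
    static Map<String, String> amMemory(int containerMb, int heapMb) {
        if (heapMb >= containerMb) {
            throw new IllegalArgumentException(
                "heap " + heapMb + " MB does not fit in container " + containerMb + " MB");
        }
        Map<String, String> props = new LinkedHashMap<>();
        props.put(AM_MB_KEY, Integer.toString(containerMb));
        props.put(AM_OPTS_KEY, "-Xmx" + heapMb + "m");
        return props;
    }

    // Renders properties as space-separated -Dkey=value flags, in insertion order.
    static String toFlags(Map<String, String> props) {
        StringJoiner joiner = new StringJoiner(" ");
        props.forEach((key, value) -> joiner.add("-D" + key + "=" + value));
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Option 1: skip the summary file so the commit has nothing to aggregate.
        System.out.println(toFlags(Collections.singletonMap(SUMMARY_KEY, "false")));
        // Option 2: give the application master more memory (values from the comment above).
        System.out.println(toFlags(amMemory(8192, 8000)));
    }
}
```

Either line of output can be passed to the job on its own; the two options are alternatives, not a required pair.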
Qinghui Xu / @qinghui-xu: This doesn't look like a problem in parquet-mr itself. Shall we close it?
We're trying to write some 14B rows (about 3.6 TB of Parquet data) to parquet files. When our ETL job finishes, it throws this exception, and the status is "died in job commit".
2015-05-14 09:24:28,158 FATAL [CommitterEvent Processor #4] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[CommitterEvent Processor #4,5,main] threw an Error. Shutting down now...
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.nio.ByteBuffer.wrap(ByteBuffer.java:373)
    at java.nio.ByteBuffer.wrap(ByteBuffer.java:396)
    at parquet.format.Statistics.setMin(Statistics.java:237)
    at parquet.format.converter.ParquetMetadataConverter.toParquetStatistics(ParquetMetadataConverter.java:243)
    at parquet.format.converter.ParquetMetadataConverter.addRowGroup(ParquetMetadataConverter.java:167)
    at parquet.format.converter.ParquetMetadataConverter.toParquetMetadata(ParquetMetadataConverter.java:79)
    at parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:405)
    at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:433)
    at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:423)
    at parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
    at parquet.hadoop.mapred.MapredParquetOutputCommitter.commitJob(MapredParquetOutputCommitter.java:43)
    at org.apache.hadoop.mapred.OutputCommitter.commitJob(OutputCommitter.java:259)
    at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobCommit(CommitterEventHandler.java:253)
    at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:216)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
This seems to have something to do with the _metadata file creation, as the parquet files are perfectly fine and usable. Also, I'm not sure how to alleviate this (i.e. add more heap space), since the crash happens outside the Map/Reduce tasks themselves and seems to be in the job/application controller itself.
Environment: CentOS, MapR, Scalding
Reporter: hy5446
Note: This issue was originally created as PARQUET-282. Please see the migration documentation for further details.