apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT]Hudi-cli cleans show OOM #7979

Open gaoshihang opened 1 year ago

gaoshihang commented 1 year ago

I use hudi-cli (version 0.11.1) to run the cleans show command, and I get an OOM exception:

hudi:ds_segments->cleans show
2023-02-15 02:33:00,699 INFO timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[20230214035435843__clean__COMPLETED]}
Command failed java.lang.OutOfMemoryError: Java heap space
Exception in thread "Spring Shell" java.lang.OutOfMemoryError: Java heap space
    at java.lang.StringCoding.decode(StringCoding.java:215)
    at java.lang.String.<init>(String.java:463)
    at org.apache.avro.util.Utf8.toString(Utf8.java:158)
    at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:322)
    at org.apache.avro.io.ResolvingDecoder.readString(ResolvingDecoder.java:219)
    at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:456)
    at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:191)
    at org.apache.avro.generic.GenericDatumReader.readArray(GenericDatumReader.java:298)
    at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:183)
    at org.apache.avro.specific.SpecificDatumReader.readField(SpecificDatumReader.java:136)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:247)
    at org.apache.avro.specific.SpecificDatumReader.readRecord(SpecificDatumReader.java:123)
    at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
    at org.apache.avro.generic.GenericDatumReader.readMap(GenericDatumReader.java:354)
    at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:185)
    at org.apache.avro.specific.SpecificDatumReader.readField(SpecificDatumReader.java:136)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:247)
    at org.apache.avro.specific.SpecificDatumReader.readRecord(SpecificDatumReader.java:123)
    at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
    at org.apache.avro.file.DataFileStream.next(DataFileStream.java:251)
    at org.apache.avro.file.DataFileStream.next(DataFileStream.java:236)
    at org.apache.hudi.common.table.timeline.TimelineMetadataUtils.deserializeAvroMetadata(TimelineMetadataUtils.java:206)
    at org.apache.hudi.common.table.timeline.TimelineMetadataUtils.deserializeHoodieCleanMetadata(TimelineMetadataUtils.java:170)
    at org.apache.hudi.cli.commands.CleansCommand.showCleans(CleansCommand.java:74)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:216)
    at org.springframework.shell.core.SimpleExecutionStrategy.invoke(SimpleExecutionStrategy.java:68)
2023-02-15 02:36:14,433 INFO support.GenericApplicationContext: Closing org.springframework.context.support.GenericApplicationContext@47ef968d: startup date [Wed Feb 15 02:32:39 UTC 2023]; root of context hierarchy
2023-02-15 02:36:14,435 INFO support.DefaultLifecycleProcessor: Stopping beans in phase 1
2023-02-15 02:36:14,441 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system...
2023-02-15 02:36:14,441 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped.
2023-02-15 02:36:14,442 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete.

Then I checked the code in CleansCommand.java and found that cleans show first collects all completed clean instants and then deserializes the Avro metadata of every one of them into memory, which causes the OOM:

    HoodieActiveTimeline activeTimeline = HoodieCLI.getTableMetaClient().getActiveTimeline();
    HoodieTimeline timeline = activeTimeline.getCleanerTimeline().filterCompletedInstants();
    // Every completed clean instant is materialized into a list up front.
    List<HoodieInstant> cleans = timeline.getReverseOrderedInstants().collect(Collectors.toList());
    List<Comparable[]> rows = new ArrayList<>();
    for (HoodieInstant clean : cleans) {
      // The whole .clean file is read and deserialized from Avro here; a large
      // instant file can exhaust the heap (see the stack trace above).
      HoodieCleanMetadata cleanMetadata = TimelineMetadataUtils.deserializeHoodieCleanMetadata(timeline.getInstantDetails(clean).get());
      rows.add(new Comparable[]{clean.getTimestamp(), cleanMetadata.getEarliestCommitToRetain(),
              cleanMetadata.getTotalFilesDeleted(), cleanMetadata.getTimeTakenInMillis()});
      cleanMetadata = null; // no-op: the local reference is dropped each iteration anyway
    }

    TableHeader header =
        new TableHeader().addTableHeaderField(HoodieTableHeaderFields.HEADER_CLEAN_TIME)
            .addTableHeaderField(HoodieTableHeaderFields.HEADER_EARLIEST_COMMAND_RETAINED)
            .addTableHeaderField(HoodieTableHeaderFields.HEADER_TOTAL_FILES_DELETED)
            .addTableHeaderField(HoodieTableHeaderFields.HEADER_TOTAL_TIME_TAKEN);
    return HoodiePrintHelper.print(header, new HashMap<>(), sortByField, descending, limit, headerOnly, rows);

Can we do some optimization here?
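One possible direction, just a sketch rather than a tested patch: apply the CLI's row limit before deserializing, so only the instants that will actually be displayed are read. The limit parameter is already in scope in showCleans (it is passed to HoodiePrintHelper.print above), but note this only helps with the default reverse-instant-time ordering, since sorting by metadata fields still needs every instant, and it would not save a single oversized .clean file from OOMing on its own:

    List<Comparable[]> rows = timeline.getReverseOrderedInstants()
        // Stop before touching instants that would never be shown.
        .limit(limit > 0 ? limit : Long.MAX_VALUE)
        .map(clean -> {
          try {
            HoodieCleanMetadata m = TimelineMetadataUtils
                .deserializeHoodieCleanMetadata(timeline.getInstantDetails(clean).get());
            return new Comparable[]{clean.getTimestamp(), m.getEarliestCommitToRetain(),
                m.getTotalFilesDeleted(), m.getTimeTakenInMillis()};
          } catch (IOException e) {
            throw new HoodieException("Failed to deserialize clean metadata", e);
          }
        })
        .collect(Collectors.toList());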

yihua commented 1 year ago

Hi @gaoshihang, thanks for reporting this. Are you able to identify which clean instant causes the OOM exception? How large are the <instant_time>.clean files under the .hoodie/ folder? I'm wondering if leveraging Spark to deserialize the clean metadata would help here.
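To check the sizes, something along these lines with the Hadoop FileSystem API should work against s3a as well (the base path below is a placeholder for your table path):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListCleanFileSizes {
      public static void main(String[] args) throws Exception {
        // Placeholder: point this at the table's .hoodie folder.
        Path timelinePath = new Path("s3a://your-bucket/your-table/.hoodie");
        FileSystem fs = timelinePath.getFileSystem(new Configuration());
        for (FileStatus status : fs.listStatus(timelinePath)) {
          // Completed clean instants are stored as <instant_time>.clean files.
          if (status.getPath().getName().endsWith(".clean")) {
            System.out.printf("%s\t%d bytes%n", status.getPath().getName(), status.getLen());
          }
        }
      }
    }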

gaoshihang commented 1 year ago

Thanks! I will add some logging and run some tests to identify which clean instant causes this exception and how large it is.
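Roughly, the idea is to log each payload's size right before it is deserialized, so the culprit instant shows up in the output even if the JVM dies on the next line (sketch only, assuming a Logger named LOG is available in CleansCommand):

    for (HoodieInstant clean : cleans) {
      byte[] details = timeline.getInstantDetails(clean).get();
      // Serialized size of this instant's clean metadata, before deserialization.
      LOG.info("Clean instant " + clean.getTimestamp() + ": " + details.length + " bytes");
      HoodieCleanMetadata cleanMetadata = TimelineMetadataUtils.deserializeHoodieCleanMetadata(details);
      rows.add(new Comparable[]{clean.getTimestamp(), cleanMetadata.getEarliestCommitToRetain(),
              cleanMetadata.getTotalFilesDeleted(), cleanMetadata.getTimeTakenInMillis()});
    }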

ad1happy2go commented 1 year ago

@gaoshihang Did you get the chance to work on the above? Are you still facing this issue?

ad1happy2go commented 1 year ago

@gaoshihang Gentle ping.