apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

Avoid holding data elements alive via stack frame gc roots. #33086

Open scwhittle opened 2 weeks ago

scwhittle commented 2 weeks ago

This is accomplished by switching to iterators that discard elements once they have been passed, and by referring only to the input stream of the current element instead of the entire stream. If the ByteString is backed by a ByteBuffer, this allows blocks that have already been advanced past to be gc'd.
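To illustrate the idea only (this is not the code in this PR), here is a minimal sketch of an iterator that drops its reference to each element as soon as it is handed out. The class name `DiscardingIterator` and the helper `retainedCount()` are hypothetical and exist purely for illustration; the sketch assumes non-null elements, since `ArrayDeque` rejects nulls.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Iterator;
import java.util.NoSuchElementException;

/**
 * Hypothetical illustration, not the Beam implementation: an iterator that
 * removes each element from its backing queue when it is returned, so the
 * iterator itself never keeps consumed elements reachable.
 */
final class DiscardingIterator<T> implements Iterator<T> {
  // The iterator owns this queue. The caller should drop its own reference
  // to the original collection, otherwise that collection remains a gc root.
  private final ArrayDeque<T> pending;

  DiscardingIterator(Collection<T> elements) {
    this.pending = new ArrayDeque<>(elements);
  }

  @Override
  public boolean hasNext() {
    return !pending.isEmpty();
  }

  @Override
  public T next() {
    // poll() removes the head and clears its slot in the backing array,
    // so the queue no longer references the returned element.
    T element = pending.poll();
    if (element == null) {
      throw new NoSuchElementException();
    }
    return element;
  }

  /** Number of elements still referenced by this iterator (for illustration). */
  int retainedCount() {
    return pending.size();
  }
}
```

With this shape, a long-running consumer whose stack frame holds only the iterator keeps alive just the elements it has not yet reached, rather than everything it has already processed.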

This isn't done for elements inlined in the process-data request, as that request is kept alive in many other places. I investigated fixing it, but it resulted in some complicated code, and since inlining is only performed for small-enough inputs, it doesn't seem worth it.



scwhittle commented 2 weeks ago

R: @robertwb

github-actions[bot] commented 2 weeks ago

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment "assign set of reviewers".

scwhittle commented 2 days ago

Run Java PreCommit