apache / incubator-heron

Apache Heron (Incubating) is a realtime, distributed, fault-tolerant stream processing engine from Twitter
https://heron.apache.org/
Apache License 2.0
3.65k stars 599 forks source link

Removed random long in filename which caused leaking in upload storage #3838

Closed nicknezis closed 2 years ago

nicknezis commented 2 years ago

Fixes #3769

If you exec into the Bookkeeper pod, you can run the following command in /opt/bookkeeper/bin ./dlog tool watch -u distributedlog://heron-zookeeper:2181/heron

When I upload the same package, I can see it deleting the old one, and then uploading the new one

Old streams :
acking-root-tag-0.tar.gz
New streams :

Old streams :
New streams :
acking-root-tag-0.tar.gz

Without this fix, this bit of code is never executed. The package name never matches, so it never deletes the old package.

nicknezis commented 2 years ago

Also reverting this change which was added 2 years ago as a bandaid to remedy #3769 .

windhamwong commented 2 years ago

The commit looks good. Just wonder what was the original reason of using a random name in bookkeeper upload?

nicknezis commented 2 years ago

The commit looks good. Just wonder what was the original reason of using a random name in bookkeeper upload?

It looks like a bit of code added early in this PR. Someone suggested they use a timestamp instead of random to see which package was newest. But it doesn't seem to have much thought put behind the original Random Long. I'll have to look further back in the history to determine when it was added.

nwangtw commented 2 years ago

Based on the unit test: testGenerateFilename, it seems like the file name is expected to be unique.

nicknezis commented 2 years ago

Based on the unit test: testGenerateFilename, it seems like the file name is expected to be unique.

Interesting, the test still passes on my local MacOS. I'll have to dig into it further. I'll have to make it fail the way it did in TravicCI, and then fix the logic. Hopefully just a simple environment issue on my end.

Looking in each Uploader, they seem to have logic for uploading a package with the same name. But this logic would never be needed if the intention was to always generate a unique filename. I really think this is a relic from the old Twitter code base. It doesn't seem to provide any value.

But it does cause a storage leak in the Bookkeeper and Zookeeper. It is not as big a deal in Zookeeper, but we have definitely seen Bookkeeper fill up with all of the old artifacts. Especially when on a developer cluster where the same analytic job might be resubmitted multiple times through the development process.

Edit: I figured out my issue. I wasn't running heron test heron/spi/... because I was only running in heron/uploaders/.... :)

nwangtw commented 2 years ago

Yeah. It sounds a little strange. Maybe the logic is assuming manual cleanup or some kind of TTL?