IBM / core-dump-handler

Save core dumps from a Kubernetes Service or RedHat OpenShift to an S3 protocol compatible object store
https://ibm.github.io/core-dump-handler/
MIT License
131 stars 40 forks source link

Allow to disable ZIP compression & fix bug in overwriting info files #114

Closed stonemaster closed 1 year ago

stonemaster commented 1 year ago

This MR fixes two issues seen in a production environment:

No9 commented 1 year ago

Thanks for this PR. My initial reaction is that if the zip is corrupting the core dumps then it should just be turned off rather than having a flag.

That said I'd like to investigate it further if possible. Do you know the the size of file that is failing along with the OS of the host so I can try to reproduce?

Also the merge is blocked as cargo fmt needs to be ran on the code.

stonemaster commented 1 year ago

Thanks for the comment! I just ran cargo fmt and updated the PR.

Your reaction is completely right. I just though I wouldn't want to disable it for everybody because for some sizes it probably works - otherwise I would have suspected much more bug reports. In my case it is a 8GB core dump on an Ubuntu 22.04 machine.

No9 commented 1 year ago

Thanks for the update. Let me see if I can reproduce this over the weekend and get back to you.

stonemaster commented 1 year ago

Hi there, are there any updates on getting this reproduced on your end? Let me know if you need more information!

No9 commented 1 year ago

Haven't had the bandwidth to look at it yet. Too much happening at this time of year. If you have some bandwidth can you create a large core file (could be just random data) and replace test.core then run the tests to see if it reproduces? Thanks

stonemaster commented 1 year ago

I followed your idea and ran the tests locally and I can produce a failed unit test with a random core file of size 512M:

base64 /dev/urandom | head -c 512M >./core-dump-composer/mocks/test.core

Test output:

---- default_scenario stdout ----
The current directory is <...>
Running tests using this PATH: <...>
timeout

ReadDir("./output")
thread 'default_scenario' panicked at 'assertion failed: `(left == right)`
  left: `8`,
 right: `1`', core-dump-composer/tests/default.rs:162:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

failures:
    default_scenario

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 120.01s

error: test failed, to rerun pass '-p core-dump-composer --test default'

A core file of 256M was working fine BTW.

As a side note: for the tests I always needed to clear the folder core-dump-composer/output otherwise I was getting some weird hickup in the test. Probably not a problem for CI but just wanted to mention it.

No9 commented 1 year ago

Excellent thanks for looking into this and providing reproduction steps. The hickup is intentional for a failed test. It's good to have the failed artifacts when it breaks but it should clean up if it's successful - I'll check it when I look into it.

stonemaster commented 1 year ago

Also: if I change the tests to use the new -D flag introduce with my MR the testcases run through with a 512MB test.core.

No9 commented 1 year ago

Found the problem - The default time out for the thread is 120 seconds and the zip for 512M is taking longer than that. I've bumped the timeout to 600 seconds and the tests now pass with the 512M Can you set the default time out to 600 or 10 mins as part of this PR please. https://github.com/IBM/core-dump-handler/blob/main/core-dump-composer/src/config.rs#L62

Given the impact of compression I think having the option to disable it is a good feature to have so I'm happy to land this PR.

@timbuchwaldt any additional thoughts on the timeout setting change?

stonemaster commented 1 year ago

Okay perfect, thanks for your help! I just pushed a commit that changes the default for the timeout parameter.

No9 commented 1 year ago

Thanks for the change. Can you apply the signoff for DCO please Instructions in this link. https://github.com/IBM/core-dump-handler/pull/114/checks?check_run_id=9292246336

I'll look to merge this over the weekend.

stonemaster commented 1 year ago

Yes, sorry, done! Thank you a lot for taking care!

stonemaster commented 1 year ago

One question: is there a plan when this going to be released in a new minor release? Thanks.

No9 commented 1 year ago

Hey @stonemaster I've made some changes to this PR in order to make it configurable from the chart. These are all rolled up into https://github.com/IBM/core-dump-handler/pull/116 If the changes are ok for you I'll roll a release. You should be able to validate that release by checking out the pr and running it as the image will also be built