Elektrobit / flake-pilot

Registration/Control utility for applications launched through a runtime-engine, e.g containers
MIT License
9 stars 5 forks source link

Better error handling #133

Closed Ichmed closed 1 year ago

Ichmed commented 1 year ago

Builds on top of #131

Replaces all unwrap(), panic!() and exit() calls in podman.rs with Errors that are propagated to back up to the main function.

  1. Makes function API cleaner and more consistent
  2. Clearly marks which functions can error
  3. Result is a 'must_use' type, so errors can no longer be missed on accident
  4. Enables cleanup of an error state in the future by not terminating immediately

Since all errors simply propagate back up to main() the external behavior should be completely unchanged except for some hard coded error messages (could be added back in if needed).

Sorry for the big diff, but this time it's not possible to make it smaller since almost all return types were changed. Diff could be reduced in size by only adding Result to some methods, but this would only needlessly complicate the process in my eyes.

schaefi commented 1 year ago

Thanks much, a very valuable change. I understand that it cannot be done smaller. Still not so easy to review though. Actually I was looking only on behavior change because I knew in some cases errors were ignored because, either ok to continue or required to keep the subsequent cleanup code running.

It's hard to check if the changes does not introduce a behavior change in this regard. I guess we only find out by testing. It looks good to me but can we go to the following sequence of tests:

I think there is more but I know in case of errors on the above a cleanup sequence has to run

I will do testing too... but need time and I'm also on vacation next week. I'll try to check on it the next days but might be blocked due to other work

Ichmed commented 1 year ago

I changed the behaviour of gc, it should now work the same as before.

But "not crashing on gc" being the problem seems to hint at a deeper problem, no? Because the pilot should work independently of the garbage collection, if it can't access the directory it should crash at a later point anyway.

Ichmed commented 1 year ago

The failing Github Action is the result of an update to https://crates.io/crates/strip-ansi-escapes on 08.08.2023 that changed the return type of strip from Result<Vec<u8>> to Vec<u8>

schaefi commented 1 year ago

I changed the behaviour of gc, it should now work the same as before.

But "not crashing on gc" being the problem seems to hint at a deeper problem, no? Because the pilot should work independently of the garbage collection, if it can't access the directory it should crash at a later point anyway.

I agree with you a refactoring of that part makes sense just not in this pull request please :)

schaefi commented 1 year ago

The failing Github Action is the result of an update to https://crates.io/crates/strip-ansi-escapes on 08.08.2023 that changed the return type of strip from Result<Vec<u8>> to Vec<u8>

Hmm, this killed the bill of materials creation. I wondered why apt install cargo leads to a recompilation of cargo. As the recompile fetches from the crates index and then fails to compile due to the change you pointed out, it sounds relatively severe doesn't it ?

Other than disable the sbom action I don't see any fix that we could provide. What's your take on that ?

changed the return type of strip from Result<Vec<u8>> to Vec<u8>

shouldn't this sort of changes be forbidden ? I mean that kills many code or am I mistaken ?

schaefi commented 1 year ago

ok, I repeated my test with the changes you did. So if I now run the as normal user I get

ms@asterix:/home/ms
> ./ubuntu 
[2023-08-09T08:54:33Z ERROR podman_pilot] Permission denied (os error 13)

This is ok but to be honest a weak error message. Prior the change this was a panic call and it included more details that was also controllable via RUST_BACKTRACE. So we get this

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }', podman-pilot/src/podman.rs:881:59

So this is also not a nice error message but imho better as what we have now ;) I think if we change the mechanics here the error message itself needs to be more clear

schaefi commented 1 year ago

This was the first finding. The next finding is more severe because the app now doesn't start if the permissions were available. So when I call

ms@asterix:/home/ms
> sudo ./ubuntu 
✓ Launching flake            

==> here it hangs

I had to hit Ctrl-C to get out of the hanging state

schaefi commented 1 year ago

@Ichmed when you are coding these changes do you actually test the tool ? In case you have issues testing let me know. Thanks

Ichmed commented 1 year ago

Usually yes, but i did not have time yesterday and i still need some help with the more complex setups.

As for the breaking change, yes that's a semver violation afaik since it changes the public API.

The error messages can be as detailed as we want of course, but maybe we should move them to a separate PR since the big ugly change here is turning everything into a Result, changing the error types or even just messages later is a much smaller change.

@schaefi what images are you using to test with? I'm also currently working on a test suit that could use some simple images that are publicly available

schaefi commented 1 year ago

The error messages can be as detailed as we want of course, but maybe we should move them to a separate PR since the big ugly change here is turning everything into a Result, changing the error types or even just messages later is a much smaller change.

ok, I can live with that

@schaefi what images are you using to test with? I'm also currently working on a test suit that could use some simple images that are publicly available

I either use the information from the quickstart here: https://github.com/Elektrobit/flake-pilot#oci or for testing of deltas and layers I have setup a project with images and containers publicly. So you can do the following as an example:

sudo flake-ctl podman pull --uri registry.opensuse.org/home/marcus.schaefer/delta_containers/containers_suse/joe

flake-ctl podman register \
    --container joe \
    --target /usr/bin/joe \
    --app $HOME/joe \
    --base registry.opensuse.org/home/marcus.schaefer/delta_containers/containers_suse/basesystem

sudo $HOME/joe

This will auto fetch the base container, setup the delta through the pilot and runs the joe editor

It's a good test for quite some functions of the pilot and everyone can do it because it's all public

Hope this helps

Thanks

schaefi commented 1 year ago

As for the breaking change, yes that's a sever violation afaik since it changes the public API.

ok so we are on the same page :) What should we do ? I suggest a commit here that disabled the SBOM action. We can enable it again when things starts to work upstream. I cannot imagine that they won't fix it ;)

@Ichmed I also noticed you are still working from a fork. Do you have a particular reason to do so ? You should have full access permissions to work from branches of the upstream git. I think this would be a bit easier also for me to test changes. But I of course leave this up to you

Ichmed commented 1 year ago

Yes we should just ignore the action for now. To be honest the crate version should be yanked, i will see if i can contact the maintainer or raise an issue if there isn't already. EDIT: They are already on it

No particular need for the fork, I'm just used to doing it that way so the main repo isn't full of half baked feature branches but i still have my code somewhere else than local.

Ichmed commented 1 year ago

I found the error, using Command::output() instead of Command::status() blocks because the program doesnt not send EOF, replacing perform() with status() fixes this but i will create a more sophisticated solution tomorrow