canonical / multipass

Multipass orchestrates virtual Ubuntu instances
https://multipass.run
GNU General Public License v3.0
7.54k stars 635 forks source link

Thread-safe SSH session and process #3457

Closed ricab closed 2 months ago

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 93.33333% with 5 lines in your changes are missing coverage. Please review.

Project coverage is 88.85%. Comparing base (bb04d60) to head (a26ac42).

Files Patch % Lines
src/ssh/ssh_process.cpp 92.85% 3 Missing :warning:
src/daemon/daemon.cpp 0.00% 1 Missing :warning:
src/sshfs_mount/sftp_server.cpp 85.71% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #3457 +/- ## ========================================== + Coverage 88.83% 88.85% +0.02% ========================================== Files 253 253 Lines 14049 14098 +49 ========================================== + Hits 12480 12527 +47 - Misses 1569 1571 +2 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

ricab commented 3 months ago

OK I am declaring this ready for review, but I would like to hear opinions on something.

As I have it, an SSHSession locks when creating an SSHProcess and passes the lock to the process. The process only releases the lock when someone asks for the exit_code() or when it is destroyed (or if a friend [SftpServer] asks to release the channel).

If someone asks for the exit code again, and if we got it the first time, the known code is returned. However, if we timeout obtaining the exit code the first time, the lock is released and the session can be used by other threads, even though the exit code can be requested again. At that point, the session is no longer thread safe.

I could instead hold the lock until the exit code is successfully obtained and provide a knob for client code to release the lock manually on timeout. But if the client failed to do this, the program would deadlock when trying to destroy the session. I could also invalidate the ssh process and refuse to look for the exit code again, but then we couldn't recover it. What do you think is the best approach for this?

townsend2010 commented 3 months ago

Hey @ricab!

Looks like you're going to need a private side PR: https://github.com/canonical/multipass-private/actions/runs/8543123264/job/23406233450#step:22:627

ricab commented 3 months ago

Ah shoot... OK, I'll try to do that soon.

townsend2010 commented 3 months ago

Hey @ricab!

I've given your question some thought here and IMO, we shouldn't really allow calling exit_status() again after a timeout. A timeout to me means that the SSHProcess is no longer valid and the whole execution should be retried in a new object. And really, I think once exit_status() is called, that should be it regardless if it returned a value or timed out.

I think if the SSHProcess object is still around after calling exit_status() and tries to call exit_status() again, exit_status() should check if it no longer owns the lock and if not, should throw an exception.

We can talk more about about this if this isn't really making sense or if you disagree :slightly_smiling_face:

ricab commented 2 months ago

Hey @townsend2010, thanks for the feedback. OK, I think that equates to what I had in mind with my last option ([...] invalidate the ssh process and refuse to look for the exit code again [...]). The principle makes sense to me, but I would still like to discuss the details and I need to see how to fold that with the SFTPServer (I think it asks for the exit_code twice today).

ricab commented 2 months ago

Hey @townsend2010, I did something along the lines we discussed, although instead of "invalidating" subsequent exit_code calls I just return the result from the first call, be that an actual exit code or one of two new Exitless exceptions. This is more in line with the principle of not discarding results of computations. It avoids yet another exception while preventing any attempts to actually fetch the exit code again. The session is unlocked by then and SSH isn't used again. The new "peeking" method also saves the exit code if it gets one (but not exceptions).

There is one commit that I am quite unhappy about: https://github.com/canonical/multipass/pull/3457/commits/becb2c58a635dcc6ed95140497c6f1b0dea6db8c. Doing better would require rewriting many sftp/sshfs tests. That would deviate too much from what we're after here. Unless you find an easier way to adapt? I found that ExitStateMock does not have "enough resolution", and mocking MP_UTILS.run_in_ssh_session on top of the exinting premocks doesn't work without reworking what is there, let alone mocking SSHProcess/SSHSession directly.

Let me know what you think.

ricab commented 2 months ago

Thanks for the review @townsend2010!

I'd like to test this with the latest main and built as a snap, so could you please rebase this on main and then I will test? Thanks!

I am trying, but having trouble with Flutter at the moment (@andrei-toterman build instructions are sorely missed :wink: )

In the meantime, doesn't the snap build on a main merge?

townsend2010 commented 2 months ago

In the meantime, doesn't the snap build on a main merge?

Yeah, it is, so it's probably ok.

ricab commented 2 months ago

Alright, clean rebase @townsend2010 :point_up_2:

townsend2010 commented 2 months ago

Ok, cool, thanks @ricab!

ricab commented 2 months ago

You're welcome!

townsend2010 commented 2 months ago

Right, need to merge manually...