fortanix / rust-sgx

The Fortanix Rust Enclave Development Platform
https://edp.fortanix.com
Mozilla Public License 2.0

Async usercall interface for SGX enclaves #515

Closed vn971 closed 8 months ago

vn971 commented 1 year ago

Entering and exiting an SGX enclave is costly in terms of performance. It's much more efficient to continue executing within the enclave and communicate with the enclave-runner by passing messages. The tokio runtime can be used for such asynchronous communication. This PR provides very basic support for this in EDP, but changes to mio and tokio still need to be upstreamed. These changes are fully backwards compatible; your existing enclaves will continue to run as expected.
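To illustrate the message-passing idea described above, here is a minimal conceptual sketch that uses only the Rust standard library. It is not the EDP async usercall API: the `Request` and `Completion` types, the function name, and the use of `std::sync::mpsc` channels and threads as stand-ins for the enclave and the enclave-runner are all assumptions made for illustration. The point it shows is that one side can queue several requests before collecting any results, instead of making one round trip per request.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical request type; not part of the EDP usercall ABI.
#[derive(Debug)]
enum Request {
    Write { id: u64, data: Vec<u8> },
    Shutdown,
}

/// Hypothetical completion reported back by the simulated runner.
#[derive(Debug, PartialEq)]
struct Completion {
    id: u64,
    bytes_written: usize,
}

/// Simulates the exchange: the "enclave" side queues all requests up front
/// and only then collects completions, so it never waits per request.
fn run_simulation(payloads: Vec<Vec<u8>>) -> Vec<Completion> {
    let (req_tx, req_rx) = mpsc::channel::<Request>();
    let (done_tx, done_rx) = mpsc::channel::<Completion>();

    // The "runner" drains the request queue and posts one completion per write.
    let runner = thread::spawn(move || {
        for req in req_rx {
            match req {
                Request::Write { id, data } => {
                    let completion = Completion { id, bytes_written: data.len() };
                    done_tx.send(completion).unwrap();
                }
                Request::Shutdown => break,
            }
        }
    });

    // The "enclave" submits everything without blocking on any single reply.
    let count = payloads.len();
    for (i, data) in payloads.into_iter().enumerate() {
        req_tx.send(Request::Write { id: i as u64, data }).unwrap();
    }
    req_tx.send(Request::Shutdown).unwrap();

    // Collect exactly one completion per submitted request.
    let completions: Vec<Completion> = done_rx.iter().take(count).collect();
    runner.join().unwrap();
    completions
}
```

Calling `run_simulation(vec![b"ab".to_vec(), b"cde".to_vec()])` returns one `Completion` per payload, in submission order, because the single simulated runner thread processes the queue sequentially.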

Credits for this PR go to:
Mohsen: https://github.com/fortanix/rust-sgx/pull/404
YxC: https://github.com/fortanix/rust-sgx/pull/441

This commit is an attempt to have the async-usercalls finally merged into the main codebase (master branch).

raoulstrackx commented 11 months ago

Added one last comment related to MakeSend and ticket #530, but will approve merge once this passes further testing.

DragonDev1906 commented 11 months ago

Short Questions (I have not read all the changes):

At the moment I'm not sure what exactly is meant by "async usercall interface".

raoulstrackx commented 11 months ago

Good question @DragonDev1906, I've updated the description of this PR to make things clearer. Let me know if you still have questions. This PR doesn't have examples, but we'll add some once the changes to mio and tokio have been upstreamed and things are easier to use.

DragonDev1906 commented 11 months ago

Nice, I've had a few issues with dependencies that rely on tokio with the net feature, which made it impossible to use them. Thank you for the clarification.

I do have two more questions (though I'm not sure if this is the right place to ask them): At the moment I only have sync code in the enclave, with a custom runner (using tokio and handling TLS termination where I don't need it in the enclave) responsible for pushing data received from other systems to the enclave. Basically, I'm just sending a continuous list of commands with data and processing any results returned from the enclave.

I plan to test the throughput of those options, but perhaps you already have some experience or suggestions as to which option may be the slowest/most inefficient. I'm especially curious about library mode, and whether such a system would even benefit from the async usercall interface changes. (It could also be useful to have such a comparison of communication options somewhere in the docs.)

(so many questions, sorry)

raoulstrackx commented 11 months ago

No worries @DragonDev1906

Will enclaves without async code, where the runner doesn't have to wait for the enclave to finish before sending the next command benefit from this change?

No, without changes to your code, this PR doesn't have any impact for you.

... which implementation is best for the above stated situation... i. Communicate via TCP (no custom runner needed), likely slow because it needs to go into kernel space

If you use the changes in this PR to build an async enclave, your code will be a bit more readable. The biggest change would be that you don't need to enter/exit the enclave to request new commands/return responses. If the enclave is computationally expensive, the performance benefit of that may be minimal. Async code works best when it no longer blocks on I/O, but can do something useful while it waits for some event. Based on your description, you may already be doing that with a custom runner.
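As a rough illustration of "doing something useful while waiting", here is a small standard-library sketch. It is only an analogy: it polls a channel with `try_recv` rather than using an async runtime or any EDP interface, and the function name, the 20 ms delay, and the `"response"` payload are all made up for the example. A background thread stands in for slow I/O handled elsewhere, and the caller keeps computing until the result arrives.

```rust
use std::sync::mpsc::{self, TryRecvError};
use std::thread;
use std::time::Duration;

/// Sums `work` while a simulated I/O result is pending.
/// Returns the I/O result together with the computed sum.
fn overlap_compute_with_io(work: &[u64]) -> (String, u64) {
    let (tx, rx) = mpsc::channel();

    // Stand-in for a slow I/O operation handled outside the caller.
    let io = thread::spawn(move || {
        thread::sleep(Duration::from_millis(20));
        tx.send(String::from("response")).unwrap();
    });

    let mut sum = 0u64;
    let mut remaining = work.iter();
    let response = loop {
        match rx.try_recv() {
            // The result is ready: stop polling.
            Ok(r) => break r,
            // Nothing has arrived yet: do one unit of compute instead of blocking.
            Err(TryRecvError::Empty) => match remaining.next() {
                Some(v) => sum += v,
                // Out of work: let other threads run before polling again.
                None => thread::yield_now(),
            },
            Err(TryRecvError::Disconnected) => {
                panic!("I/O thread exited without replying")
            }
        }
    };

    // Finish whatever compute was still left when the result arrived.
    sum += remaining.sum::<u64>();
    io.join().unwrap();
    (response, sum)
}
```

If the workload is dominated by the compute itself, as noted above, there is little idle waiting time to overlap and the gain from this pattern shrinks accordingly.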

ii. Communicate via the existing Usercall Extensions

See previous answer

iii. Communicate via the async usercall interface (unless that only makes sense when the enclave itself runs async code).

Yes, that only makes sense if the enclave runs async code.

iv. Use the enclave in library mode.

That seems unrelated to whether you write sync or async code.

DragonDev1906 commented 11 months ago

Biggest change would be that you don't need to enter/exit the enclave to request new commands/return responses.

Just to see if I understood that correctly: The changes in this PR (when using the new async interface) are going to mean that multiple usercalls can/will be batched into a single ECALL (enter/exit), with the ability to use async code to send multiple usercalls without waiting for the response. But there still needs to be at least one ECALL (for the entire batch) before the runner can process the usercall, and the same for the way back, correct?


Just some info if you're interested, @raoulstrackx:

Based on your description, you may already be doing that with a custom runner.

Yeah, my enclave is not waiting for any responses for requests sent out (that's handled outside the enclave) and only blocks while trying to read new commands (currently via TCP) or writing results (also via TCP), but new commands don't depend on previous results unless something goes wrong.

If you use the changes in this PR to build an async enclave, your code will be a bit more readable. [...] Based on your description, you may already be doing that with a custom runner.

I've thought about implementing it in an "enclave requests the data and waits for the response" way, where async usercalls would likely be a big performance benefit and/or be a lot more readable. My conclusion was that there is a rather big trade-off:

I'm not yet sure if this architecture is going to bite me at some point. It's good to know that there will be an efficient way to implement it in an "enclave asks for data" way, should the need arise because a complete separation of fetching and logic gets too difficult.

raoulstrackx commented 11 months ago

@DragonDev1906 sorry I forgot to reply to your comment.

The changes in this PR (when using the new async interface) are going to mean that multiple usercalls can/will be batched into a single ECALL (enter/exit),

Strictly speaking: yes, but I think you misunderstood a bit how EDP is expected to be used. The idea is to run an entire application in the enclave. So the single ecall you refer to comes from the enclave-runner calling the enclave for the very first time. This eventually leads to the enclave calling your main function within its boundaries. Then all usercalls can be done asynchronously from within the enclave. See also the enclave execution lifecycle.

For questions/comments not specifically related to this PR, let's switch to the #rust-sgx channel in the Runtime-Encryption Slack workspace.