GDATASoftwareAG / vaas

Verdict-as-a-Service SDKs: Analyze files for malicious content
https://www.gdata.de/business/security-services/verdict-as-a-service
MIT License

Rust/fix concurrent requests #634

Closed pstadermann closed 2 weeks ago

pstadermann commented 3 weeks ago

Please review carefully if you have time, and look out for code that is not idiomatic Rust.

secana commented 3 weeks ago

What does it fix? The title says "concurrent requests", but from the code I'm not able to see what the issue was that is fixed here.

secana commented 3 weeks ago

Do we want to introduce logging in a library? That's not uncommon in Rust, but I'm not too fond of it, as the library user has to manually disable/set the tracing for each library that is used. That leads to beautiful code like this: https://github.com/kellnr/kellnr/blob/df94e8894948094776cbe379c9e683cb07cfdf9c/crates/kellnr/src/main.rs#L202

My expectation for a library is that errors are bubbled up as values and then logged by the library user. But that's not a hill I'll die on.
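For context, this is roughly what that per-library configuration ends up looking like with `tracing_subscriber`'s `EnvFilter` (a minimal sketch; the crate names in the filter string are made-up examples, and `EnvFilter` requires the `env-filter` feature):

```rust
use tracing_subscriber::EnvFilter;

fn main() {
    // Every chatty dependency needs its own directive to be tuned or silenced;
    // the crate names here are only illustrative.
    let filter = EnvFilter::new("info,hyper=warn,tungstenite=off,rustls=off");

    tracing_subscriber::fmt()
        .with_env_filter(filter)
        .init();

    tracing::info!("application logging configured");
}
```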

GermanCoding commented 3 weeks ago

The issue fixed here is that the current implementation is prone to data races when multiple concurrent requests are issued. The broadcast channel used doesn't have strong enough ordering guarantees (the order in which we call subscribe does not necessarily match the order in which responses come in), and as such, verdict responses may inadvertently get mapped to incorrect requests. Also, the implementation has a short time window where the request is already sent off, but the broadcast channel hasn't been subscribed to yet. If the web socket connection is fast enough, this results in a race where the entire verdict response is either dropped or routed to the wrong request. The entire approach is not sound as-is.
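To make that window concrete, here is a minimal, self-contained sketch (using tokio's broadcast channel directly, not the SDK's code) of how a response that arrives before `subscribe()` never reaches the late subscriber:

```rust
use tokio::sync::broadcast;

fn main() {
    // `earlier` stands in for some other request that already subscribed.
    let (tx, mut earlier) = broadcast::channel::<u32>(16);

    // The response for *our* request arrives before we managed to subscribe...
    tx.send(42).unwrap();

    // ...and our receiver, created afterwards, never sees it: a broadcast
    // receiver only observes values sent after `subscribe()` was called.
    let mut ours = tx.subscribe();
    assert!(ours.try_recv().is_err());

    // The response went to whoever was already listening instead.
    assert_eq!(earlier.try_recv().unwrap(), 42);
}
```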

Also, the old channel tends to quickly lose responses when issuing requests at a moderately high rate (~10 req/s), due to the broadcast channel filling up (the capacity used to be configurable, but the new approach fixes the problem altogether).

We cannot bubble up errors from within the web socket handling, as this is code that isn't externally exposed in any way. We only have the choice of either suppressing the errors silently, outright panicking, or logging them. This PR chooses the last option so that there's at least some chance of seeing problems if they do occur. The application can configure the logging if it wants to.
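For illustration only (this is not the SDK's actual reader loop, just a sketch of the constraint): in a background task like the one below, an error has no caller it could be returned to, so logging is what remains besides dropping it or panicking.

```rust
use tokio::sync::mpsc;

// Hypothetical background loop fed by a web socket reader task.
async fn reader_loop(mut frames: mpsc::Receiver<Result<String, std::io::Error>>) {
    while let Some(frame) = frames.recv().await {
        match frame {
            Ok(text) => {
                // ... parse the verdict response and route it to its request ...
                let _ = text;
            }
            // No caller to bubble this up to; logging keeps it visible at least.
            Err(e) => tracing::error!("web socket read failed: {e}"),
        }
    }
}
```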

secana commented 3 weeks ago

Thanks for the explanation. The mapping of the request is done based on the GUID of the message and not based on the ordering; the ordering can be random. I'm not sure the issue this PR tries to fix exists in the first place. As far as I can see, subscribe is always called before the message is sent, so that does not seem to be an issue. But maybe I'm overlooking something. I just want to understand the issue, even if the new code is cleaner and easier to reason about.

I switched from an unbounded channel to a bounded one because the backend died when too many requests were sent; I used the channel capacity as a cheap back-pressure mechanism. If that's no longer a problem, we do not need the safeguard in the client anymore and can switch to an unbounded channel, or, as in this case, a oneshot. I'm not sure whether we run into timeouts if we fully utilize the network connection, since we now allow unlimited parallel requests, but that's easy to test.
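As an aside, a minimal sketch of the back-pressure effect a bounded channel gives, using a tokio `mpsc` channel for illustration (the capacity, timings, and message type are made up, not the SDK's actual setup):

```rust
use std::time::Duration;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Capacity 8 is an arbitrary example value.
    let (tx, mut rx) = mpsc::channel::<u32>(8);

    // A deliberately slow consumer, standing in for a backend with limited throughput.
    let consumer = tokio::spawn(async move {
        while let Some(msg) = rx.recv().await {
            tokio::time::sleep(Duration::from_millis(50)).await;
            println!("processed request {msg}");
        }
    });

    for i in 0..32 {
        // Once 8 messages are in flight, `send` waits for free capacity,
        // throttling the producer instead of overwhelming the consumer.
        tx.send(i).await.unwrap();
    }

    drop(tx); // close the channel so the consumer loop ends
    consumer.await.unwrap();
}
```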

I see the difficulty with the error handling, but I need more time to think about a possible solution. Maybe there is a smart way to get the errors out, but that does not have to happen in this PR.

I like that the PR makes the code a lot cleaner and centralizes the logic for sending and waiting for messages. I'll give it a test when I'm back in the office.

GermanCoding commented 3 weeks ago

The issue is that the order in which subscribe is called is not necessarily related to the order in which requests are uploaded or replied to by the server - if two requests are submitted simultaneously, we call subscribe() in order A, upload in order B, and get verdict responses in order C. There is nothing to ensure that A = B = C. (The order of upload technically doesn't matter, as long as A = C holds). The new code maps by GUIDs instead, which ensures that the order becomes irrelevant and fully parallel processing is possible.
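Roughly, the pattern looks like this (a simplified sketch under the assumption of a shared map of pending requests; `VerdictResponse`, the function names, and the locking strategy are illustrative, not the SDK's actual types):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use tokio::sync::oneshot;
use uuid::Uuid;

// Illustrative response type; the real SDK has its own message structs.
struct VerdictResponse {
    guid: Uuid,
    verdict: String,
}

// All requests that are still waiting for their response, keyed by GUID.
type Pending = Arc<Mutex<HashMap<Uuid, oneshot::Sender<VerdictResponse>>>>;

// Requester side: register the oneshot sender *before* the request goes out,
// so a fast response can never arrive while nobody is listening.
fn register_request(pending: &Pending, guid: Uuid) -> oneshot::Receiver<VerdictResponse> {
    let (tx, rx) = oneshot::channel();
    pending.lock().unwrap().insert(guid, tx);
    // ... serialize the request and send it over the web socket here ...
    rx
}

// Reader side: route each incoming response to the request with the matching
// GUID, regardless of the order in which requests were issued or answered.
fn route_response(pending: &Pending, response: VerdictResponse) {
    if let Some(tx) = pending.lock().unwrap().remove(&response.guid) {
        // Ignoring the error: the requester may have timed out and dropped its receiver.
        let _ = tx.send(response);
    }
}
```

The key property is that the sender is registered in the map before the request goes out, so a response can never arrive while nobody is waiting for it, and the GUID lookup makes the response order irrelevant.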

pstadermann commented 2 weeks ago

What does this fix: the current version of the VaaS Rust SDK fails sporadically if one instance is used concurrently. This becomes apparent with the new gscan release (see the next PR).

pstadermann commented 2 weeks ago

> The oneshot implementation is a lot cooler

We know :)