NVIDIA / NVFlare

NVIDIA Federated Learning Application Runtime Environment
https://nvidia.github.io/NVFlare/
Apache License 2.0

[2.4] Release RM request receiver after finished #2667

Closed yanchengnv closed 1 month ago

yanchengnv commented 1 month ago

Fixes # .

Description

This PR fixes a potential memory issue on the receiving side of ReliableMessage (RM).

When a request is received, a RequestReceiver object is created and kept in a table. Currently, the RequestReceiver object stays in the table even after the request is finished. RM has a monitoring process that eventually releases the object from the table, but this can take many minutes, depending on how tx_timeout is configured. Until the monitoring process removes them, all reply messages (which can be very big) remain in the table. This could cause OOM if many requests arrive in quick succession.

This PR solves this problem by releasing the RequestReceiver object immediately after the request is done and the reply has been sent successfully. If the reply cannot be sent successfully, the RequestReceiver object remains in the table so that the requester can query it later, but it stays there only for tx_timeout seconds.
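
For illustration only, the release policy described above can be sketched as follows. This is not the actual NVFlare code: `ReceiverTable`, `finish`, `query`, `monitor`, and the `send_reply` callback are hypothetical names, and the injectable `clock` exists only to make the timeout easy to test. The only name taken from the description is `tx_timeout`.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ReceiverTable:
    """Hypothetical stand-in for the table that tracks per-request receivers."""

    def __init__(self, tx_timeout: float, clock: Callable[[], float] = time.monotonic):
        self.tx_timeout = tx_timeout
        self._clock = clock
        # tx_id -> (reply payload, time at which the request finished)
        self._entries: Dict[str, Tuple[bytes, float]] = {}

    def finish(self, tx_id: str, reply: bytes, send_reply: Callable[[bytes], bool]) -> bool:
        """Try to deliver the reply; keep it in the table only if delivery fails."""
        delivered = send_reply(reply)
        if delivered:
            # Success: release the entry right away so the reply can be freed.
            self._entries.pop(tx_id, None)
        else:
            # Failure: keep the reply so the requester can query for it later.
            self._entries[tx_id] = (reply, self._clock())
        return delivered

    def query(self, tx_id: str) -> Optional[bytes]:
        """Return the retained reply for tx_id, or None if nothing is retained."""
        entry = self._entries.get(tx_id)
        return entry[0] if entry else None

    def monitor(self) -> None:
        """Periodic cleanup: evict retained entries at least tx_timeout seconds old."""
        now = self._clock()
        expired = [k for k, (_, t) in self._entries.items() if now - t >= self.tx_timeout]
        for k in expired:
            del self._entries[k]

    def __len__(self) -> int:
        return len(self._entries)
```

With this policy, the table holds only replies that failed to deliver, so a burst of successful requests no longer accumulates large reply payloads while waiting for the monitor.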

Types of changes