Closed magnusbaeck closed 2 years ago
Hello Magnus, thanks for the contribution of this issue. Sorry for the late reply. I do agree with you that REMReM currently is flawed in this aspect, as like other components. This is an aspect that absolutely should be looked into and i would like to add that the circuit breaker pattern might also be a candidate as that is frequently used in service/microservice based architecture for this very purpose. What's your thought on this? 😃
I agree that the circuit breaker pattern would be useful to apply here as long as the complexity of implementation and testing is kept low.
Hi Magnus, I will take up the issue and will improve the retry mechanism for the ER lookup.
@magnusbaeck Do you consider https://github.com/eiffel-community/eiffel-remrem-generate/pull/186 solves this issue?
Yes, thanks. Closing issue.
Description
When performing ER lookups, a connection problem or non-200 response from the ER will result in an immediate retry of the operation, and if that also fails we'll continue and treat it as any empty response.
https://github.com/eiffel-community/eiffel-remrem-generate/blob/2.0.9/service/src/main/java/com/ericsson/eiffel/remrem/generate/controller/RemremGenerateController.java#L146-L159
This is flawed in multiple ways:
REMReM should either give up immediately or make an honest attempt at retrying the operation, which might be hard given that it's doing the lookup in response to an RPC that usually carries a client-side timeout. It might be acceptable to make, say, 4 attempts 5 seconds apart, but after that we should give up anyway since a situation with the client giving up on us isn't helpful to anyone.
Giving up right away and responding with a 503 status code is probably best; clients need to be prepared to deal with such errors anyway, and we're not doing much work before making the ER lookup so we won't be wasting CPU cycles by pushing the retries to the client. As a bonus we can simplify RemremGenerateController.java and won't have to think about writing tests for the retry functionality.
Motivation
Clients using the ER lookup functionality must be able to trust that REMReM Generate behaves correctly when the ER is unavailable.
Benefits
Clearer failure semantics for clients and simpler REMReM Generate code.
Possible Drawbacks
None I can think of.