eiffel-community / eiffel-remrem-generate

Apache License 2.0
8 stars 70 forks source link

ER lookup retry strategy is flawed #152

Closed magnusbaeck closed 2 years ago

magnusbaeck commented 4 years ago

Description

When performing ER lookups, a connection problem or non-200 response from the ER will result in an immediate retry of the operation, and if that also fails we'll continue and treat it as any empty response.

https://github.com/eiffel-community/eiffel-remrem-generate/blob/2.0.9/service/src/main/java/com/ericsson/eiffel/remrem/generate/controller/RemremGenerateController.java#L146-L159

This is flawed in multiple ways:

REMReM should either give up immediately or make an honest attempt at retrying the operation, which might be hard given that it's doing the lookup in response to an RPC that usually carries a client-side timeout. It might be acceptable to make, say, 4 attempts 5 seconds apart, but after that we should give up anyway since a situation with the client giving up on us isn't helpful to anyone.

Giving up right away and responding with a 503 status code is probably best; clients need to be prepared to deal with such errors anyway, and we're not doing much work before making the ER lookup so we won't be wasting CPU cycles by pushing the retries to the client. As a bonus we can simplify RemremGenerateController.java and won't have to think about writing tests for the retry functionality.

Motivation

Clients using the ER lookup functionality must be able to trust that REMReM Generate behaves correctly when the ER is unavailable.

Benefits

Clearer failure semantics for clients and simpler REMReM Generate code.

Possible Drawbacks

None I can think of.

erik-edling commented 4 years ago

Hello Magnus, thanks for the contribution of this issue. Sorry for the late reply. I do agree with you that REMReM currently is flawed in this aspect, as like other components. This is an aspect that absolutely should be looked into and i would like to add that the circuit breaker pattern might also be a candidate as that is frequently used in service/microservice based architecture for this very purpose. What's your thought on this? 😃

magnusbaeck commented 4 years ago

I agree that the circuit breaker pattern would be useful to apply here as long as the complexity of implementation and testing is kept low.

zburswa commented 3 years ago

Hi Magnus, I will take up the issue and will improve the retry mechanism for the ER lookup.

m-linner-ericsson commented 2 years ago

@magnusbaeck Do you consider https://github.com/eiffel-community/eiffel-remrem-generate/pull/186 solves this issue?

magnusbaeck commented 2 years ago

Yes, thanks. Closing issue.