bartelink opened this issue 6 years ago
I had not read this when I posted a similar sentiment in EventStore/EventStore#1586. Basically, I have two scenarios: request/response services and subscribers. The default ES connection seems tuned for the reliability of subscriptions, but it may not be a great default for request/response.
Ironically, I don't really want to use subscribers as designed either, because of their push-based semantics. I'd rather use pub/sub for notification only, then let the service choose when to read the full events with payloads for processing. For example, a standard view would receive notifications (data about the event without the payload) and just track, say, the last stream position it saw; the service can then hit the HTTP endpoint to pull the next batch of events when it gets the opportunity. You can use the current subscription design for this, but the API leads devs to try to actually process the event at the moment they're notified. That's fine for some things, but it leads to additional complexity, like the need to handle overflow errors and subscription drops when processing takes too long.

I'm working on some arch changes, and I'm planning to have one TCP subscriber (per ES instance) posting just notification data to an SNS or IoT Message Broker topic, and let listeners choose when to hit the ES HTTP APIs to get the events they care about. I feel like this makes it easier to "do the right thing" when writing a service that needs to be notified of events. But we will see.
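Roughly what I have in mind, as a sketch against the EventStore.ClientAPI TCP client. `PublishNotificationAsync` is a hypothetical stand-in for the SNS/IoT broker publish (not a real API), and the connection string is illustrative:

```csharp
using System;
using System.Threading.Tasks;
using EventStore.ClientAPI;

// One TCP subscriber per ES instance, forwarding notification data only
// (stream id + event number, no payload); listeners pull full events via
// the ES HTTP API when they get the opportunity.
var conn = EventStoreConnection.Create(new Uri("tcp://admin:changeit@localhost:1113"));
await conn.ConnectAsync();

var subscription = await conn.SubscribeToAllAsync(
    false, // resolveLinkTos: notifications carry coordinates, not payloads
    async (_, evt) =>
    {
        // Deliberately ignore evt.Event.Data; consumers read payloads over HTTP
        await PublishNotificationAsync(evt.OriginalStreamId, evt.OriginalEventNumber);
    },
    (_, reason, ex) => Console.WriteLine($"Subscription dropped: {reason} {ex}"));

// Hypothetical stand-in for the SNS / IoT Message Broker publish call
static Task PublishNotificationAsync(string stream, long eventNumber) =>
    Task.CompletedTask;
```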
@bartelink, I moved this to the .NET TCP Client repo, since the client was split out some time ago. The case you describe is valid; we need to think about whether we'll have enough capacity to tackle it, as we want to phase out the TCP client in the near future.
(NB this is based on me reading the source and some speculative extrapolation from prod scenarios I've been analyzing that I've yet to completely validate; I may be looking at this problem wrong and/or have misinterpreted the behavior, and am happy to be corrected. But for now I'll barge in, assuming my underlying assumptions are correct.)
For me, the default semantics of the config are slightly surprising: the lack of a server leads to a retry chain by default, which one neutralizes via `FailOnNoServerResponse`. While I can appreciate that, in a well designed overall system, one should not have any such calls in a request processing path, in some parts of the real world such a view would take some explaining.

I can appreciate that the OOTB config, if one does not use `FailOnNoServerResponse` (being, as far as I can infer, `.SetOperationTimeoutTo(TimeSpan.FromSeconds(7)).LimitRetriesForOperationTo(10)`), would play out well in terms of making a batch processor tolerant of a network interruption.

However, AIUI, the impact of the default config in a request processing scenario is that a connection outage can lead to an accumulation of requests taking abnormal lengths of time (operation attempts (default 10) * operation timeout per request (default 7s), i.e. up to 70s each), which can swamp a request processing pipeline that does not have adequate backpressure dampening the effect.
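To make the contrast concrete, a sketch of the two configs as I read them (builder method names per `EventStore.ClientAPI.ConnectionSettings`; the default values shown are my inference from the source, per the above):

```csharp
using System;
using EventStore.ClientAPI;

// What I infer the out-of-the-box behavior to be: retry up to 10 times with a
// 7s operation timeout, i.e. a worst case of ~10 * 7s = 70s per request while
// the server is unreachable. Good for a batch processor riding out a blip.
var batchTolerant = ConnectionSettings.Create()
    .SetOperationTimeoutTo(TimeSpan.FromSeconds(7))
    .LimitRetriesForOperationTo(10)
    .Build();

// What a request/response service has to opt into today to fail fast instead
// of queueing retries behind a dead connection:
var failFast = ConnectionSettings.Create()
    .SetOperationTimeoutTo(TimeSpan.FromSeconds(7))
    .FailOnNoServerResponse()
    .Build();
```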
I could make a case for flipping the default behavior to `FailOnNoServerResponse`, but won't, as that's both a significant change and arguably just not the right approach (I'm not sure how I feel about defaults that trigger exceptions during reasonably correct use of the system in order to, just in time, prompt one to consider whether one should definitely opt into a behavior one might not have considered).

What I'd suggest instead is to at least offer a `DoNotFailOnNoServerResponse` (there is precedent for exposing both sides of a bool that has a default, in `PerformOnAnyNode` vs `RequireMaster`). This would allow the xmldoc and the docs pages to enumerate the key aspects of choosing a particular config more explicitly than they presently do.

If some direction can be provided as to whether this is a good idea, I'm open to doing a PR to show exactly what I mean.
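For illustration, a hypothetical sketch of the shape of the addition, mirroring the `PerformOnAnyNode`/`RequireMaster` pairing (the backing field name is assumed; this is not the actual builder source):

```csharp
// Hypothetical additions to ConnectionSettingsBuilder:
public ConnectionSettingsBuilder FailOnNoServerResponse()
{
    _failOnNoServerResponse = true; // existing opt-in: fail the operation fast
    return this;
}

public ConnectionSettingsBuilder DoNotFailOnNoServerResponse()
{
    _failOnNoServerResponse = false; // the current default, but now explicit,
    return this;                     // with xmldoc spelling out the trade-off
}
```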