Error management and parallelization in Pull and Push forms

ggalmazor commented 5 years ago

This issue tries to document the root causes of some Sentry error reports we've been having. The following notes gather what has surfaced while studying these error reports, which involves how we deal with parallelism and errors in the pull and push operations, among other things.

We currently use a combination of executor services, futures and callables.
Executor services are initialized arbitrarily, as opposed to how the more up to date ForkJoinPool works, which automatically adapts to the host machine's resources.
Information about the specific checked exceptions that the callables throw get lost because the executor services deal with the Exception superclass only.
It's not clear how exceptions occurring in secondary threads from the executor services escalate to the main thread, nor how to control them before they reach it.

It's going to be hard for us to fine tune error management around pull and push in the current scenario because investing time and effort in improving what now (after Java8) has become an obsolete parallelism and error management paradigm doesn't look like the best thing to do.

Going forward, I'd propose we refactor Briefcase to:

Move away from checked exceptions
Use ForkJoinPool/parallel Streams for parallelism
Use the API around CompletableFuture for error management and operation cancellation

We have already made a step in this direction regarding the export operation. Doing this on the pull and push operations feels like the right thing to do.

batkinson commented 5 years ago

The issue doesn't offer enough context for me to know what those reports you're talking about are really about. However, with what limited context I have from this: you may want to be careful with some of your assumptions. For example, I know that the concurrency for the pull operation (at least at the time I wrote it) was not chosen arbitrarily. The concurrency limit was chosen to give some benefit of concurrent pulls, while capping to a limit that was difficult to change without some effort (yes, intentionally if you can believe it). The multiplier chosen was chosen after testing alternatives on large workloads against Aggregate. Mitch warned that going concurrent could cause problems (at least with Aggregate on AppEngine) and that push should be serial. Anyway, the key point I want to make is that you should be careful to look at this from the system-perspective in addition to just looking at an app (Briefcase) in isolation. One or more fully concurrent briefcase clients could very easily make a request-per-thread server unresposive. The teams I am supporting often have forms with submission counts in the 100k range in production (which is why we needed to speed up pulls to begin with). Anyway, just a heads-up.

batkinson commented 5 years ago

Also, I forgot to mention that if you intend to use persistent http connections to limit the connection negotiation overhead (which I would recommend), you'll need to make sure whatever scheme you use has enough persistent connections to work with. Also be aware that thare may be server-enforced limits on that number per client.

ggalmazor commented 5 years ago

Thanks, @batkinson, this is very valuable information!

Please, rest assured that I'm not looking into this from Briefcase's perspective only. I know there have to be some constraints in place while interacting with Aggregate, although we should start looking at ODK Central as well.

In any case, the goal of this issue would be to address problems with the design and tools used for solving parallelism and error management, not to change the operative constraints of Briefcase (at least while they're not proven to be wrong, which is not what I'm trying to do here).

The concurrency limit was chosen to give some benefit of concurrent pulls, while capping to a limit that was difficult to change without some effort (yes, intentionally if you can believe it). The multiplier chosen was chosen after testing alternatives on large workloads against Aggregate.

I'm confused after reading this part of your comment, though. Currently, when pulling forms, we're initializing the executor service with number of cpus x 3 threads. This is the specific code that deals with that:

// WebUtils.java
static final int MAX_CONNECTIONS_PER_ROUTE = Runtime.getRuntime().availableProcessors() * 3;

// ServerFetcher.java
private ExecutorService getFetchExecutorService() {
  int downloadThreads = BriefcasePreferences.getBriefcaseParallelPullsProperty()? MAX_CONNECTIONS_PER_ROUTE : 1;
  return Executors.newFixedThreadPool(downloadThreads, new DownloadThreadFactory());
}

In my machine, this would mean 36 threads. I don't understand how this makes sense in terms of putting in place some safe cap to prevent overloading an Aggregate instance.

Also, I would presume that the cap should depend on GAE instance types, number of instances, etc.

I'm sure I'm the one not understanding something.

Could you elaborate?

batkinson commented 5 years ago

That looks a bit crude, right? Admittedly, it is.

My point was that it was not arbitrary. I suspect there is more visibility into common deployments now, but at the time it was difficult to guess where and how the client was being used. My follow-on work later lead me believe it may not have been used widely due to some serious defects. Anyway, that calculation was an attempt to make it work on low-cpu count machines (VMs aren't likely to have 12 cores dedicated to them) and making sure a single machine can't swamp a server (in which case 36 threads is still under what a typical server can manage). It was meant to be relatively safe, but providing some speedup on most systems, without providing an option for users to increase (concurrency is faster, and faster is always better, so I should always max it out! Turbo-power!). A full-blown UI around tuning concurrency settings or even user settings didn't make sense for a feature that I sort of felt discouraged to implement since it was deemed as a bit "naughty" at the time. Things may have changed since then, whether by change of guard or backend assumptions: AppEngine or not, Aggregate or Central.

To add more context, I am not using an ODK server backend, and so it doesn't handle writes like Aggregate on AppEngine. Because of that we have had a fork that allows concurrent push as well (though push is significantly slower in practice).

Anyway, I wasn't trying to discourage attempting to improve on what is there. I just wanted to make you aware that the choice wasn't arbitrary. It was meant to be a simple toggle, while attempting to balance the multiple constraints.

ggalmazor commented 5 years ago

Thanks again, @batkinson! Your comments are not discouraging at all! Sometimes it's hard for me to reach the context to understand some code decisions and these discussions help me a lot.

batkinson commented 5 years ago

Sometimes it's hard for me to reach the context to understand some code decisions and this discussions help me a lot.

I am sure anyone that's had to work on a legacy system totally understands your perspective. I know I do, which is why I figured chiming in might be helpful. I would love to see better treatment of concurrency and I know what you're up against. It was not easy to wedge it in without making more comprehensive changes. Good luck.

getodk / briefcase

Error management and parallelization in Pull and Push forms #677