cmu-delphi / covidcast

R and Python packages supporting Delphi's COVIDcast effort.
https://delphi.cmu.edu/covidcast/

Parallelize covidcast_signal()? #279

Closed huisaddison closed 3 years ago

huisaddison commented 4 years ago

I noticed that covidcast_signal() can be slow, but the operation is easily parallelizable with a tool like parallel::mclapply(). (I realized this after wrapping single-day covidcast_signal() queries in such a parallel operation; I get roughly an n_core-times speedup.)

I quickly sketched draft PR #278 to give an idea of how we could build this into the package. If it's a good idea (one reason it might be a bad idea: parallelism is finicky enough that we may prefer to leave it to the end user), then I (or somebody else interested) can put time into building and testing it.

capnrefsmmat commented 4 years ago

So #275 will independently speed up API requests for anything other than county level, by letting us batch requests across days.

Did you test this and find that it actually speeds things up? It seems like it would depend on where the bottleneck is. If you're already downloading data as fast as your Internet connection can handle, this won't help. If the server is providing data more slowly than that, multiple connections may get you multiple servers in our AWS pool, but I'm not sure we want individual clients opening many simultaneous connections and potentially overwhelming the pool.

huisaddison commented 4 years ago

I did see #275 and I think it's a great addition.

Anecdotally, I did find a speedup from parallelizing (even on my rinky-dink 4-core laptop). I can provide some timings, varying n_core, to give more color on how useful this could be; a harness like the one below would produce them.
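
Roughly this sort of thing, say (the signal, date range, and core counts are placeholders, not what I actually ran; mclapply forks, so it won't parallelize on Windows):

library(covidcast)
library(parallel)

start <- as.Date("2020-05-01")
offsets <- 0:13  # two weeks of single-day queries

fetch_one <- function(dt) {
  d <- start + dt
  suppressWarnings(
    covidcast_signal("jhu-csse", "confirmed_incidence_num",
                     d, d, geo_type = "state"))
}

system.time(lapply(offsets, fetch_one))  # serial baseline
for (nc in c(2, 4)) {                    # vary the core count
  print(system.time(mclapply(offsets, fetch_one, mc.cores = nc)))
}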

(If I may hypothesize about why this helps: each day is a tiny amount of data (compared to, say, streaming video), so we're nowhere near the Internet bottleneck. I think the bottleneck is all the post-download checking and processing we do for each day inside covidcast_day(). I also suspect that repeatedly opening and closing a server connection for each day contributes to the slowdown in the serial case, but I may be wrong.)

FWIW, this is the function I originally wrote (in which I wrap the covidcast_signal() call in mclapply):

# PARCOMPUTE (logical) and N_CORE (integer) are globals set elsewhere in
# my script; bind_rows() is from dplyr.
download_delayed_data <- function(source_name,
                                  signal_name,
                                  start_day,
                                  end_day,
                                  geo_level,
                                  n_delay) {
  # Day offsets from start_day, one per single-day query.
  iterdays <- 0:as.integer(end_day - start_day)
  if (!PARCOMPUTE) {
    lfunc <- lapply
  } else {
    # mclapply forks, so this only parallelizes on Unix-alikes.
    lfunc <- function(x, f) parallel::mclapply(x, f, mc.cores = N_CORE)
  }
  # Fetch each day as of (day + n_delay), then stack the results.
  dplyr::bind_rows(lfunc(iterdays, function(dt) {
    suppressWarnings(
      covidcast_signal(source_name,
                       signal_name,
                       start_day + dt,
                       start_day + dt,
                       geo_type = geo_level,
                       as_of = start_day + dt + n_delay))
  }))
}
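
For completeness, a hypothetical call (the source, signal, dates, and delay below are placeholders):

PARCOMPUTE <- TRUE
N_CORE <- 4

df <- download_delayed_data("jhu-csse", "confirmed_incidence_num",
                            as.Date("2020-05-01"), as.Date("2020-05-14"),
                            geo_level = "state", n_delay = 2)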

The PR was meant to illustrate that moving the parallelism into the package shouldn't require much code change, and to provide a good-faith first stab at an implementation (so I'm not just sitting here asking for new features :) ).

capnrefsmmat commented 3 years ago

@huisaddison The speedups from #275 are now available on the r-pkg-devel branch if you'd like to test them out. Then we can evaluate whether parallelization is still beneficial or whether the improvements in #275 are sufficient. They'll make state-, MSA-, and HRR-level downloads much faster by packing more days into each API call; county-level downloads of dense signals like cases and deaths will still take roughly as long as before.
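
If it helps, something like this should pull the branch for testing (assuming you have the remotes package and that the R package still lives under R-packages/covidcast in this repo):

remotes::install_github("cmu-delphi/covidcast", ref = "r-pkg-devel",
                        subdir = "R-packages/covidcast")

# Time a multi-day state-level fetch, which should now be packed into
# fewer API calls.
library(covidcast)
system.time(
  covidcast_signal("jhu-csse", "confirmed_incidence_num",
                   as.Date("2020-05-01"), as.Date("2020-05-31"),
                   geo_type = "state"))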