Map/Reduce-style Computations Over Timeseries Data

There is an increasing number of use cases where, largely for research, prototyping, and historical curation, we need to perform aggregation and transformation operations over large swaths of data stored in hub channels. While we've traditionally performed transformations and aggregations like this with long-running singleton "channel crawlers", we would like to take a more parallel approach, using tools such as Lambda, Spark, and Beam to exploit the horizontal scalability of AWS without impacting Hub performance.

One simple example is historical flight position data, where we'd like to aggregate the individual positions into documents based on a variety of criteria:

by flight history id, if any
by hex code and date
by tail number and date
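
As a rough illustration of the grouping described above, here is a minimal Python sketch. The field names (`flightHistoryId`, `hexCode`, `tailNumber`, `timestamp`) and the ISO-8601 timestamp format are assumptions made for the example, not the actual payload schema.

```python
from collections import defaultdict


def grouping_keys(position):
    """Yield every (criterion, key) pair that applies to one position record.

    All field names here are hypothetical; substitute the real payload schema.
    """
    # Assumes ISO-8601 timestamps such as "2016-03-01T12:00:00Z",
    # so the first 10 characters are the date.
    date = position["timestamp"][:10]
    if position.get("flightHistoryId") is not None:
        yield ("flight_history_id", position["flightHistoryId"])
    if position.get("hexCode"):
        yield ("hex_and_date", (position["hexCode"], date))
    if position.get("tailNumber"):
        yield ("tail_and_date", (position["tailNumber"], date))


def aggregate(positions):
    """Group positions into one document (a list) per applicable key."""
    documents = defaultdict(list)
    for position in positions:
        for key in grouping_keys(position):
            documents[key].append(position)
    return dict(documents)
```

A single position can land in more than one document, which is why the "map" step emits several keys per record before the "reduce" step collects them.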

There is a relatively large amount of this data:

one channel for each of three sources
500-1000 payloads per hour, per channel
each payload is a batch of up to a few hundred positions, each for a different flight
we will be operating over roughly 28 months of historical data (probably at most 50 million hub payloads)
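
For a sense of scale, the figures above can be multiplied out. This is only a back-of-the-envelope sketch: it assumes 30-day months and that every channel sustains the stated hourly rate for the whole period.

```python
CHANNELS = 3
HOURS_PER_DAY = 24
DAYS = 28 * 30  # assumption: 28 months at roughly 30 days each


def estimate_payloads(payloads_per_hour):
    """Total payloads across all channels at a constant hourly rate."""
    return CHANNELS * payloads_per_hour * HOURS_PER_DAY * DAYS


low = estimate_payloads(500)
high = estimate_payloads(1000)
print(f"{low:,} to {high:,} payloads")  # 30,240,000 to 60,480,000 payloads
```

The rough 50 million figure sits inside that range, toward the upper end, which fits as long as the channels rarely sustain the full 1000 payloads per hour.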

50 million requests isn't crazy, but obviously there is a point where, if we over-parallelize, we might effectively DoS the Hub. We'd like to know what a safe level of parallelism currently is.
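
Whatever the safe level turns out to be, the client side can enforce it with a fixed-size worker pool. The sketch below is one possible way to cap in-flight reads; `fetch` is a hypothetical caller-supplied function (for example, a thin wrapper around an HTTP GET of one payload URL), and the default of 8 workers is a placeholder, not a recommendation.

```python
from concurrent.futures import ThreadPoolExecutor


def read_all(payload_urls, fetch, max_workers=8):
    """Fetch every URL with at most max_workers requests in flight.

    fetch is a hypothetical callable taking one URL and returning its payload.
    Results come back in the same order as payload_urls.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, payload_urls))
```

Note that `pool.map` submits all work up front and the results are collected in memory, so for tens of millions of payloads the URL list would need to be processed in chunks. The same cap would also have to account for every Lambda invocation or Spark executor running concurrently, since each one would hold its own pool.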

If higher parallelization factors are a concern, we'd like to request a feature that lets us initiate very high volumes of historical (non-spoke) reads. For instance, instead of our reading from the Hub directly, we could request that the Hub perform a highly concurrent "spray" of payloads into our queue or Spark cluster, so that the Hub has an opportunity to first scale itself up or otherwise limit the activity to a level where it doesn't interfere with more time-sensitive work.

Map/Reduce-style Computations Over Timeseries Data #836