flightstats / hub

fault tolerant, highly available service for data storage and distribution
http://www.flightstats.com
MIT License
103 stars 35 forks source link

Map/Reduce-style Computations Over Timeseries Data #836

Open crozierm opened 7 years ago

crozierm commented 7 years ago

There are an increasing number of use cases where - largely for research, prototyping, and historical curation - we need to perform aggregation and transformation operations over large swaths of data stored in hub channels. While we've traditionally performed transformations and aggregations like this with long-running singleton "channel crawlers", we would like to exploit a more parallel approach, using tools like Lambda, Spark, Beam, etc... exploiting the horizontal scalability of AWS, but without impacting Hub performance.

One easy example is with historical flight position data, where we'd like to aggregate the individual positions into documents based on a variety of criteria:

There is a relatively large amount of this data:

50 million requests isn't crazy, but obviously there is a point where if we overly-parallelized, we might effectively DoS the hub. We'd like to know what a safe parallelization level currently is.

If there is a concern about higher parallelization factors, we'd like to request a feature where we can initiate very high volumes of historical (non-spoke) reads. For instance, perhaps there is a way where instead of reading from the hub directly, we request that the Hub perform a highly concurrent "spray" of payloads into our queue or Spark cluster, so that the Hub has an opportunity to first scale itself up or otherwise limit the activity a level where it doesn't interfere with more time-sensitive work.

tlehman commented 6 years ago

Have we (at least tentatively) decided on Amazon Kinesis for this?