Analytics querying performance (aka, think about what to do with analytics)

Due to a rather significant spike in API hits, our analytics database has suddenly grown by quite a bit for this month. This is negatively impacting the performance of the analytics queries and graphs in the admin tool. I've increased our hardware sizes to get things generally working, but things can be quite poky the first time an admin performs queries until the caches spin up.

So one issue is that we perhaps just need to look into tuning our ElasticSearch queries or the ElasticSearch database itself. In particular, the "Filter Logs" view could possibly be optimized, since it's really performing several different requests in parallel (one date histogram for the chart over time, more aggregations for the "top X" IPs/users, and another for the most recent raw logs for the table). So that's probably worth a look, but I'm not too sure there's much different we can do with those queries if we want all that information on one page.

The more general issue is that I think we need to take a longer look at how we handle analytics data, since with our current approach this problem will only get worse with more usage, and I'd like to have more of a plan other than throwing more hardware at it when things break (or if more hardware is the plan, that's fine, but I'd at least like to have that planned a bit better in terms of when we need it, costs, etc). So a few random notes on that front:

We currently log every request into ElasticSearch and then perform aggregations on the fly depending on the queries. We're pretty much doing what the ELK (ElasticSearch/Kibana/Logstash) stack does, but slightly customized for our API use-case. We also store logs indefinitely (so for many years, rather than I think a more common ELK use-cases where it's just a couple weeks or a month of raw log files before you prune). While we've had a spike in volume, it still doesn't seem like anything too unreasonable, so it would be interesting to know what hardware requirements other people typically use for ELK installations of our size to see if we're just doing anything silly.
We could dramatically speed things up by pre-aggreating and binning our analytics, but my struggle with that approach has always been that we don't always know how we want to slice and dice the requests in the future, so storing all the raw logs has come in quite useful. And even in some cases where we have a good sense of how to pre-aggregate the requests (like total hits per hour/day), that's complicated by the fact that agency admins should only be able to see hits for their APIs. Right now, it's simpler to query all that on the fly, since it makes the permissions easy and fluid, but if needed, we could probably devise a system to pre-aggreate common things.
This latest spike is interestingly mainly caused by one user making a ton of over-rate-limit requests. On the one hand, our rate limits are being effective, which is good, but on the other hand, it means we're logging a ton of over-limit error messages. This is an interesting problem because I definitely found our logging of all these over rate limit errors helpful at first (since it helped pinpoint where all this additional traffic was coming from), but at this point logging all of those continued errors seems like a lot of noise.
As an alternative, I've been intrigued by InfluxDB and whether it might be better suited to the task over ElasticSearch (since this use-case seems exactly what they're targeting), but it's still quite new. I'd also be interested in how Postgres or Cassandra performs in comparison with similar volume (there's also some interesting stuff happening in Postgres with columnar store plugins that might be better suited for this, but most of those seem like commercial options at this point).
As another alternative, there's always the option of external providers for analytics, like Google Analytics or Keen.io. If someone else wants to figure out these more difficult problems, I'm all for that, we'd just need to determine if it would meet our requirements, the cost implications, and if we're okay with offloading a portion of our stack like that.
- I thought we had a GitHub issue discussing Google Analytics already floating out there, but I couldn't find it. In any case, now that Universal Analytics is out there, I think storing API metrics out there is more feasible, but there are still things to consider (rate limits on their end, pricing, and the fact that we couldn't store api key/user information).
- Something like Keen.io might be a bit more flexible, but that would require more research, budget, etc. We'd be above their "Enterprise" pricing plan this month and into "Custom" plan territory, but I did notice a little blurb on their pricing page about helping out with open source and open data, so that might be an interesting conversation to have with them if we wanted to pursue this.

@cmc333333: Just to followup on our conversation, Druid was the other interesting database I had stumbled upon in this space.

We've decided to take a look at this, since analytics performance does suffer with more traffic and is probably the biggest pain point in the admin.

I've spent part of this week exploring some options and ideas in this space. My thoughts on the best plan has changed pretty much each day, so unfortunately I think my thoughts might be a bit jumbled, but I want to at least get some of these ideas out there for feedback.

Background

I described above how we're currently logging things with ElasticSearch, but here's a bit more technical details about our current setup:

We run 2 m4.xlarge servers for ElasticSearch (all data is present on both servers). So we're not talking super-beefy hardware or a big cluster.
We're storing around 750 million log entries dating back to 2010 which consumes around 550GB of disk space on each server (but since the data is on both servers, it consumes a little over 1TB in total).
Here's a chart showing how many hits we've had each month, along with how much disk space storing the analytics for that months takes:
Here's a chart showing the cumulative hits and per server storage space used:

Feature Requirements/Desires

It would also help to give a brief overview of the various things we're currently doing with analytics. In some ways, I think we have a variety of competing requirements which is what makes all this a bit tricky to tackle. I'm certainly open to simplifying or removing functionality if it turns out nobody is using it.

I also lucked out this week and there was a small flurry in analytics activity from NREL folks needing to pull various things. So I interviewed a few NREL people on whether they were able to easily get the analytics they wanted out of the current system, or what issues they ran into. It was a small sample size of users, but I think it at least helped demonstrate some of the concrete use-cases from average admins.

So here's some use-cases/requirements/what we're currently doing:

Long-duration summaries versus short-duration summaries:
- I think probably the more common thing people want are simple aggregates over longer periods of time. For example: how many hits per month did my API have for the past year? Or who are my top users in the 2 years? Or what was my my most popular API for each month in the past quarter?
- Unfortunately, these longer duration aggregates are precisely the thing that we struggle the most with. ElasticSearch does a decent job of answering these fairly quickly, but since we're fundamentally dealing with individual request data, you can see why these queries over long periods of time are computationally intensive and slow.
- How many months worth of data you can pull aggregate numbers for is pretty variable depending on various factors (namely the filters you have in place, how much traffic those APIs had during that time period, and whether ElasticSearch's caches are warm), but it's probably around 3 months - 6 months. If you try to pull more months than that, things can go into a bit of a tizzy, since the requests will timeout, but ElasticSearch will continue to try to process the requests in the background. So the failure behavior isn't really ideal either.
- Based on talking to the NREL users this week, this was probably the biggest pain point for them (I can also attest to this from my own experiences too). They were typically wanting summary counts for longer durations (3 months - 2 years), but after experiencing timeouts, they would have to revert to pulling the numbers for smaller windows of time separately (eg, 6 months at a time).
- On the flip side of these long-duration aggregates are short-duration summaries. While I suspect most people want daily/weekly/monthly counts for high-level reports more frequently, we also see value in being able to look at more granular things, like hits per-minute over time.
- These per-minute charts can be quite helpful when debugging traffic spikes or investigating API abuse.
- Currently, you can only view the per-minute chart for 2 days at a time, but right now we support viewing any arbitrary 2 days in history.
- Typically, I've only ever used the per-minutes charts for looking at traffic within the past few days (since it's usually related to investigating something that's going on or recently happened), so while we currently have the data indefinitely, I don't think we would necessarily need to be able to support these charts for all of time. Probably capping per-minute data at the last 2 weeks would be plenty.
- Per-hour charts are a similar story, but I'd say those can be more useful over longer periods of time for understanding daily traffic patterns. However, I still can't imagine much use for hourly charts older than 2-3 months.
- In interviewing the NREL users, nobody this week needed this kind of granular data. One user has used them in the past, but another has no use for them ever. I do use these charts periodically myself, so I do find them useful.
- If I were the only one using them, I'd say we should get rid of them, but I know I've at least walked a few other agency users through using them when we've been looking at API abuse or load spike issues. So I think they are useful, but I think these are usually for a slightly different audience (people more on the development/operations side of APIs, rather than people looking at the big picture value of APIs). We could decide that this isn't our business to provide this kind of level of detail, and leave it up to the API backends, but I think without us providing these metrics, some things would be difficult to investigate (like API key abuse, since the API backend doesn't see all the traffic we do).
- It's actually been on my personal wishlist to allow for per-second charts for recent data too, since sometimes the per-minute isn't granular enough.
Aggregate counts versus individual request data
- We've talked about aggregate counts above. This includes high-level sums, averages, etc, over some period of time. So this includes things like: How many hits did I have each month in the past 6 months? Or what was my API's average response time in the past month?
- Again, these kind of aggregates tend to be more computationally intensive and slower to calculate over longer periods of time (since there's more work involved in filtering and aggregating all the underlying logs).
- Everyone I talked to at NREL is interested in this kind of summary information.
- On the opposite side, we have the raw, underlying log data for each individual request.
- We expose this and make this downloadable via the "Filter Logs" screen (basically, the top half is aggregate counts in the charts, and then the bottom half is raw log data).
- This is definitely an area I've been wondering about whether we need to store all this raw data indefinitely. It gives us a lot of flexibility, but it leads to most of our storage demands (storage isn't super expensive, but it's still something I'd like to try to keep in reasonable check).
- One of the NREL users did use this functionality this week to download the last 3 years worth of raw logs for the PVDAQ API. They did this so they could do analysis on how the system_id query string parameter was being used (basically, figuring out which systems were most frequently being called via the API). This is just one data point, but it does demonstrate a potential use-case.
- I'm not sure how commonly other admins have done anything like the above example of 3 years worth of raw logs is, but if we do want to support this, I'm not sure there are a lot of alternatives to storing individual request logs (I think it would be hard to roll things up in a more efficient manner and still allow something like the analysis on an arbitrary query string, like the system_id example above).
Real-time data
- It's worth noting that after an API hit is made, the analytics for that hit will be reflected in the admin tool within a couple seconds. So our view of the analytics is generally pretty close to real-time.
- I think one big question in looking at alternatives is how important is this real-time view of the data? I'd say real-time is certainly nice, but if things weren't real-time, how big of a deal would a 1 minute delay be? 5 minute delay? 1 hour delay? 1 day delay?
- I forgot to ask the NREL users about this explicitly, but for the types of reports they were pulling this week, I don't think real-time was particularly relevant (since they were mostly pulling historical data).
- I suspect for historical summary reports, real-time doesn't matter too much. However, if you're investigating load spikes, abuse, or monitoring a new API's launch, then real-time results (or as near as possible) are pretty useful.

High Level Ideas

So how do we make things better? Here are some key areas I was looking into:

Columnar storage: Since we're dealing with analytics queries, columnar storage generally seem to be a much better fit for our workload.
Compression: We could benefit from compressing most of our raw data, for a couple reasons: 1: At the expense of some extra CPU load, it will help alleviate disk I/O, which is probably more beneficial. 2: A lot of the older stuff isn't necessarily used frequently, and reducing storage costs is always nice.
Partitioning: We currently partition our data based on dates so that we only access the data for the date range of the query involved. This can help dramatically since it means we're not querying unrelated data outside of the timeframe. One area I'd like to explore is also partitioning the data by the hostname of the request. This approach would keep each agency data partitioned, which would be a nice characteristic so that one agency generating a lot of traffic would have much less of an impact on other agencies with less traffic.
Pre-aggregate summaries: No matter what fancy database technologies we employ, I think the crux of the problem boils down to the fact that we're currently filtering and analyzing all of the raw request data on the fly to supply aggregate numbers. This might be okay if every summary we were providing was different, but we definitely have some common, high-level summaries we commonly supply. If we pre-calculated these aggregate numbers, it would dramatically help us answer these questions much more quickly and efficiently (for example, how many API hits there were for each month in the past 3 years). Since I think we're interested in answering these type of questions more efficiently (and without needing a bigger cluster of machines), I think this is fundamentally what we need to do, but there are a few considerations:
- What dimensions or combinations of dimensions do we need to pre-aggregate on (for example, request host, request URL path, API key, IP address, response HTTP status code)
- Do we only allow querying on these pre-computed dimensions, or do we still allow for slower queries against unexpected dimensions to take place against the raw data? If we still allow for slow queries, how do we have two fundamentally different interfaces for fast vs slow queries, or do we try to provide a single interface and intelligently detect when we can use the fast aggregates and when we cannot?
- When do we perform the pre-aggregations? Do we do it on the fly by incrementing various counters for different dimensions as we log new requests? Or do we perform aggregate queries at some fixed interval (eg, every 5 minutes) and store those results (in which case, the aggregate counts may not always be live)?

Things I've Looked At

Here's a very quick rundown of various technologies I've looked into this week. Basically, there's a crazy amount of options in this space, all with subtly different characteristics and pros/cons. It also seems like this landscape is pretty quickly evolving, and some of these options are really new. I should heavily caveat my summaries with the fact that I haven't had a chance to benchmark many of these with our real workloads, so a lot of these are my (probably uninformed) thoughts just based on reading various things. In any case:

Hot Takes

Kylin: This takes a MOLAP approach to storing raw data and then providing pre-computed cubes to answer aggregate queries. The really intriguing thing is that I believe you can just write a SQL query, and the system will determine whether it can be answered by the pre-aggregated cubes or whether it will need to query the raw data. This single SQL interface into both types of queries is pretty appealing. The data cubes results aren't real-time, but I think can be refreshed intelligently on a pretty quick basis (I found references to eBay refreshing every 5 minutes). So there are definitely some pretty appealing aspects to this in solving our competing needs with a single solution, but it's still pretty new (although it comes from eBay where it seems like it's been deployed for a while), the documentation is a bit light, and the stack looks a bit involved in getting setup (both Hive and HBase needed).
Druid: Intriguing and seemingly built to address a number of the exact problems we face (efficiently handling event data while retaining the ability to slice & dice by arbitrary dimensions). However, my main concern is that we wouldn't see many benefits from its roll-up design because of the high cardinality of our raw data. I think we care about logging things like the full request URL (including query params) in combination with the user/API key that made the request, so it doesn't seem like we'd really see very many duplicate items in time windows that would rollup into a single event. It's also a somewhat significant stack to setup (I think a minimal production setup looks like 6 servers).
InfluxDB: Another new timeseries oriented database on the scene. It's completely self-contained which is appealing compared to some of the more complex setups, and I think continuous queries could help perform custom pre-cached aggregations on the fly. The downside is that I've read it may not be great at dealing with raw data with lots of dimensions, and it's still very new and undergoing pretty significant changes (they're building a new storage engine right now), which makes me a bit leery to jump on this for production use.
ElasticSearch 2: ElasticSearch 2 offers better compression options and defaults to doc values, both of which should have benefits (we never got around to turning on doc values in version 1). I'm running some tests to see how things compare. This could improve things, but I think we still need some sort of pre-computed aggregations solution.
Drill: SQL on top of Hadoop/HDFS. It aims for low-latency, so it's much faster than standalone Hadoop, but we're still fundamentally limited by the need to scan a bunch of data to answer simple questions like monthly hits for a specific API for the past 3 years (which we could distribute, but I don't think we're super keen on spinning up a big cluster of servers). But I think one interesting aspect of using this over ElasticSearch for storing all the raw data is using something like Parquet for storing all our raw data. It's not exactly a fair comparison, since these don't include all the indexes that ElasticSearch does (so things will be slower), but our October data went from consuming 70GB (x2 for both nodes), to consuming 7GB in a compressed Parquet file. That's a pretty dramatic difference, and opens up the door for storing the older data on something like S3, which has some nice characteristics (redundancy and cheaper). I think it will be slower at answering analytics queries than our current ElasticSearch setup (since we have everything indexed and local), but I think it may be less resource intensive and that might be an okay tradeoff if we were only using this to answer more selective historical queries.
Tajo: Another SQL on Hadoop option. My take is similar to Drill above, but it also allows for ORC files, which might be a bit better in terms of compression and performances.
Presto: Another SQL on Hadoop (or other backends) option. Similar to Tajo or Drill, but it might offer better query performance (also seems to have a pretty optimized ORC implementation).
Phoenix: SQL on HBase option. I haven't gotten much further than that, but HBase might be more appropriate than HDFS for storing the live data as it streams into our system (although it does appear we could append to HDFS files with something like Flume).
PostgreSQL: Postgres might be a good option for storing the pre-aggregated summaries, but I'm not sure it would be a good fit for storing all of our raw data given the size and cardinality of it. Here's a good read of how CloudFlare stores analytics with postgres, and their type of traffic would be similar to ours (HTTP requests). However, they use CitusDB (a commercial Postgres fork), which I think might help with the storage of raw data, but then they are also storing rollup data separately.
PostgreSQL & cstore_fdw: A columnar extension for postgres, which might help with storage and performance for analytics workloads.
MonetDB: A standalone columnar database. Here's also a comparison to cstore_fdw. The standalone nature looks appealing compared to some of the more complex hadoop setups, but if we needed to scale this out to multiple machines, it seems like the path is a bit fuzzier (although they added some distributed stuff this summer).
Prometheus: I don't think this would be suitable for storing the raw data, but it could be an interesting option for storing aggregate metric counters. One of the more interesting aspects of this versus other options in this space is the built-in alerting functionality, which we could possibly leverage to do interesting things with monitoring/alerting we've talked about.
Google Analytics: I'll throw this out there agin, because it would be nice to offload this to someone else. However, the main issue I think is volume of data (I'm not sure our volume is allowed under the current paid account), and the lack of api key information. Since we can't send user information to Google Analytics, we'd have no way to associate the data with api keys/users. Maybe this would still be interesting for aggregate counts, but in interviewing the NREL users this week, they were all interested in identifying the specific API keys that were users.
Piwik: Sort of like an open source version of Google Analytics. It's definitely geared towards website traffic, which makes it a bit of an odd fit in some cases, but I'll throw this out there because they do allow for a concept of user IDs. What's appealing is they've essentially figure out a lot of the combinations of things you might query on, and they are performing all the necessary rollups/pre-aggregations. It also provides a lot of dashboard functionality out of the box. The downside is that it's limited in some areas (for example, the most granular they get is 1 hour), it's not a perfect fit for API analytics (there are certain things we're more interested in that they don't aggregate on), and we'd have to figure out how to shoehorn this into our permissions system (it's a PHP & MySQL app, so it's also pretty different from our current stack).
Custom Pre-Aggreations: At the end of the day, rolling our own pre-cached aggregations may be the simplest option. I just don't want to reinvent a kludgy wheel if there are better options out there, which is why I've gone down into this rabbit hole of other options.
- I've given some thought to the various analytics dimensions we're mostly interested in, and I think I have a decent sense of what would help the most.
- I think the main questions then get back to how we would build this out and the questions I proposed above under "High Level Ideas" (namely, how do we integrate this into our existing interface which allows for querying by anything, and how do we integrate the collection of these aggregates).

Summary Thoughts

Well, I have some other summary thoughts, and I meant to re-read this, but in an attempt to get this out before our meeting today, I'll just toss this up in it's current state (which is probably littered with typos/incoherent babbling/trailing thoughts...).

@cmc333333, @gbinal: I'm waiting for some super-long running data processing jobs to finish (they've been running all day), so unfortunately I don't quite have a full status update on this analytics front today. However, I think I have a decent idea on a couple approaches to tackling this analytics stuff that I'll outline in more detail this week once I get some of those final test results.

But for some quick updates and summary:

I did get our postgres DBA to give us a sanity check on the raw-Postgres approach. He tried a few creative indexing approaches that may help, but in the end his discoveries largely mirrored what I had found: unless we can get the queries to always hit specific indexes, counts are going to be exceptionally slow. So unless we shift away from allowing ad-hoc queries, it doesn't seem like this is going to be a great fit.

He did point me at PipelineDB as another possibility, and while we don't need another option in the mix, PipelineDB actually offers some pretty interesting capabilities. It's continuous views are extremely similar to what InfluxDB offers, but PipelineDB is built on top of Postgres, which makes me more comfortable (versus the brand new storage engine InfluxDB is working on). This also allows for some interesting possibilities, like combining it with the cstore_fdw plugin for columnar storage of the raw data in Postgres (which I experimented with earlier and does seem more suitable for our workload).

But the main thing I've been looking at is Kylin, which seems like it might be best suited for our combination of known queries as well as ad-hoc queries. After more testing, it seems like it can handle all the types of queries we currently want to perform in an efficient manner using its pre-processed data cubes (which are easy to generate). The current test I'm running involves more data segmented on lots of dimensions, which is taking a very long time to pre-process. That makes me a little concerned about trying to handle real-time-ish data, but this is also sort of a worst-case scenario test, since the dimensions aren't configured in a very optimized fashion, and we may not actually need all these dimensions pre-computed (Kylin may also be releasing functionality to explicitly deal with real-time data later this year). But once this test finishes, then I'll have a better idea of where things stand.

My general sense is that using Kylin would allow for much simpler coding within the admin app, but at the sake of stack complexity (eg, more moving pieces and components that need to be running on the database server). On the flip-side, PipelineDB would probably be a smaller stack to setup, but it would involve more custom coding.

Whatever we do is going to involve more changes than simply picking database X, so my plan is to put together a more detailed outline of how exactly these options would be implemented (for example, how data gets ingested, how we partition the data, how we perform common queries, how we perform less common ad-hoc queries, how querying aggregate information works, how querying raw request records works, etc). I'd like to get those more detailed outlines together by this Thursday for both Kylin and PipelineDB. While I'm probably leaning towards Kylin at this point, I think having some other approach for comparison would be useful, and thinking through a couple of different approaches in detail would probably be valuable. But then hopefully we can chat about this on Friday and move forward with implementation.

Well, after more testing, I'm now waffling back and forth. But on the upside, I think we have a couple pretty decent options. There's really a multitude of options, but here's the quick summary of the couple of options I've focused on:

Kylin: Kylin pre-computes our aggregations using data cubes. Raw/historical data is queryable in Hadoop. On the downside, it requires a fully functional Hadoop+Hive+HBase environment, which is a pretty heavy-duty setup with a lot of components that add complexity. On the upside, that Hadoop ecosystem offers a clearer path towards scaling to significantly larger volume.
PipelineDB+cstore_fdw: PipelineDB computes aggregations on the fly using "continuous views." Raw/historical data could be queryable in the same database via the cstore_fdw plugin (columnar storage) for Postgres (since PipelineDB really is Postgres). On the downside, scaling this setup is a bit more difficult/unknown (PipelineDB just released a commercial offering last month that has some things to help with scalability). On the upside, it should scale quite a bit beyond the issues we're currently running into, and this stack is significantly lighter weight to initially setup and run.

(and cc @david-harrison who helped look at postgres, mentioned pipelinedb, and generally bounced ideas around--thanks!)

So what follows will probably be an incomprehensible, messy brain-dump after exploring these options for a bit this past week, but I figured I should make note of some of this. I'm not sure I have a firm conclusion yet, but I can give you the tl;dr when we talk tomorrow on the phone, and then we can hopefully move from there with a decision.

Analytics Queries

Since I'm not sure we've clarified this anywhere before, here's a quick summary of the different analytics screens we currently have and what they would need if we want to keep things the same (but again, simplifying or changing our interface and capabilities is always be on the table):

API Drilldown
- Presents the user with a list of results and a stacked time-based histogram chart.
- The first level provides a list of all the domains you have access to and the number of hits for that domain. The chart portrays this same information over time (with the stacking showing you the totals for the top 10 domains).
- If you drill into the domain, you then see a list of all the top-level URL paths (eg, /foo and /bar) and the number of hits. The charts show the same over time. Drill in again, and you see second-level URL paths for that parent path (eg, /foo/moo and /foo/boo). Repeat to whatever depth you want.
- Can filter the results by essentially any condition or field.
- Can change the date range.
- Can change chart to bin things by month, week, day, hour, or minute (minute only allows viewing of 2 days of data).
Filter Logs
- Presents a user with a list of raw requests that meet their filter criteria
- Includes a time-based histogram chart of the requests.
- Includes summary information of the requests meeting the filter criteria:
- Total hits
- Number of unique users
- List of top 10 users and the number of hits they've made.
- Number of unique IP addresses
- List of top 10 IP addresses and the number of hits they've made.
- Average response time
- Includes a table and downloadable list of all the raw requests meeting the filter criteria.
- Can filter the results by essentially any condition or field.
- Can change the date range.
- Can change chart to bin things by month, week, day, hour, or minute (minute only allows viewing of 2 days of data).
By Users
- Includes a table and downloadable list of all the users and the number hits they've made that meet the filter criteria.
- Can filter the results by essentially any condition or field.
- Can change the date range.
By Location
- Includes map showing the number of hits in each country of requests meeting the filter criteria.
- Can drill down into regions/states and then down to cities.
- Can filter the results by essentially any condition or field.
- Can change the date range.

Backwards Compatibility

Since the frontend only interacts with the analytics via our REST APIs, the basic idea would be to update those APIs in a backwards compatible way by replacing the ElasticSearch queries with whatever new queries we have. I think this should be relatively straightforward for any queries that use the form interface for building queries since we get the user's various filters as structured JSON conditions.

At some point, I think it would make sense to begin transitioning some of the admin APIs over to Lua to simplify our new stack, but for this iteration, I think we just keep all the APIs in the Ruby on Rails project to minimize change.

The main tricky part would be supporting the older "Advanced Queries" input, where we accepted raw Lucene queries. I don't think this is frequently used, so I'm not sure how far we want to go to support these queries in a backwards compatible way. But I know I have sent out some links that people have bookmarked using that query style (for lots of conditions, the form builder interface can be pretty cumbersome), and there are also certain types of queries you can't really achieve via the form builder (eg, range queries or queries on fields we still haven't exposed in the UI). In any case, if we do want to try to support the advanced queries, I found lucene_query_parser, which might be of use in transforming the lucene queries into something we could translate into our SQL conditionals.

Cardinality

What we're trying to do is perform pre-aggregations on all the most common fields we need to display for our analytics results. In the following examples, I've tried to setup realistic scenarios for what specific fields I think we need to pre-aggregate on in order to answer the bulk of our common analytic queries. However, it's worth calling out cardinality and how that interacts with some of our required filters and relates to performance.

Basically, high cardinality fields are the nemesis of pre-aggregation. If a field has high cardinality and the values are all different, then aggregating on that field doesn't really achieve any savings (since each value is unique). In these setups, I've included the URL path (so for example, on https://developer.nrel.gov/api/alt-fuel-stations/v1.json?api_key=DEMO_KEY&fuel_type=E85,ELEC&state=CA&limit=2, the URL path would just be /api/alt-fuel-stations/v1.json). We need to include this for a few reasons:

Our admin permissions are based on a concept of a URL prefix (for example, I might have permissions on api.data.gov/regulations/*), so we need some way of determining that filter for nearly any request for non-superuser admins.
We need to be able to efficiently query this for the API Drilldown report, where we're essentially traversing the hierarchy of the URL path.
Filtering to view just a specific API (based on the URL path prefix), is probably one of the most common custom filters applied.

However, this field causes us probably the most problems with high cardinality (and potential for abuse). User IDs/API keys and IPs are also up there, but the URL path has significantly more unique combinations than anything else in our data. The example above rolls up nicely (/api/alt-fuel-stations/v1.json), but where it begins to cause problems is when dynamic IDs are part of the path (which is pretty common in RESTful APIs), like /api/alt-fuel-stations/v1/72029.json, /api/alt-fuel-staitons/v1/70030.json, and so on.

It is hard to optimize around this problem given our current admin permissions and the fact that we don't know much about the underlying APIs (so we don't know about dynamic pieces of the URL or what really makes up a logical API that admins want to query on). I don't think we're necessarily to the point where this will cause problems, but it's an issue to be aware of. I have some ideas on how to possibly improve this, but without a more fundamental shift in what type of analytics we want to provide or how we manage APIs, I think this will be a potential problem with any solution.

Options

Anyway, onto the options! These are loosely how I think everything could work in our overall system with these two different databases. However, this is based on only some initial tests, so it's definitely possible some of my assumptions or understanding of things aren't quite right. But generally speaking, here's what I'm thinking might be possible:

With Kylin

Requirements

Kylin
Hadoop
Hive
HBase
Drill

Schema Design

Raw data is stored in a single Hive table partitioned by date (eg, YYYY-MM-DD) and request host (eg, developer.nrel.gov).
The partitions for the current day are stored in row-oriented, append-friendly files (ideally compressed sequence files, but perhaps just plain text files).
The partitions for all previous days are stored in compressed, column-oriented Parquet files (for much better querying performance and compression).

Ingest Process

Individual requests are logged from nginx and sent to a local Heka process running on the app server (this is currently what's happening).
Heka on the app server would then use the TCP Output mechanism to send the logs to a separate Heka instance running on the analytics database server.
Heka on the analytics database server would then write to log file on that server using the HDFS Output mechanism (ideally as a compressed sequence file).
Every 5 minutes, a job is created on Kylin to refresh the data cubes with the latest data.
Every day, the previous day of data is converted from row-oriented files to column-oriented parquet files. The daily partitions are updated to point to the new parquet files and the row-oriented files can be deleted.

Querying

All queries are written in SQL and executed via the provided REST APIs.
All queries are written as though they are querying the raw data. Kylin will answer queries that can be answered efficiently using the data cube results.
Queries that cannot be answered with the data cubes will instead fallback to querying the underlying Hive tables via Drill. This fallback capability used to be built into Kylin, but no longer is, so it's up to us to implement the fallback. Using Drill should provide a closer match to Kylin's ANSI SQL, and perform much better than straight Hive queries. The table partitioning setup should ensure the query automatically finds the correct data sources regardless of whether it's stored in the current day's sequence file or in a historical parquet file.

Cube Design

Dimensions
- request_at_hierarchy
- request_at_year
- request_at_month
- request_at_date
- request_at_hour
- request_url_hierarchy
- request_method
- request_scheme
- request_host
- request_path
- user_id
- request_ip_hierarchy
- request_ip_country
- request_ip_region
- request_ip_city
- request_ip
- response_status_hierarchy
- response_status
- gatekeeper_denied_code
- log_imported
Measures
- COUNT(1)
- SUM(timer_response) - Used for AVG(timer_response)
- COUNT(DISTINCT user_id)
- COUNT(DISTINCT request_ip)

Miscellaneous Notes

I'm not sure how quickly and efficiently we can really refresh the data cubes, so I'm not sure if the 5 minute latency is doable. There are some oddities around building new cubes only supporting date boundaries, and not timestamps, so I'm not sure how granular the refresh can be. I've found some references to eBay refreshing every 5 minutes in their use, but it's unclear if they're doing something like this, or taking a more custom approach.
Kylin's next release will include experimental streaming cubing, which could simplify the handling of live data, and eliminate our need to perform any manual refreshes. There's currently a release candidate out, and it sounds like it might be released at the end of this month (although the streaming will be marked as an experimental feature in that new release).
There was very recent discussion on the mailing list about possible integration with Amazon's Elastic Map Reduce service, which could simplify deployment quite a bit.

With PipelineDB+cstore_fdw

Requirements

PipelineDB
cstore_fdw PostgreSQL plugin

Schema Design

Raw data is stored in individual tables for each date (eg, YYYY-MM-DD) and request host (eg, developer.nrel.gov).
The tables for the current day are stored in row-oriented, append-friendly standard PostgreSQL tables.
The tables for all previous days are stored in compressed, column-oriented tables using cstore_fdw (for much better querying performance and compression).

Ingest Process

Individual requests are logged from nginx and sent to a local Heka process running on the app server (this is currently what's happening).
Heka on the app server would then use the Postgres Output mechanism to send the logs to two separate outputs on the same PipelineDB server: the ephemeral PipelineDB stream table, plus the normal Postgres table for daily persistent storage.
Aggregate data is updated on the fly in PipelineDB's continuous views by the inserts into the special stream table.
Every day, the previous day of data is converted from row-oriented standard Postgres tables to column-oriented cstore_fdw tables.

Querying

All queries are written in SQL and executed via native postgres connections.
Our app will have to determine which types of queries can be answered via the continuous views (faster) and which can only be answered by looking at the raw data (slower). This decision will be based on the columns being queried and what columns are present in the continuous views.
Queries that can be answered by the continuous views will need to be tailored for those views (eg, how the query performs group by and aggregations will be different).
Queries that must be answered by the raw data will need to query a UNION of all the distinct tables that apply to that query (the combination of tables that are "partitioned" by date and host)

Continuous View Design

I've largely mirrored the cube design from Kylin above, grouping by each dimension and including the same measures. It's definitely possible to structure things differently and with more separate views that might be smaller, but for now, I went with something similar to the cube design (more continuous views also slow down inserts, so you don't necessarily want to go overboard). I have been creating multiple views to group by each distinct time period (eg, date_trunc('hour', request_at), date_trunc('day', request_at), date_trunc('month', request_at)).

CREATE CONTINUOUS VIEW logs_hour AS
  SELECT date_trunc('hour', request_at),
    request_method,
    request_scheme,
    request_host,
    request_path,
    user_id,
    request_ip_country,
    request_ip_region,
    request_ip_city,
    request_ip,
    response_status,
    gatekeeper_denied_code,
    log_imported,
    COUNT(*) AS total_hits,
    AVG(timer_response) AS avg_timer_response,
    COUNT(DISTINCT user_id) AS distinct_user_id,
    COUNT(DISTINCT request_ip) AS distinct_request_ip
  FROM log_stream
  GROUP BY date_trunc('hour', request_at),
    request_method,
    request_scheme,
    request_host,
    request_path,
    user_id,
    request_ip_country,
    request_ip_region,
    request_ip_city,
    request_ip,
    response_status,
    gatekeeper_denied_code,
    log_imported;

Miscellaneous Notes

The continuous views are essentially materialized views that get updated on the fly. But given some high-cardinality fields and enough data, we could quickly find ourselves back in the same boat of slow queries (since each combination of results in the continuous view is represented as a new row). It should be significantly better than querying the raw data in row-based storage, but without loading a lot more of our sample data into place, it's tough to know just how it would perform.
The recently released Postgres 9.5 now offers a way to setup standard table partitions for foreign data wrapper tables. We'd need to use partitioning for the cstore_fdw tables. PipelineDB says their released based on Postgres 9.5 will be released soon. This could simplify some things and eliminate the use of funky UNIONs above to achieve something similar.

Pros/Cons

Kylin
- Pros:
- We can query efficient rollup data and raw data with the same SQL query.
- It's relatively easy to adjust the rollup cubes if we later find we want to improve the efficiency of certain queries (without having to touch the queries themselves).
- It's "petabyte-scale" and seems proven at eBay, so it's definitely scalable far beyond what we're dealing with.
- Top-level Apache project that recently came out of incubation.
- The project seems largely oriented around tackling exactly what we're trying to achieve. and their roadmap seems pretty aligned with things we're interested in (streaming, more statistic functions, etc).
- With 1 month of sample data loaded, it answered queries against cubes in ~1 second. But we'd still need to explore how it performs with all the data loaded.
- Cons:
- It's still pretty new in the database and opensource arena.
- It's still undergoing pretty rapid and significant changes.
- I've encountered various odd bugs (float datatypes don't really work,you have to switch to doubles), or features that aren't supported yet (no timestamp support and needing to structure dates & times as separate fields).
- It's somewhat light on statistical features (top-N is coming in the next release, but no support for percentiles, which would eventually be nice for things like performance graphs).
- Documentation is light. The basics are there, but I've had to figure a lot of things out (like recommended practices and more detailed explanations) by perusing their mailing list, old presentations (which aren't always accurate now), etc.
- Hadoop ecosystem: This is the big one. It's also what gives us the known scalability paths, so in some ways, it's also a plus. However, all the hadoop dependencies are a pretty significant and resource-heavy thing to toss onto the stack. It also adds a lot of moving pieces, parts, and complexity to running and managing the stack (setting up a proper hadoop system was by far the biggest stumbling block for me in just testing things).
  - To some degree, I'd be more comfortable if we were the only ones that had to set it up, but it causes me more headaches when trying to think about how to integrate and support this in the broader API Umbrella project in an easy to use fashion.
  - Given the interesting raw SQL approach Kylin takes, we could perhaps implement different storage adapters to try and alleviate the setup issue for the Open Source project. We could ship with something like sqlite or postgres by default, but then recommend Kylin if necessary (storage adapters do add complexity, though, and there will be differences among the different databases that aren't always fun to deal with).
PipelineDB+cstore_fdw
- Pros:
- It's really just Postgres 9.4 with extensions (so normal postgres tables can also be stored in there, and extensions like cstore_fdw can be used).
- Lots of pretty robust statistic capabilities (top-N, percentiles, most everything postgres supports, plus some more).
- It's very lightweight to install and run. All that needs to be started is the single postgres server process and that would handle both the streaming, columnar storage, and other normal tables if we want. This is probably the biggest benefit in my book, since it makes it much easier to manage and operate (even testing it was significantly easier than all the hoops I had to jump through to get hadoop running).
- Works with any normal postgres clients and tools: To connect, it's just the normal postgres wire protocol, so we'd have good support for it everywhere in our stack. And normal things like psql work, which is quite nice.
- With 1 month of sample data loaded, it answered most queries against the views in <1 second (generally faster than Kylin once things were cached). However, I'm a little more concerned we could end up with slow counting again once we loaded a lot more data in. It's tough to say without more testing or optimizing our views in different ways.
- Ingest speeds were ample for us currently: On the test server I'm running with multiple continuous views setup, I was ingesting around 400k records per minute. As of now, that would be plenty for out traffic patterns.
- Cons:
- It requires more manual logic on our app's side to figure out what queries can be answered by the continuous views vs raw data (although, the more I think about it, I don't think this is necessarily a huge lift).
- PipelineDB is extremely new: Their first release just came out last July. By leveraging Postgres, it makes me more comfortable with it than I would normally be, but it's still very new and I'm not really sure of other big users.
- Commercial path to scaling out and high availability: Since Postgres generally just runs on one server, scaling beyond a beefier single server is a bit more difficult and a less worn path. PipelineDB did release an enterprise edition 2 weeks ago (did I mention this was new??) that offers distributed queries, high availability, and more efficient storage. However, given how new it is, I'm not sure of the quality, other users, pricing, etc. And that's not to say a commercial path to scaling out is necessarily a bad thing (it could be worth it if we reach a certain scale). However, we'd probably need to get a better sense of where our boundaries would really be with the current system.
- No Amazon RDS: I'm not super concerned about this, but it's just worth noting that while PipelineDB offers nearly all the nice benefits of being Postgres, we couldn't use this on Amazon's RDS for easier hosting.
- All our eggs in this VC basket: PipelineDB is all open source (well, except for their Enterprise version), but the primary developers are the folks working for the PipelineDB startup company. If they pan out, then that would be great (I've been impressed thus far), but with startups, there's always the chance that they fizzle and disappear or pivot to something else. At least in this case, it would be open source, so it's possible other people could pick up the project. However, since everything is so new and it doesn't have a big community yet, it's hard to say if that would actually happen. In any case, it wouldn't be a lot of fun to get stuck with an unsupported database in 1-2 years. That being said, a great thing about it being SQL oriented is that it would at least make the transition to something like Kylin (or any other SQL-based solution), a little more feasible.

Anyway.. If you've actually gotten this far (again), my apologies for yet another missive. I think we have two pretty good options, just with different trade-offs. I can give you the very quick recap when we chat tomorrow, and then hopefully we can make a decision and then be on our way to much speedier analytics. Vroom, vroom! :rocket: :rocket:

You continue to be fantastic, @GUI. A few comments coming from a very limited understanding:

The main tricky part would be supporting the older "Advanced Queries" input, where we accepted raw Lucene queries. I don't think this is frequently used,

Is there an easy way to measure the usage here? If we're seeing the same bookmarked queries over and over, perhaps we should start by removing the option in the UI for any user who doesn't have it pre-filled (i.e. due to a bookmark) and see if anyone complains? I'd lean towards axing it.

we need some way of determining that filter for nearly any request for non-superuser admins.

I agree that we can't aggregate prefix match queries, but I wonder if we can limit their usage in terms of common workflows. For example, we use prefix matches for admin filtering (so it's inherently used in almost every query) -- what if, instead, we added a field to each log associating it with a particular agency? Are there other optimizations of this form, where we can cut off a huge chunk of the usage space? Would they be worth the effort?

Queries that must be answered by the raw data will need to query a UNION of all the distinct tables that apply to that query

This feels frightening to me given the number of combinations required for common queries. Consider just an admin view of the last month of data. That may well be an irrational fear; I've never encountered anything quite like this and this is what the hadoop solution is doing behind the scenes.

But you think this might be something made easier in a later release?

We could ship with something like sqlite or postgres by default, but then recommend Kylin if necessary (storage adapters do add complexity, though, and there will be differences among the different databases that aren't always fun to deal with)

Agreed. Despite the legitimate concern over complexity, I think this is a worthwhile approach.

All our eggs in this VC basket

I was sold until you brought this up. It really sways me against PipelineDB in (again) probably irrational ways. Database companies are a dime a dozen in recent years; for all of the ones that hang around, several more have failed. We don't remember the failures, though, because they never gained a user base. Maybe that's the question, then -- does PipelineDB have a user base outside of their "customers"?

Well, this is a belated followup on multiple fronts.

For anyone wondering, last month we decided to pursue the Kylin approach. A few factors contributed to the decision:

It's in the Apache foundation as a top-level project, which would seem to make it a safer bet from an open-source perspective than relying on something backed by a startup.
Hadoop's dynamic handling of partitions is quite robust and useful for swapping data in and out while processing (Postgres's partitions seem to require a bit more hand-holding).
After playing with Kylin, we are definitely in its sweet spot of functionality. It hasn't all been roses with Kylin, but when it works, it's pretty fantastic in terms of giving us significantly faster query results without us having to worry about maintaining all the pre-aggregated results ourselves.

So where are we with Kylin? It's mostly implemented, and it does look like it will solve our performance issues in a scalable way. I'll be sure to post some benchmarks in this thread to compare before and after (but it's significant for queries over a longer time period). Here's some initial architecture documentation on how these pieces have come together.

Unfortunately, though, this also seems like the my nemesis of a task that I perpetually think I'm nearly complete with, only to uncover other details or issues. So as a bit of an update for this past week, @cmc333333 & @gbinal:

I upgraded to the new Kylin 1.5 release, since it includes some new functionality we'd like, and it would save us time down the line of having to reprocess all our data in the new data format.
Things looked good under Kylin 1.5, so I finally kicked off the big bulk migration to get all our historical data into Kylin over the weekend. That was nearly complete late Tuesday, when I then discovered a bug in Kylin 1.5 with regards to how all our data had been processed. I wrote the bug up here: https://issues.apache.org/jira/browse/KYLIN-1527
After trying to figure out the best solution (we could keep using 1.5 if we processed our data differently), I ended up reverting back to 1.2 earlier today, and I've kicked off the big migration again. In hindsight, I shouldn't have tried a last minute upgrade to 1.5.
There's still some things to wrap up, but I've at least finally gotten around to opening up the pull request (sorry for the many delays, CM), and writing up some documentation for this new setup: https://github.com/NREL/api-umbrella/pull/227

Oh, and I also stumbled into one other issue on Wednesday, which also required starting over with the bulk migration again. I uncovered certain queries that wouldn't work with the way the data had been structured in certain cases (see https://github.com/NREL/api-umbrella/pull/227/commits/425f546ed76e8324e779c1e81e04e25e019e2d02 for more details). It's certainly not ideal that I just discovered that type of issue this week, but fingers crossed that it was our last setback with the big bulk migration.

Okay, finally some real progress on deploying all this!

Earlier this week, I had gotten a new, cleaned up hadoop server up and running for the new databases to run on (in a way that's repeatable and better documented with cloudformation and chef).
I've built new packages from the analytics-revamp branch and deployed the changes to all our servers.
The app servers are continuing to populate ElasticSearch, and this is what the web interface continues to use.
- Despite this remaining status quo from a functional perspective, there were some internal changes with how we send the logs to ElasticSearch (rsyslog instead of heka), and some code changes in the web admin (to accommodate the eventual switch to Kylin). So I've been triple-checking all of those changes and how it all behaves in production, and I think everything looks good.
In addition to populating ElasticSearch, the app servers are also now sending the logs to Kafka which populates per-minute HDFS files on the hadoop server. This is what Kylin will eventually ingest and process.

So in terms of next steps:

For now, I've only enabled the rsyslog -> kafka -> flume -> per minute HDFS files flow.
The next step will be to turn on the background job we have ready that appends the per-minute HDFS TSV files into day-based ORC files that are suitable for querying, processing, and long-term storage. This background job is ready, but I just wanted to get a baseline of system load for maybe a day or two without that running. So I'll probably enable this additional background job over the weekend.
After that, the next step is having Kylin process the ORC files (every 30 minutes or every hour). I still need to write that background job, but it should be pretty trivial. Again, to gather a baseline of system load, I'd like to let it run for a couple days with just the ORC conversion going on. So I'll probably plan to add the Kylin processing into the mix sometime early next week.
At that point, we're mostly done from a development and deployment perspective, but the plan was to probably let both analytics systems run in parallel for a little while, so we can do comparisons before flipping the admin UI over to using the new system (and then we'll still keep them running in parallel for a while longer in case anything goes terribly wrong and we need to revert). I've also been thinking about adding an easy way for us to flip the admin UI over to use the new underlying analytics database so we can do more ad-hoc testing.

Well, a rather belated status update on this:

Shortly after the last update, we ended up hitting a bug in Kylin that was triggered whenever Kylin began to merge together daily segments. After 7 days worth of individual data in Kylin, it would begin to merge the daily data together for better querying efficiency. However, something in our data triggered a bug in Kylin which led to the merge process to fail. This failed merge process in turn couldn't be stopped, leading to various issues.

Not much progress was made on this during the month of May on any of this due to travel schedules and working on other bug fix tasks (mainly within the admin tool).

Fast forward to last week, and I finally dove back into this. The short of it is that by upgrading to the latest version of Kylin (1.5.2), these segment merge issues have gone away. I've also reworked some of our Kylin refresh logic to simplify our automatic refresh process, and fixed several issues/bugs with the earlier setup. I've done a lot of manual testing against segment merge issues, and things seem solid. The automatic refresh process has also been running quite smoothly for the past week. There's still a few small things to sort out, but things finally seem like they're to a more stable point for all the underlying pieces (knock on wood).

So while we have unfortunately hit some snags with this task partially due to the choice to use Kylin, that was somewhat of a known risk going in (given how new the technology is). But now that things seem to have reached a more stable point, Kylin does seem to be performing impressively and doing what we need. As a quick example, the analytics drilldown query currently takes ~13 seconds from ElasticSearch for 1 month of data, and the same query returns from Kylin in ~1 second. The differences become even more pronounced when querying more than 1 month of data.

So the primary remaining things on my radar:

When I built the new cubes after upgrading to Kylin 1.5, I made a mistake with the cube design so the queries where we filter out requests we reject (due to being over rate limit, not providing a valid key, etc), aren't properly optimized for Kylin queries. Since this is one of our default filters, it's important this this be optimized. I have a new cube design ready that should remedy this issue, we just need to reprocess all the old data (and I haven't done that already, since I was interested in seeing how the live data processing was going to work over the span of a week).
There were some small discrepancies between the Kylin analytics results and our ElasticSearch results. I believe I pinpointed the problem to a bug in how both systems were handling long URLs (see https://github.com/18F/api.data.gov/issues/344). So with that fixed, I'm hoping the analytics data between the two systems should now match exactly. But we should have a better sense after a full day of running with those fixes in place.
In order to simplify some of the live data processing issues, I've switched to only processing our data in Kylin once per night. If you switch to the "minute" view in the admin tool, you'll still get live data (since that bypasses Kylin), so this doesn't feel like a terrible compromise. With this switch, the main thing I'd like to do is make some interface tweaks in the admin UI to make this more obvious (eg, default the date range to end on the previous day, so the current day doesn't always appear empty, and then perhaps more clearly label the "Minute" option as also providing live data).
There's a few specific queries that are a bit slow in Kylin (like the specific list of top 10 users in the big Filter Logs view). In general, these are still faster than what we had before, but these slow queries are for things that aren't actually used that often. In elasticsearch, we got most of this data for free with the other types of queries we were performing for the page, but these require separate queries in Kylin. So I think if we were to defer these queries until the admin actually requested the data (eg, clicked to view the top 10 users), the overall responsiveness of the interface could be improved a lot. So while this isn't strictly necessary, it would be nice to implement.

Le sigh... So a quick update on this infernal task...

As of the last update, I was mainly tracking down some loose ends for dealing with the live data processing, and then needing to reprocess all the historical data in Kylin with a different cube design (to better optimize it for the types of queries we run).

Good news: I think all the loose ends related to live data processing look good. After letting things run for a week, we weren't seeing any discrepancies in the counts between the old and new system.

Bad news: I kicked off the historical data re-processing with an updated cube design, but hit some snags with that. The updated cube design basically restructures some of our dimensions and calculations to better match how we need to query the data. The updated design takes longer to process and takes more disk space, but it allows for the queries we need to commonly perform to be more efficient (the updated design is actually pretty similar to how we had originally designed things in Kylin 1.2, so I had just gotten a bit over-eager in trying to optimize things during the Kylin 1.5 upgrade).

In any case, the problem with the new cube design is that after processing all our historical data up to 2015, things then go a bit bonkers when in the final stage of writing the 2015 data and it ends up consumes all our disk space on the server. So the general cube design seems to work for previous years with less data, but something related to the amount of 2015 data or something specific in the 2015 data triggers an unexpected surge in disk use when generating the cubes with this cube design (but we were able to process all the data in Kylin 1.5 with the previous simpler cube design).

It takes about 4 days of background processing to get to the point of failure with the 2015 data, so after the first failure I tried giving the server more disk space to see if it just needed temporary space to write data, and then ran it again. Even with more disk space allocated, we ran into the same snag on the second run, so either it needs waaayy more temporary disk space than I'd ever anticipate, or some other bug is at play (we don't see similar disk usage surges when writing the previous years' data, so I'm inclined to think it's some other bug or oversight).

After the second failure a couple weeks ago, I haven't really had a chance to look at this in more detail, but I think I should be able to look at it again this week and hopefully get another attempt kicked off. Here's my general thoughts on debugging this further:

I need to look at the the Kylin and Hadoop logs in more detail to see if they give any hints as to what's going on.
I also need to check the Kylin bug tracker to see if any similar issues have been reported with the latest version of Kylin.
Barring anything useful turning up in those locations, then my plan would be to try and generate the historical data 1 month at a time instead of 1 year at a time. Since each batch would involve less data, I'm hopeful this might sidestep the issue if it's related to the amount of data in a single batch job (since we are able to generate other previous years with less data successfully). But if that doesn't work, then it might also point towards a more specific data issue with 2015 data (which the monthly approach might also help narrow down).
We could also try adjusting the cube design again.
I may also temporarily bump up the instance size to see if I can get these historical jobs to run more quickly, so we can get quicker feedback (although, since this is all background processing time, it's not like this is a huge blocker for us right now).

Hi Nick! Just saw this thread now, and I'm impressed on how many alternatives you evaluated. I would like to suggest a missing one: Google BigQuery

One of the tools you evaluated was "Google Analytics: I'll throw this out there agin, because it would be nice to offload this to someone else. However, the main issue I think is volume of data (I'm not sure our volume is allowed under the current paid account), and the lack of api key information. Since we can't send user information to Google Analytics, we'd have no way to associate the data with api keys/users. Maybe this would still be interesting for aggregate counts, but in interviewing the NREL users this week, they were all interested in identifying the specific API keys that were users."

Comparing BigQuery to that review of GA:

You would "offload this to someone else".
"the main issue I think is volume of data" - not with BigQuery, it will fit all.
"we can't send user information to GA" - not a problem with BigQuery. You can store whatever you choose to store.

From the previous comment on this thread:

"reprocess all the historical data in Kylin with a different cube design" - no need for cube design with BigQuery. Everything runs as a fast full table (columns really) scan.

With BigQuery you can have private data, shared data, and even public data - cost of queries are distributed to the people querying, and everyone gets a free monthly terabyte.

For example, here I'm querying GitHub's events timeline - and you can too:

https://medium.com/@hoffa/the-top-github-projects-per-country-92c275e19409

+1 for BigQuery

Also I don't see it listed here so have you tried http://www.memsql.com/ as an on-premise solution? Proprietary/close-source but they have a free usage community edition. Distributed relational database with mysql compatibility, built-in columnstore tables + in-memory rowstore tables. Should handle all of these analytics queries with ease.

We store a few hundred million rows per day and have no trouble with high-cardinality queries with several dimensions joining fact tables.

How could it come that https://clickhouse.yandex/ didn't appear yet in this thread? Probably because it was open sourced in June 2016? According to different benchmarks it shows quite cool performance.

Closing as stale, on the assumption that the need will resurface if it still exists.

18F / api.data.gov