@cmc333333: Just to follow up on our conversation, Druid was the other interesting database I had stumbled upon in this space.
We've decided to take a look at this, since analytics performance does suffer with more traffic and is probably the biggest pain point in the admin.
I've spent part of this week exploring some options and ideas in this space. My thoughts on the best plan have changed pretty much every day, so unfortunately they might be a bit jumbled, but I want to at least get some of these ideas out there for feedback.
I described above how we're currently logging things with ElasticSearch, but here are a few more technical details about our current setup:
It would also help to give a brief overview of the various things we're currently doing with analytics. In some ways, I think we have a variety of competing requirements which is what makes all this a bit tricky to tackle. I'm certainly open to simplifying or removing functionality if it turns out nobody is using it.
I also lucked out this week: there was a small flurry of analytics activity from NREL folks needing to pull various things, so I interviewed a few NREL people about whether they were able to easily get the analytics they wanted out of the current system and what issues they ran into. It was a small sample size of users, but I think it at least helped demonstrate some concrete use-cases from average admins.
So here's some use-cases/requirements/what we're currently doing:
One concrete request was to see how often a particular system_id query string parameter was being used (basically, figuring out which systems were most frequently being called via the API). This is just one data point, but it does demonstrate a potential use-case. Another use-case involved ad-hoc queries along the lines of the system_id example above.

So how do we make things better? Here are some key areas I was looking into:
Here's a very quick rundown of various technologies I've looked into this week. Basically, there's a crazy amount of options in this space, all with subtly different characteristics and pros/cons. It also seems like this landscape is pretty quickly evolving, and some of these options are really new. I should heavily caveat my summaries with the fact that I haven't had a chance to benchmark many of these with our real workloads, so a lot of these are my (probably uninformed) thoughts just based on reading various things. In any case:
Well, I have some other summary thoughts, and I meant to re-read this, but in an attempt to get this out before our meeting today, I'll just toss it up in its current state (which is probably littered with typos/incoherent babbling/trailing thoughts...).
@cmc333333, @gbinal: I'm waiting for some super-long running data processing jobs to finish (they've been running all day), so unfortunately I don't quite have a full status update on this analytics front today. However, I think I have a decent idea on a couple approaches to tackling this analytics stuff that I'll outline in more detail this week once I get some of those final test results.
But for some quick updates and summary:
I did get our postgres DBA to give us a sanity check on the raw-Postgres approach. He tried a few creative indexing approaches that may help, but in the end his discoveries largely mirrored what I had found: unless we can get the queries to always hit specific indexes, counts are going to be exceptionally slow. So unless we shift away from allowing ad-hoc queries, it doesn't seem like this is going to be a great fit.
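To make the pain point concrete, here's a hedged sketch of the kind of ad-hoc filtered count that was slow in those tests (the logs table and column names here are illustrative, just mirroring the fields used elsewhere in this issue):

-- Illustrative only: an ad-hoc count like this has to scan a huge share of the
-- raw table whenever the filters don't line up with an existing index.
SELECT COUNT(*) AS total_hits,
       COUNT(DISTINCT user_id) AS distinct_users
FROM logs
WHERE request_at >= '2015-01-01'
  AND request_host = 'developer.nrel.gov'
  AND response_status >= 500;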
He did point me at PipelineDB as another possibility, and while we don't need another option in the mix, PipelineDB actually offers some pretty interesting capabilities. Its continuous views are extremely similar to what InfluxDB offers, but PipelineDB is built on top of Postgres, which makes me more comfortable (versus the brand new storage engine InfluxDB is working on). This also allows for some interesting possibilities, like combining it with the cstore_fdw plugin for columnar storage of the raw data in Postgres (which I experimented with earlier and does seem more suitable for our workload).
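For reference, wiring up cstore_fdw for the raw data would look roughly like the following (a sketch only; I haven't settled on a table layout, and the column list here is just a subset of what we log):

-- Rough sketch: store the raw request records in a compressed columnar
-- foreign table via cstore_fdw, while PipelineDB's continuous views handle
-- the pre-aggregated rollups.
CREATE EXTENSION cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

CREATE FOREIGN TABLE logs_raw (
  request_at timestamptz,
  request_method varchar(10),
  request_host varchar(255),
  request_path text,
  user_id varchar(36),
  request_ip inet,
  response_status smallint,
  timer_response numeric
)
SERVER cstore_server
OPTIONS (compression 'pglz');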
But the main thing I've been looking at is Kylin, which seems like it might be best suited for our combination of known queries as well as ad-hoc queries. After more testing, it seems like it can handle all the types of queries we currently want to perform in an efficient manner using its pre-processed data cubes (which are easy to generate). The current test I'm running involves more data segmented on lots of dimensions, which is taking a very long time to pre-process. That makes me a little concerned about trying to handle real-time-ish data, but this is also sort of a worst-case scenario test, since the dimensions aren't configured in a very optimized fashion, and we may not actually need all these dimensions pre-computed (Kylin may also be releasing functionality to explicitly deal with real-time data later this year). But once this test finishes, then I'll have a better idea of where things stand.
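One thing worth noting about Kylin is that the cubes are exposed through a plain SQL interface, so the admin queries would stay ordinary SQL along these lines (a hedged example; the fact table name is a placeholder, and AVG only works if the corresponding sum/count measures are defined in the cube):

-- Hypothetical query answered from Kylin's pre-computed cube rather than the
-- raw records: hits and average response time per host over a date range.
SELECT request_host,
       COUNT(*) AS total_hits,
       AVG(timer_response) AS avg_timer_response
FROM logs
WHERE request_at >= '2015-01-01' AND request_at < '2015-07-01'
GROUP BY request_host
ORDER BY total_hits DESC;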
My general sense is that using Kylin would allow for much simpler coding within the admin app, but at the cost of stack complexity (eg, more moving pieces and components that need to be running on the database server). On the flip-side, PipelineDB would probably be a smaller stack to set up, but it would involve more custom coding.
Whatever we do is going to involve more changes than simply picking database X, so my plan is to put together a more detailed outline of how exactly these options would be implemented (for example, how data gets ingested, how we partition the data, how we perform common queries, how we perform less common ad-hoc queries, how querying aggregate information works, how querying raw request records works, etc). I'd like to get those more detailed outlines together by this Thursday for both Kylin and PipelineDB. While I'm probably leaning towards Kylin at this point, I think having some other approach for comparison would be useful, and thinking through a couple of different approaches in detail would probably be valuable. But then hopefully we can chat about this on Friday and move forward with implementation.
Well, after more testing, I'm now waffling back and forth. But on the upside, I think we have a couple pretty decent options. There's really a multitude of options, but here's the quick summary of the couple of options I've focused on:
(and cc @david-harrison who helped look at postgres, mentioned pipelinedb, and generally bounced ideas around--thanks!)
So what follows will probably be an incomprehensible, messy brain-dump after exploring these options for a bit this past week, but I figured I should make note of some of this. I'm not sure I have a firm conclusion yet, but I can give you the tl;dr when we talk tomorrow on the phone, and then we can hopefully move from there with a decision.
Since I'm not sure we've clarified this anywhere before, here's a quick summary of the different analytics screens we currently have and what they would need if we want to keep things the same (but again, simplifying or changing our interface and capabilities is always on the table):
The drilldown view starts by showing the top-level URL paths (eg, /foo and /bar) and the number of hits. The charts show the same over time. Drill in again, and you see second-level URL paths for that parent path (eg, /foo/moo and /foo/boo). Repeat to whatever depth you want.

Since the frontend only interacts with the analytics via our REST APIs, the basic idea would be to update those APIs in a backwards compatible way by replacing the ElasticSearch queries with whatever new queries we have. I think this should be relatively straightforward for any queries that use the form interface for building queries, since we get the user's various filters as structured JSON conditions.
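As a rough illustration of what one of these translated queries might look like (hypothetical table and column names), drilling one level below /foo for a given month could become something like:

-- Hypothetical translation of a drilldown request into SQL: hit counts for the
-- second-level paths underneath /foo. split_part(path, '/', 3) picks out the
-- second path segment (index 1 is the empty string before the leading slash).
SELECT split_part(request_path, '/', 3) AS second_level_path,
       COUNT(*) AS total_hits
FROM logs
WHERE request_path LIKE '/foo/%'
  AND request_at >= '2015-10-01'
  AND request_at < '2015-11-01'
GROUP BY second_level_path
ORDER BY total_hits DESC;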
At some point, I think it would make sense to begin transitioning some of the admin APIs over to Lua to simplify our new stack, but for this iteration, I think we just keep all the APIs in the Ruby on Rails project to minimize change.
The main tricky part would be supporting the older "Advanced Queries" input, where we accepted raw Lucene queries. I don't think this is frequently used, so I'm not sure how far we want to go to support these queries in a backwards compatible way. But I know I have sent out some links that people have bookmarked using that query style (for lots of conditions, the form builder interface can be pretty cumbersome), and there are also certain types of queries you can't really achieve via the form builder (eg, range queries or queries on fields we still haven't exposed in the UI). In any case, if we do want to try to support the advanced queries, I found lucene_query_parser, which might be of use in transforming the Lucene queries into something we could translate into our SQL conditionals.
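To give a sense of what that translation would need to do (a purely hypothetical example; the field names match our log fields, but nothing here is built yet), an advanced query like response_status:[500 TO 599] AND request_host:developer.nrel.gov would need to turn into SQL conditions along the lines of:

-- Hypothetical output of translating the Lucene-style advanced query above
-- into a SQL WHERE clause for the new analytics store.
SELECT COUNT(*) AS total_hits
FROM logs
WHERE response_status BETWEEN 500 AND 599
  AND request_host = 'developer.nrel.gov';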
What we're trying to do is perform pre-aggregations on all the most common fields we need to display for our analytics results. In the following examples, I've tried to set up realistic scenarios for what specific fields I think we need to pre-aggregate on in order to answer the bulk of our common analytic queries. However, it's worth calling out cardinality and how that interacts with some of our required filters and relates to performance.
Basically, high cardinality fields are the nemesis of pre-aggregation. If a field has high cardinality and the values are all different, then aggregating on that field doesn't really achieve any savings (since each value is unique). In these setups, I've included the URL path (so, for example, on https://developer.nrel.gov/api/alt-fuel-stations/v1.json?api_key=DEMO_KEY&fuel_type=E85,ELEC&state=CA&limit=2, the URL path would just be /api/alt-fuel-stations/v1.json). We need to include this for a few reasons:
Admin permissions are scoped to URL prefixes (eg, api.data.gov/regulations/*), so we need some way of determining that filter for nearly any request for non-superuser admins.

However, this field probably causes us the most problems with high cardinality (and potential for abuse). User IDs/API keys and IPs are also up there, but the URL path has significantly more unique combinations than anything else in our data. The example above rolls up nicely (/api/alt-fuel-stations/v1.json), but where it begins to cause problems is when dynamic IDs are part of the path (which is pretty common in RESTful APIs), like /api/alt-fuel-stations/v1/72029.json, /api/alt-fuel-stations/v1/70030.json, and so on.
It is hard to optimize around this problem given our current admin permissions and the fact that we don't know much about the underlying APIs (so we don't know about dynamic pieces of the URL or what really makes up a logical API that admins want to query on). I don't think we're necessarily at the point where this will cause problems yet, but it's an issue to be aware of. I have some ideas on how to possibly improve this, but without a more fundamental shift in what type of analytics we want to provide or how we manage APIs, I think this will be a potential problem with any solution.
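One of those ideas, just to sketch it (hypothetical, with plenty of edge cases; regexp_replace here is plain Postgres), would be to collapse numeric ID segments at aggregation or ingest time so the dynamic paths roll up together:

-- Hypothetical rollup of dynamic ID segments so that paths like
-- /api/alt-fuel-stations/v1/72029.json and /api/alt-fuel-stations/v1/70030.json
-- aggregate under a single /api/alt-fuel-stations/v1/{id}.json bucket.
SELECT regexp_replace(request_path, '/[0-9]+(\.|/|$)', '/{id}\1', 'g') AS rolled_up_path,
       COUNT(*) AS total_hits
FROM logs
GROUP BY rolled_up_path
ORDER BY total_hits DESC;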
Anyway, onto the options! These are loosely how I think everything could work in our overall system with these two different databases. However, this is based on only some initial tests, so it's definitely possible some of my assumptions or understanding of things aren't quite right. But generally speaking, here's what I'm thinking might be possible:
For Kylin, the rough plan would be to partition the data by date (YYYY-MM-DD) and request host (eg, developer.nrel.gov).

For PipelineDB, the data would similarly be partitioned by date (YYYY-MM-DD) and request host (eg, developer.nrel.gov). I've largely mirrored the cube design from Kylin above, grouping by each dimension and including the same measures. It's definitely possible to structure things differently and with more separate views that might be smaller, but for now, I went with something similar to the cube design (more continuous views also slow down inserts, so you don't necessarily want to go overboard). I have been creating multiple views to group by each distinct time period (eg, date_trunc('hour', request_at), date_trunc('day', request_at), date_trunc('month', request_at)). Here's the hourly view as an example:
CREATE CONTINUOUS VIEW logs_hour AS
SELECT date_trunc('hour', request_at),
request_method,
request_scheme,
request_host,
request_path,
user_id,
request_ip_country,
request_ip_region,
request_ip_city,
request_ip,
response_status,
gatekeeper_denied_code,
log_imported,
COUNT(*) AS total_hits,
AVG(timer_response) AS avg_timer_response,
COUNT(DISTINCT user_id) AS distinct_user_id,
COUNT(DISTINCT request_ip) AS distinct_request_ip
FROM log_stream
GROUP BY date_trunc('hour', request_at),
request_method,
request_scheme,
request_host,
request_path,
user_id,
request_ip_country,
request_ip_region,
request_ip_city,
request_ip,
response_status,
gatekeeper_denied_code,
log_imported;
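And just to show the other half of the equation, here's a hedged example of how the admin APIs might then query that view for a typical chart (the time column comes out named "date_trunc" since I didn't alias it above; note the distinct_user_id counts can't simply be summed back up across rows, so true distinct counts would still need a different approach):

-- Hypothetical rollup query against the logs_hour continuous view above:
-- daily hits and average response time per host for October 2015.
SELECT date_trunc('day', "date_trunc") AS day,
       request_host,
       SUM(total_hits) AS total_hits,
       SUM(avg_timer_response * total_hits) / SUM(total_hits) AS avg_timer_response
FROM logs_hour
WHERE "date_trunc" >= '2015-10-01'
  AND "date_trunc" < '2015-11-01'
GROUP BY day, request_host
ORDER BY day, total_hits DESC;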
Since PipelineDB is just Postgres under the hood, standard tools like psql work, which is quite nice.

Anyway... If you've actually gotten this far (again), my apologies for yet another missive. I think we have two pretty good options, just with different trade-offs. I can give you the very quick recap when we chat tomorrow, and then hopefully we can make a decision and be on our way to much speedier analytics. Vroom, vroom! :rocket: :rocket:
You continue to be fantastic, @GUI. A few comments coming from a very limited understanding:
The main tricky part would be supporting the older "Advanced Queries" input, where we accepted raw Lucene queries. I don't think this is frequently used,
Is there an easy way to measure the usage here? If we're seeing the same bookmarked queries over and over, perhaps we should start by removing the option in the UI for any user who doesn't have it pre-filled (i.e. due to a bookmark) and see if anyone complains? I'd lean towards axing it.
we need some way of determining that filter for nearly any request for non-superuser admins.
I agree that we can't aggregate prefix match queries, but I wonder if we can limit their usage in terms of common workflows. For example, we use prefix matches for admin filtering (so it's inherently used in almost every query) -- what if, instead, we added a field to each log associating it with a particular agency? Are there other optimizations of this form, where we can cut off a huge chunk of the usage space? Would they be worth the effort?
Queries that must be answered by the raw data will need to query a UNION of all the distinct tables that apply to that query
This feels frightening to me given the number of combinations required for common queries. Consider just an admin view of the last month of data. That may well be an irrational fear; I've never encountered anything quite like this, and this is what the Hadoop solution is doing behind the scenes.
But you think this might be something made easier in a later release?
We could ship with something like sqlite or postgres by default, but then recommend Kylin if necessary (storage adapters do add complexity, though, and there will be differences among the different databases that aren't always fun to deal with)
Agreed. Despite the legitimate concern over complexity, I think this is a worthwhile approach.
All our eggs in this VC basket
I was sold until you brought this up. It really sways me against PipelineDB in (again) probably irrational ways. Database companies are a dime a dozen in recent years; for all of the ones that hang around, several more have failed. We don't remember the failures, though, because they never gained a user base. Maybe that's the question, then -- does PipelineDB have a user base outside of their "customers"?
Well, this is a belated followup on multiple fronts.
For anyone wondering, last month we decided to pursue the Kylin approach. A few factors contributed to the decision:
So where are we with Kylin? It's mostly implemented, and it does look like it will solve our performance issues in a scalable way. I'll be sure to post some benchmarks in this thread to compare before and after (but it's significant for queries over a longer time period). Here's some initial architecture documentation on how these pieces have come together.
Unfortunately, though, this also seems like my nemesis of a task, one that I perpetually think I'm nearly done with, only to uncover other details or issues. So as a bit of an update for this past week, @cmc333333 & @gbinal:
Oh, and I also stumbled into one other issue on Wednesday, which required starting the bulk migration over again. I uncovered certain queries that wouldn't work with the way the data had been structured in certain cases (see https://github.com/NREL/api-umbrella/pull/227/commits/425f546ed76e8324e779c1e81e04e25e019e2d02 for more details). It's certainly not ideal that I only discovered that type of issue this week, but fingers crossed that it was our last setback with the big bulk migration.
Okay, finally some real progress on deploying all this!
I've wrapped up the analytics-revamp branch and deployed the changes to all our servers.

So in terms of next steps, one remaining item involves the rsyslog -> kafka -> flume -> per minute HDFS files flow.

Well, a rather belated status update on this:
Shortly after the last update, we ended up hitting a bug in Kylin that was triggered whenever Kylin began to merge together daily segments. After 7 days' worth of individual data in Kylin, it would begin to merge the daily data together for better querying efficiency. However, something in our data triggered a bug in Kylin that caused the merge process to fail. The failed merge in turn couldn't be stopped, leading to various issues.
Not much progress was made on any of this during the month of May due to travel schedules and working on other bug fix tasks (mainly within the admin tool).
Fast forward to last week, and I finally dove back into this. The short of it is that by upgrading to the latest version of Kylin (1.5.2), these segment merge issues have gone away. I've also reworked some of our Kylin refresh logic to simplify our automatic refresh process, and fixed several issues/bugs with the earlier setup. I've done a lot of manual testing against segment merge issues, and things seem solid. The automatic refresh process has also been running quite smoothly for the past week. There's still a few small things to sort out, but things finally seem like they're to a more stable point for all the underlying pieces (knock on wood).
So while we have unfortunately hit some snags with this task partially due to the choice to use Kylin, that was somewhat of a known risk going in (given how new the technology is). But now that things seem to have reached a more stable point, Kylin does seem to be performing impressively and doing what we need. As a quick example, the analytics drilldown query currently takes ~13 seconds from ElasticSearch for 1 month of data, and the same query returns from Kylin in ~1 second. The differences become even more pronounced when querying more than 1 month of data.
So the primary remaining things on my radar:
Le sigh... So a quick update on this infernal task...
As of the last update, I was mainly tracking down some loose ends for dealing with the live data processing, and then needing to reprocess all the historical data in Kylin with a different cube design (to better optimize it for the types of queries we run).
Good news: I think all the loose ends related to live data processing look good. After letting things run for a week, we weren't seeing any discrepancies in the counts between the old and new system.
Bad news: I kicked off the historical data re-processing with an updated cube design, but hit some snags with that. The updated cube design basically restructures some of our dimensions and calculations to better match how we need to query the data. The updated design takes longer to process and takes more disk space, but it allows for the queries we need to commonly perform to be more efficient (the updated design is actually pretty similar to how we had originally designed things in Kylin 1.2, so I had just gotten a bit over-eager in trying to optimize things during the Kylin 1.5 upgrade).
In any case, the problem with the new cube design is that after processing all our historical data up to 2015, things go a bit bonkers in the final stage of writing the 2015 data and end up consuming all our disk space on the server. So the general cube design seems to work for previous years with less data, but something about the amount of 2015 data (or something specific within it) triggers an unexpected surge in disk use when generating the cubes with this design (we were able to process all the data in Kylin 1.5 with the previous, simpler cube design).
It takes about 4 days of background processing to get to the point of failure with the 2015 data, so after the first failure I tried giving the server more disk space to see if it just needed temporary space to write data, and then ran it again. Even with more disk space allocated, we ran into the same snag on the second run, so either it needs waaayy more temporary disk space than I'd ever anticipate, or some other bug is at play (we don't see similar disk usage surges when writing the previous years' data, so I'm inclined to think it's some other bug or oversight).
After the second failure a couple weeks ago, I haven't really had a chance to look at this in more detail, but I think I should be able to look at it again this week and hopefully get another attempt kicked off. Here's my general thoughts on debugging this further:
Hi Nick! Just saw this thread now, and I'm impressed on how many alternatives you evaluated. I would like to suggest a missing one: Google BigQuery
One of the tools you evaluated was "Google Analytics: I'll throw this out there again, because it would be nice to offload this to someone else. However, the main issue I think is volume of data (I'm not sure our volume is allowed under the current paid account), and the lack of API key information. Since we can't send user information to Google Analytics, we'd have no way to associate the data with API keys/users. Maybe this would still be interesting for aggregate counts, but in interviewing the NREL users this week, they were all interested in identifying the specific API keys/users behind the requests."
Comparing BigQuery to that review of GA:
From the previous comment on this thread:
With BigQuery you can have private data, shared data, and even public data. The cost of queries is distributed to the people querying, and everyone gets a free monthly terabyte.
For example, here I'm querying GitHub's events timeline - and you can too:
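Something along these lines (the public sample dataset and column names here are assumptions, just to give the flavor):

-- Illustrative BigQuery standard SQL over the public GitHub timeline sample
-- (dataset and column names are assumptions for the sake of example).
SELECT repository_language,
       COUNT(*) AS events
FROM `bigquery-public-data.samples.github_timeline`
GROUP BY repository_language
ORDER BY events DESC
LIMIT 10;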
+1 for BigQuery
Also, I don't see it listed here, so have you tried http://www.memsql.com/ as an on-premise solution? It's proprietary/closed-source, but they have a free community edition. It's a distributed relational database with MySQL compatibility, offering built-in columnstore tables plus in-memory rowstore tables. It should handle all of these analytics queries with ease.
We store a few hundred million rows per day and have no trouble with high-cardinality queries with several dimensions joining fact tables.
How is it that https://clickhouse.yandex/ hasn't appeared in this thread yet? Probably because it was only open sourced in June 2016? According to various benchmarks, it shows quite impressive performance.
Closing as stale, on the assumption that the need will resurface if it still exists.
Due to a rather significant spike in API hits, our analytics database has suddenly grown by quite a bit for this month. This is negatively impacting the performance of the analytics queries and graphs in the admin tool. I've increased our hardware sizes to get things generally working, but things can be quite poky the first time an admin performs queries until the caches spin up.
So one issue is that we perhaps just need to look into tuning our ElasticSearch queries or the ElasticSearch database itself. In particular, the "Filter Logs" view could possibly be optimized, since it's really performing several different requests in parallel (one date histogram for the chart over time, more aggregations for the "top X" IPs/users, and another for the most recent raw logs for the table). So that's probably worth a look, but I'm not too sure there's much different we can do with those queries if we want all that information on one page.
The more general issue is that I think we need to take a longer look at how we handle analytics data, since with our current approach this problem will only get worse with more usage, and I'd like to have more of a plan other than throwing more hardware at it when things break (or if more hardware is the plan, that's fine, but I'd at least like to have that planned a bit better in terms of when we need it, costs, etc). So a few random notes on that front: