coralproject / pillar

Deprecated: Service layer for the Coral ecosystem

Make stats aggregations real time #88

Closed jde closed 7 years ago

jde commented 8 years ago

Currently stats are aggregated via a batch job.

Implement an event-based architecture that will keep stats up to date in as close to real time as possible.

alexbyrnes commented 8 years ago

This connects to #91. As long as we're moving stats we might as well look at more real-time solutions. I think this comes down to 1) only updating modified users, and 2) switching to a streaming approach so that only the current comment is needed to update the average/max/min/count, etc.

Number 1 is easier and may make up the difference we need for real time. I'm defining real time as "done processing before the next import," so the target is processing the expected number of users per import within the current import window, which is 20 seconds. The number of aggregations is around 36, or 42 with a readability metric.

Atoll is very fast for a small number of comments, but pyspark takes over at around 6,000. I'm also submitting a pyspark.sql version, which is a lot less code. There's more flexibility in doing it MapReduce-style, so if there's an optimization that pyspark.sql isn't finding we should do it that way. Currently that's not the case; the SQL version is faster.
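
For concreteness, here's a minimal sketch of what a pyspark.sql-style per-user aggregation could look like. The column names (user_id, word_count) and the input path are placeholders rather than the actual Coral schema, and this is not the submitted version, just an illustration of the approach.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("comment-stats").getOrCreate()

# One JSON object per line, one record per comment (illustrative path).
comments = spark.read.json("comments.json")

per_user = (
    comments
    .groupBy("user_id")
    .agg(
        F.count("*").alias("comment_count"),
        F.avg("word_count").alias("avg_word_count"),
        F.min("word_count").alias("min_word_count"),
        F.max("word_count").alias("max_word_count"),
    )
)

per_user.show()
```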

Regardless, we can switch back and forth depending on the job. Aggregations are right in the SQL wheelhouse.

jde commented 8 years ago

The pyspark.sql solution looks promising.

I think there are two 'phases' here:

This will minimize the amount of processing per import and allow us to continue to refine aggregations in the background.

alexbyrnes commented 8 years ago

Sure, since the reading levels aren't going to change, we can do it in two phases. Updated or new comments would get the per-comment metrics, and then we can aggregate them per user, per user/section, or whatever the stat needs are.

jde commented 7 years ago

What is the best way to incorporate this work into our products? Should Pillar hit Atoll, or should Atoll access the db directly?

alexbyrnes commented 7 years ago

That's a good question. This is how I think it breaks down for the non-streaming version.

Streaming process

Add a small batch of new comments to the current aggregation based on a stored "denominator." For a mean it's literally the previous denominator; for a count it's the current count. For median/mode it would get more complicated. Luckily we mostly have counts and averages of more complex per-comment metrics.

meanStreaming(current_mean, old_denominator, new_metrics_sum, number_of_new_metrics) = ((current_mean * old_denominator) + new_metrics_sum)/(old_denominator + number_of_new_metrics)
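
A direct Python translation of that formula, as a sketch (the names mirror the formula; the "denominator" is simply the count of values already folded into the current mean):

```python
def mean_streaming(current_mean, old_denominator, new_metrics_sum, number_of_new_metrics):
    # Fold a batch of new per-comment metrics into an existing mean without
    # rereading the old comments.
    return ((current_mean * old_denominator) + new_metrics_sum) / (
        old_denominator + number_of_new_metrics
    )

# Example: the mean of [2, 4] is 3.0; folding in [5, 9] (sum 14, count 2)
# gives the mean of [2, 4, 5, 9] = 5.0.
assert mean_streaming(3.0, 2, 14, 2) == 5.0
```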

Non-streaming process

Re-aggregate all comments for a set of users with new/edited comments within a timeframe. Collect distinct users over 5 minutes and re-aggregate all their comments. This is inefficient because you may not need all the information from all comments. In other, more complex aggregations, you may need all or most of the information.

This has the advantage of always working, and we don't need to keep track of different denominators for each aggregation function.

For now I think the non-streaming version works fine. It's in a very scalable format so that could continue for a long time. It is also quite a bit less code.
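
A hedged sketch of that non-streaming flow with pymongo, assuming illustrative collection and field names (comments, user_statistics, user_id, updated_at, word_count) rather than the real Pillar schema:

```python
import datetime

from pymongo import MongoClient

db = MongoClient()["coral"]
window_start = datetime.datetime.utcnow() - datetime.timedelta(minutes=5)

# 1. Distinct users with new/edited comments inside the 5-minute window.
dirty_users = db.comments.distinct("user_id", {"updated_at": {"$gte": window_start}})

# 2. Re-aggregate every comment those users have ever written.
pipeline = [
    {"$match": {"user_id": {"$in": dirty_users}}},
    {"$group": {
        "_id": "$user_id",
        "comment_count": {"$sum": 1},
        "avg_word_count": {"$avg": "$word_count"},
    }},
]
for row in db.comments.aggregate(pipeline):
    db.user_statistics.update_one(
        {"user_id": row["_id"]},
        {"$set": {"comment_count": row["comment_count"],
                  "avg_word_count": row["avg_word_count"]}},
        upsert=True,
    )
```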

On integration, the choice is pretty flexible. I would recommend keeping the mongo piece separate so the aggregation engine is persisting a metric name and value to a json path instead of connecting directly (as long as this stays performant). For now that could be as easy as a library or small service that could eventually validate the entries or perform non-aggregation functions. @gabelula and I talked briefly about having a central gateway where all data, even aggregations, enter.

In my experience running the aggregation engine, it matters a lot how many metrics you insert at a time and whether or not you use bulk updates. It is vastly faster to update one set of aggregations at a time (as many updates as a user has assets, sections, statuses, etc.) than one field at a time, and it should be faster still to update the whole statistics field for a user at once. Crawling the whole tree for one user to validate it, or to validate it against what the aggregation engine intended, would be the fastest option, and the most complex for the gateway app.

(The gateway could also accept individual metrics since it doesn't need to expand storage associated with a document like mongo would, validate individually, and update mongo in batches with bulk update.)
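
To illustrate the bulk-update point, here's a sketch using pymongo's bulk_write; collection and field names are again placeholders. Writing the whole statistics subdocument per user in one batched call means one round trip per batch rather than one per metric field:

```python
from pymongo import MongoClient, UpdateOne

db = MongoClient()["coral"]

def flush_stats(aggregated_stats):
    # aggregated_stats: {user_id: {"comment_count": ..., "avg_word_count": ...}}
    ops = [
        UpdateOne({"user_id": user_id},
                  {"$set": {"statistics": stats}},  # whole subdocument at once
                  upsert=True)
        for user_id, stats in aggregated_stats.items()
    ]
    if ops:
        db.user_statistics.bulk_write(ops, ordered=False)
```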

This only applies to pre-aggregated stats. For ad hoc stats, or custom one off stats, a process could be kicked off for the data in question and it could run in mongo or use the aggregation engine a la carte.

jde commented 7 years ago

Streaming vs. batch

The streaming process sounds like what we currently have in the Go stack. The model there is that it streams values into aggregators one at a time. Currently it is set up to run in batch, but it saves the aggregators and loads them before processing the next batch.

We came to the same conclusion as you did: this process is so fast that the complexity of streaming isn't necessary, even for many millions of records.

Gateway

I'm not quite following the idea of a gateway for all data. Let's take that offline.

Number of Metrics

One of the advantages of the new architecture we're demoing is that we will be able to define aggregations in configuration based on graph concepts. This will break us out of the current paradigm, where our algorithms are tightly coupled to a static schema.

Earlier in the project I did a test that extracted tens of millions of data points from nytimes data. Writing that data and querying against it performed remarkably well with Mongo. Looking forward to pushing that limit further.

alexbyrnes commented 7 years ago

I'm completely with you that we shouldn't have stats depend on a static schema, or at least the schema should only need to be specified in one place and be easy to change.

We've tried to nail down the concept of Coral having a schema before: whether the schema is transformed by a service like sponge before it reaches Coral, or whether the schema of a particular source system (like Disqus) is absorbed into Coral intact. With respect to Trust, the concerns might be different because, as it is, Trust can work outside of another system with a different schema; the schema is absorbed by sponge. For Talk, or a Talk/Trust install, the situation might be different because Talk could use enough of the Trust schema to work without sponge.

ETL version

We specify a schema for the whole ecosystem and ETL any data from a foreign system into that schema. The choice of foreign system is flexible.

Flexible Coral Schema

The coral schema is static for one execution of the programs. Migrating the Coral data and changing the schema is easy. In this case a schema from a source system could be used within Coral without many changes, or the Coral schema could be changed over time without breaking current installs.

No schema necessary

Coral works without depending on any particular fields or database structures like specific mongo collections. The code runs regardless of what specific data is in the database.

My understanding is that Coral is currently in the ETL area with sponge. It does depend on specific collections like users, dimensions, tags, etc. There are some required fields, but it's generally flexible; for instance, for user statistics, you can specify a collection and paths in a config file. So it's between ETL and Flexible Coral Schema. The long-term goal is to get to either No Schema Necessary, or Flexible Coral Schema where the Coral design could change over time, users would be able to migrate if necessary, and new installs require minimal data munging to get started.

Graph database analytics

I would be interested in hearing more about this. I think a distinction between versions is in order here too.

There are two components to our aggregations (probably to all aggregations): the function applied to the data, and the type of aggregation; for instance, word count plus an aggregation of average to get average word count. Are you proposing that both get done in the graph database? Is the idea the types of analytics we would be able to do in a graph database, or that the graph database lets us specify aggregations in configuration?
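
To make the distinction concrete, here's a tiny illustrative sketch (names are made up) separating the per-comment metric function from the aggregation applied over it:

```python
from statistics import mean

def word_count(comment_body):
    # Per-comment metric: number of words in the comment body.
    return len(comment_body.split())

def aggregate(metric, aggregation, comments):
    # Apply a per-comment metric, then roll it up with an aggregation.
    return aggregation(metric(c) for c in comments)

comments = ["first comment here", "a slightly longer second comment"]
avg_word_count = aggregate(word_count, mean, comments)  # average word count
max_word_count = aggregate(word_count, max, comments)   # max word count
```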

jde commented 7 years ago

+1 on the Flexible Coral Schema

My hope for this platform prototype is that we can describe each type in configuration (see this early draft: https://github.com/coralproject/xenia/blob/item-apis/internal/item/ifix/types.json).

I hope that this 'universal configuration' can drive the whole ecosystem from one place:

So it becomes trivial to update what's happening. Stats would be driven by this as well.
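
Purely as a hypothetical illustration (this is not the actual types.json format linked above), a stats entry in such a configuration might look something like:

```python
# Invented field names, for illustration only: a declaratively defined stat
# that a configuration-driven aggregation engine could execute.
AVERAGE_WORD_COUNT_STAT = {
    "item_type": "comment",
    "metric": "word_count",        # per-item function to compute
    "aggregation": "avg",          # how to roll the metric up
    "group_by": ["user_id"],       # dimension(s) to aggregate over
    "target": "statistics.avg_word_count",  # JSON path to persist to
}
```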

alexbyrnes commented 7 years ago

Universal configuration vs. data gateway: they might be the same thing, but one is the config and the other is the thing that reads/uses the config.

Do you have something similar to my stats replacement in mind, or are you saying it's pending the PoC?

jde commented 7 years ago

Let's take the conversation about the "stats replacement" offline. I wasn't aware that you were working toward replacing stats (which is already fully developed and running in production). We are approaching the full backend prototype (sponge -> shelf -> xenia), at which point we can address the best strategy for the next generation of stats.

alexbyrnes commented 7 years ago

Stats is running on what we call the staging server, which is in production; see above for details. There are still improvements to be made to it according to this and other tickets, which is what I mean by "replacement." Work is cataloged in coralproject/stats, this ticket, and #91. This is from the grab-open-tickets-on-pillar era, which I think is being replaced by the pillar-is-deprecated era.

statsd and rabbitmq are also currently part of Pillar, but we've discussed replacements for those, as well as replacement/refactoring of the Pillar implementation (#182). Are there reasons to keep the current stats implementation?

jde commented 7 years ago

Still not clear on the way you're describing the environments. I'd consider the server a staging environment, but I'll leave that to your discretion.

There are improvements that need to happen with Pillar before Trust can be called complete. These need to be our top priority.

I would say there's no reason to replace stats. It is proven in production, lightning fast, and quick to update. The architecture is an upgrade away from real-time streaming (but given its ability to do millions of records in minutes, that's not a high priority).

alexbyrnes commented 7 years ago

I was under the impression that stats should be decoupled from Pillar and implemented with an analytics-specific tool like Atoll or Spark. Nice to hear good things about our work anytime, though.

What part of Pillar is going to be preserved? As I mentioned, statsd, rabbitmq, stats, and wake are very important parts of our Trust system. The data will also need to be migrated if the backend and/or DBMS changes. We're starting to have a lot of data.

Not looking for anything definitive. We will hopefully be able to make changes to our systems to keep up with coral. A statement of the current plan would be extremely helpful though.

jde commented 7 years ago

Stats is decoupled: https://github.com/coralproject/pillar/tree/master/cmd/stats. It can be built as a standalone binary and used via the slick CLI interface @gabelula set up.

The plan is the same one we've been talking about in meetings and channels:

If we do have to migrate data, the sponge cli is a great tool to do so.

alexbyrnes commented 7 years ago

OK, I think this might need to be pinned for later. I appreciate the reply. I'm on board for looking at new concepts in sponge -> shelf -> xenia. It's still a little unclear, but understandably this is a time when big decisions are being made.