FAForever / server

The servercode for the Forged Alliance Forever lobby
http://www.faforever.com
GNU General Public License v3.0

Issue #930 - Add new dimension to matchmaking search duration in prometheus #1010

Closed benjamin-lawson closed 5 months ago

benjamin-lawson commented 5 months ago

Makes #930 possible by adding the queue population dimension to the search time in prometheus.

benjamin-lawson commented 5 months ago

Closing this PR per the discussion in #930 about moving this data to the DB rather than Prometheus.

Brutus5000 commented 5 months ago

Prometheus seems the right place for analytical data with historical updating. The db should only contain the "current" state.

benjamin-lawson commented 5 months ago

> Prometheus seems the right place for analytical data with historical updating. The db should only contain the "current" state.

I think this could use more discussion then, as @BlackYps was saying that the DB would be the better location.

Brutus5000 commented 5 months ago

Depends on what is supposed to be done with it.

- Look at a graph to see the development and derive decisions from it? -> Prometheus
- Use the information at runtime to adapt the matchmaker dynamically? -> db

benjamin-lawson commented 5 months ago

> Depends on what is supposed to be done with it. Look at a graph to see the development and derive decision from it? -> Prometheus Use the information at runtime to adapt the matchmaker dynamically -> db

My thoughts were to put it into Prometheus to build a model. Once we have that model, we can implement the model predictions into the server code to provide a predicted wait time. I'm not super great with statistics, so I'm open to suggestions on tackling this a different way.

BlackYps commented 5 months ago

My idea was to store the data in the db to be able to retrieve up-to-date information. I'll just post my notes about the implementation here for you to read:

Implementation idea:

- Have deques with rating brackets and party sizes for each queue. Whenever a search finishes successfully, store a timestamp, the wait time, rating, newbie count, the average number of people in queue (average of the last ~10 iterations), and the total number of people online in the appropriate deque.
- Store the deques in the db once a day and load them on server startup so we don't lose the info on a restart.
- As a first step we could only collect the info and then get a db dump to analyze the data.
- For estimates: make a linear fit of all data in a deque, then look up where a certain party would end up on the linear fit to get the wait time.
- The data could be visualized in a rating vs. queue size plane with the wait time of data points coded as color.
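The event-recording part of the idea above can be sketched as follows. This is only an illustration, not server code; the record fields mirror the notes, and all names (`SearchEvent`, `record_search`, the deque cap, the queue name) are assumptions:

```python
import time
from collections import deque
from dataclasses import dataclass

# Hypothetical event record; fields mirror the notes in the comment above.
@dataclass
class SearchEvent:
    timestamp: float
    wait_time: float        # seconds the search took
    rating: float
    newbie_count: int
    avg_queue_size: float   # average over the last ~10 matchmaker iterations
    players_online: int

# One bounded deque per queue; oldest events fall off automatically.
MAX_EVENTS = 10_000  # assumed cap, not from the discussion
history: dict[str, deque] = {}

def record_search(queue_name: str, event: SearchEvent) -> None:
    """Store a successfully finished search in the queue's deque."""
    history.setdefault(queue_name, deque(maxlen=MAX_EVENTS)).append(event)

record_search("ladder1v1", SearchEvent(time.time(), 42.0, 1500.0, 0, 8.3, 900))
```

Persisting the deques once a day would then just mean serializing each deque's events into a table and reloading them on startup.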

benjamin-lawson commented 5 months ago

I think we'd be better off going for the estimation option rather than worrying too much about accuracy. We know that even with perfect historical data, a prediction is always going to be off to some degree. I'd think the linear regression (or some other regression model) would be much faster to implement and have decent accuracy.

BlackYps commented 5 months ago

Sorry, but I don't understand what exactly you are comparing and arguing against? I made only one suggestion for implementation, not two. Can you clarify what you mean?

benjamin-lawson commented 5 months ago

My bad. I think I misread your comment. I spent some time last night trying to figure out whether we can perform this kind of analysis with Prometheus. I'm not super well versed with Prometheus, but I was having a really hard time finding a query that works. I think @BlackYps may be right here that we should have a dequeue history table in the DB. Here is the structure of the table I had in mind. Any additional columns we could use in this table?

[image: proposed DB table structure]

benjamin-lawson commented 5 months ago

I took a stab at the database approach. I set up the database table as above and wrote a stored procedure that takes the number of players in queue and the queue ID and returns the wait time in seconds using linear regression. I've limited it to the most recent 1000 entries, since the query is not well optimized; we can fiddle with this number as needed. The image below shows the simulated data (randomly generated points) in red and what the server calculated in blue. I can put up a draft PR to demonstrate how this works. My main concern with this approach is load on the database. To mitigate this, we could set up a job that caches the result of this query for each queue and updates it at a set interval.

[image: simulated data points with fitted regression line]
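The stored procedure itself isn't shown in the thread, but the regression it describes amounts to an ordinary least-squares fit of wait time against queue population over the most recent entries. A plain-Python sketch of that calculation (function and parameter names are assumptions):

```python
def estimate_wait_time(samples, players_in_queue, limit=1000):
    """Least-squares fit of wait time vs. queue population.

    samples: list of (players_in_queue, wait_time_seconds) tuples,
    newest last. Only the last `limit` entries are used, mirroring
    the 1000-row cap mentioned above.
    """
    data = samples[-limit:]
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return slope * players_in_queue + intercept

# Toy data: more players in queue -> shorter wait. The fit is exact
# here because the data is perfectly linear.
samples = [(n, 120 - 10 * n) for n in range(1, 11)]
print(estimate_wait_time(samples, 4))  # 80.0
```

In the real server this fit would run in SQL (or be pulled into Python from the history table); either way, capping the row count bounds the cost of each refresh.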

BlackYps commented 5 months ago

About the columns: I don't think we need to reference the player. The only relevant info about the player is what the matchmaker uses for matching, and that is the player's rating and whether he is considered a newbie.

To make the prediction more accurate, I wonder if we should store the average queue activity around the time the match is found instead of the queue size at exactly the iteration the match happens. So basically an average of how many people are in queue over the last N iterations. We could also store the total number of people online at the time to be able to scale the expected wait time better.

I don't think it makes sense to store the events where people simply leave the queue. We can't really gather anything from that.

Lastly, maybe it is better to store the end time and the wait time instead of end time and start time, but I haven't fully thought the implications of this through.
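The "average over the last N iterations" suggested above is cheap to maintain with a bounded deque updated once per matchmaker iteration. A sketch, with the window size and all names being assumptions:

```python
from collections import deque

class QueueActivity:
    """Rolling average of queue size over the last N matchmaker iterations."""

    def __init__(self, window: int = 10):  # window of ~10 is assumed
        self._sizes = deque(maxlen=window)

    def on_iteration(self, players_in_queue: int) -> None:
        """Call once per matchmaker iteration with the current queue size."""
        self._sizes.append(players_in_queue)

    def average(self) -> float:
        return sum(self._sizes) / len(self._sizes) if self._sizes else 0.0

activity = QueueActivity()
for size in [6, 8, 10]:
    activity.on_iteration(size)
print(activity.average())  # 8.0
```

When a match is found, `activity.average()` would be the value written into the history row instead of the instantaneous queue size.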

BlackYps commented 5 months ago

Yes, I agree that we should cache the db query in some way so we don't constantly query the db.
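The caching both sides agree on here could be as simple as a per-queue result with a time-to-live, recomputed only when stale. A minimal sketch; the TTL, the cache shape, and the function names are all assumptions:

```python
import time

CACHE_TTL = 60.0  # seconds; the refresh interval is an assumption

# queue_id -> (expires_at, cached_estimate)
_cache: dict[int, tuple[float, float]] = {}

def cached_wait_estimate(queue_id, compute, now=None):
    """Return the cached estimate for a queue, recomputing after the TTL.

    `compute` stands in for the expensive call (e.g. the regression query).
    """
    now = time.monotonic() if now is None else now
    entry = _cache.get(queue_id)
    if entry is None or now >= entry[0]:
        value = compute(queue_id)
        _cache[queue_id] = (now + CACHE_TTL, value)
        return value
    return entry[1]

calls = []
estimate = lambda qid: calls.append(qid) or 42.0
cached_wait_estimate(1, estimate, now=0.0)   # computes and caches
cached_wait_estimate(1, estimate, now=10.0)  # served from cache
print(len(calls))  # 1
```

A background job refreshing each queue on a fixed interval, as suggested earlier in the thread, would work equally well and keeps the latency off the request path.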

benjamin-lawson commented 5 months ago

These are good suggestions that I want to respond to one at a time to make sure I don't miss anything.

> I don't think we need to reference the player. The only relevant info about the player is what the matchmaker uses for matching. And that is the player's rating and if he is considered a newbie.

I think this is spot on. I was thinking it may be helpful to know the identity of the player, but we really only care about the player's rating. I also thought about factoring the average rating of the queue into this prediction.

> To make the prediction more accurate I wonder if we should store the average queue activity around the time the match is found instead of the queue size at exactly the round the match happens. So basically an average of how many people are in queue over the last N iterations.

I am thinking we probably don't need this information. Really, this is what the linear regression is doing for us. I think storing the raw event rather than meta information is better, as we can still calculate the metadata from the raw events.

> We could also store the total number of people online at the time to be able to scale the expected wait time better. I don't think it makes sense to store the events where people simply leave the queue. We can't really gather anything from that.

Yeah, total online players could be a factor, but I would think this could introduce a lot of error. For instance, if we have some event going on where one queue is seeing a lot more traffic than the others, the calculation may confuse the larger online player count with shorter queue times, when in reality the wait times for the other queues would be pretty much normal.

> Lastly, maybe it is better to store the end time and the wait time instead of end time and start time, but I haven't fully thought the implications of this through.

This one I disagree with. Wait time is just the difference between end time and start time. This is the same as above: we can calculate the metadata if we need it, so it's better to store the raw data and compute from there.

> Yes, I agree that we should cache the db query in some way to not constantly query the db.

I think we can do a generalized estimation per queue (given any player, what will their wait time be?) and then do a more nuanced estimation once the player actually joins the queue, or something like that.