colbywhite / why82

http://why82.com/

Start saving stats during the daily updates. #56

Closed colbywhite closed 8 years ago

colbywhite commented 8 years ago

Stats like pace of play (#48) and ediff (#46) are going to require a change in how I'm calculating stats. For the overall record, I have a heavy-duty SQL call involving views to calculate the overall record based on what's in the DB. The L10 record is done via a fairly light SQL call to grab the last 10 games, followed by an in-memory calculation. Neither is ideal.
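For reference, the in-memory half of the L10 calculation can be sketched roughly as below. This is a minimal sketch, not the project's actual code: the `Game` struct, its field names, and `last_n_record` are hypothetical stand-ins for whatever rows the SQL call returns.

```ruby
# Hypothetical in-memory game result; the real app would load these rows via SQL.
Game = Struct.new(:date, :team_score, :opponent_score) do
  def win?
    team_score > opponent_score
  end
end

# Returns [wins, losses] over the most recent `n` games in `games`.
# If fewer than `n` games exist, all of them are counted.
def last_n_record(games, n = 10)
  recent = games.sort_by(&:date).last(n)
  wins = recent.count(&:win?)
  [wins, recent.size - wins]
end
```

Calling `last_n_record(games)` on one team's games returns that team's record as a two-element array, which is cheap for one team but has to be repeated on every page load unless the result is stored.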

The better way is to have a table (or multiple tables) that holds that stuff. When my daily job goes out to update the scores, it can also update certain stats and save them. Then the metric logic can use the saved values instead of calculating them dynamically. Calculating everything dynamically is a performance issue that probably won't scale well as metrics are added.

For a lot of metrics, this won't necessarily be a performance issue, but an impossibility. It'll be the only way to calculate what I need.

This also may help solve another issue. Right now, if you look at a game in the past, you will grade it based upon the tier information of the present. Instead, you'll want to grade that game based on the tiers of that day. Based on how the stats table(s) is (are) implemented, it might be a way of solving that issue. But that's a secondary concern.
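One way to picture the "tiers of that day" idea is a per-team, per-day snapshot that can be looked up as of a game's date. The sketch below is only an illustration under assumptions: `StatSnapshot`, `SnapshotStore`, and the `tier` field are hypothetical names, and an in-memory hash stands in for the real table.

```ruby
require "date"

# Hypothetical per-team, per-day row; field names are assumptions, not the app's schema.
StatSnapshot = Struct.new(:team, :date, :tier)

# In-memory stand-in for a stats table keyed by team and date.
class SnapshotStore
  def initialize
    @rows = Hash.new { |hash, team| hash[team] = [] }
  end

  def save(snapshot)
    @rows[snapshot.team] << snapshot
  end

  # Latest snapshot for `team` dated on or before `date`, or nil if none exists.
  def as_of(team, date)
    @rows[team].select { |row| row.date <= date }.max_by(&:date)
  end
end
```

Grading a past game would then call `as_of` with the game's date for each team, rather than reading the current tiers.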

colbywhite commented 8 years ago

So here's where I'm at after some research on this.

I think my algorithm would have to be something like the following:

SeasonUpdater.update do
    ActiveRecord::Base.transaction do
        if GameUpdater.update
            MetricUpdater.update
        end
    end
end

MetricUpdater.update do
    days_to_update = todays_date - date_last_updated
    days_to_update.each do |date|
        data = {}
        data_source.each do |source|
            data[source.name] = pull_data_from_source source, date
        end

        entries = {}
        season.teams.each do |team|
            # for every team, create an empty entry for the date
            # TODO: use the preexisting entry if there is one
            entries[team] = season.stats_class.new team: team, date: date
        end

        metrics.each do |metric|
            # each metric updates the entries hash based on the data pulled for that day
            metric.update_entries entries, data
        end

        entries.values.each do |entry|
            # each team's entry is saved
            entry.save
        end
    end
end 

There is an over-arching MetricUpdater that wraps the common logic and then delegates the logic that actually inserts the values into the row to the individual metrics. That way each metric can know how to grab the data they need. The data sources are split out a bit from the metric since the bulk of them would likely use the same handful of sources. So separating them makes it so each metric doesn't unnecessarily hit the same data source each time.
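The split between shared data sources and per-metric logic can be sketched in plain Ruby as follows. This is a rough illustration, not the real implementation: `StaticSource`, `WinsMetric`, and `build_entries` are hypothetical names, plain hashes stand in for the stats rows, and nothing is saved to a DB.

```ruby
require "date"

# Hypothetical stand-ins: a source exposes `name` and `fetch(date)`; a metric
# exposes `update_entries(entries, data)`. None of these are the app's real classes.
StaticSource = Struct.new(:name, :rows_by_date) do
  def fetch(date)
    rows_by_date.fetch(date, {})
  end
end

class WinsMetric
  # Copies each team's win total out of the shared "standings" data.
  def update_entries(entries, data)
    entries.each do |team, entry|
      entry[:wins] = data.fetch("standings", {})[team]
    end
  end
end

# Pulls every source once per day, then lets each metric fill in the entries.
def build_entries(date, teams, sources, metrics)
  data = sources.map { |source| [source.name, source.fetch(date)] }.to_h
  entries = teams.map { |team| [team, { team: team, date: date }] }.to_h
  metrics.each { |metric| metric.update_entries(entries, data) }
  entries
end
```

Because `data` is built once per day, adding a second metric that reads the same source costs no extra fetch, which is the point of separating sources from metrics.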

One red flag is that this seems pretty complicated. Perhaps more complicated than it should be.

But before I could even get into the complications of it, I realized a flaw in using bballref for this. The line pull_data_from_source source, date isn't possible for bballref. They don't have day-by-day stats. For instance, the standings page is always the latest standings. I can't rewind the table to two days ago and see what it looked like on that day. At least not in a manner that wouldn't make the algorithm even more complicated.

But, to be clear, the date aspect is not 100% a part of this ticket. But at some point I'm going to want it. I'm going to want to look at a past game and see what its grade was. I can't do that currently because I will be grading that game based upon the latest tiers. Even more importantly, at the end of the season, I can't go back and answer "How many A games were there?". I'd be looking at the season's games based upon the tiers that the season ended with. So, at some point, I'm going to want/need that feature. I thought I'd be able to do that while I do this ticket. Not doing it with this ticket makes me think I'd be leaving myself a lot of rework later down the line.

That being said, I began looking at this date issue. Which brought me back to googling for APIs. (Again.) I ended up back on the NBA's stat page. I've poked around in that API during one of my previous searches and it is not 100% intuitive and not officially documented anywhere. The only doc out there is the skeleton doc created by the nba.py guys in their attempts to reverse engineer that API. It's a good attempt, but it's pretty poor. It also seems in the same shape that it was in when I first saw it months ago. No one has updated it since Sept. 2015. So ... yea. Nothing new there.

But in playing with the NBA stats page, I came at it from a different angle. This time, I just played with the UI for a bit. It's pretty robust. And when you pair the UI with the Chrome debugger, I actually was able to reverse engineer a couple of things from it. That might actually be doable if you know exactly what data you want from the API. First get to it in the UI, then watch the call in the debugger. I don't think that worked too well for me the first time because I was using the nba.com homepage to look for game schedules. But stats.nba.com is a better place to start reverse engineering that API. That's probably where the nba.py guys started.

So where does this leave me? Well, I don't know. I mean, there's an API that maybe I can use, but it won't be easy. It will actually be harder to reverse engineer that undocumented, unintuitive API than to just screen scrape BBallRef. But of course, that leaves me without the date thing. Also, the NBA API has a lot more data in it, if you can get to it. There's also the nagging thought that none of this has anything to do with what this ticket started out to be. This may be classic feature creep on my part. But if I need to switch data sources in the near future in order to get a feature I know I'm going to want/need, then maybe I should look into it.

colbywhite commented 8 years ago

From the terms of use:

By using such NBA Statistics, you agree that: (1) any use, display or publication of the NBA Statistics shall include a prominent attribution to NBA.com in connection with such use, display or publication; (2) the NBA Statistics may only be used, displayed or published for legitimate news reporting or private, non-commercial purposes; (3) the NBA Statistics may not be used in connection with any sponsorship or commercial identification; (4) the NBA Statistics may not be used or referred to in connection with any gambling activity (including legal gambling activity); (5) the NBA Statistics may not be used in connection with any fantasy game or other commercial product or service; (6) the NBA Statistics may not be used in connection with any product or service that presents a live, near-live or other real-time or archived play-by-play account or depiction of any NBA game; and (7) the NBA Statistics may not be used in connection with any web site, product or service that features a database (in any medium or format) of comprehensive, regularly updated statistics from NBA, WNBA or D-League games, competitions or events without the Operator's express prior consent.

Point 7 might take me out right off the bat. When I was going through the above brainstorm, I did wonder why I'm even using a DB if all the data is available in an API that doesn't have rate limits. Most people would call that heresy. :smiling_imp:

EDIT: After re-reading, I think the term comprehensive keeps me in the clear. No way am I even attempting to be comprehensive. Just a handful of stats.

colbywhite commented 8 years ago

Another note: I don't think switching data sources changes that algorithm. And actually, I should probably stop saying "switching" data sources. No reason I can't keep hitting BBallRef in order to get game scores and use the NBA API for the metrics. It's awkward. But, in theory, I'd rip out the BBallRef once I figure out how to get the game schedule out of the NBA API.

colbywhite commented 8 years ago

I should have commented on this earlier, but when I added #45, I did it by doing some calculations in code. Once deployed, the page takes longer to load because of it. It's not earth-shattering, but I noticed it. So the performance issue is already alive, although in a small dose.

colbywhite commented 8 years ago

After researching for #59, using the NBA API, as difficult as it is to use, is probably the right choice for this. Getting a stat as it was on any particular day in the season is a first-class citizen with the API. That can't be said for BBallRef.

As for the data sources, the leagueteamstats is probably the thing I need/want. Between the traditional stats and the advanced, that'll satisfy this milestone.

For this milestone I can get three of the four stats I want via a single call to the traditional stats. For the record in the last 10 games, I'll need a separate call with a different parameter. So I'll either need the concept of multiple datasources, or just have each metric get the data itself and eat the cost of repeating calls. For this milestone, I'll be hitting it four times instead of two. That sounds doable for now. On second thought, this doesn't matter a whole lot because that above algorithm is happening on an async worker box, not a web box responding to a UI call. So no one will see the extra latency of four calls compared to two. So that above algorithm should just nuke the concept of multiple data sources. It won't affect anything. (:relieved: that should simplify it a bit.)

OK. So let's update the algorithm a bit:

SeasonUpdater.update do
    ActiveRecord::Base.transaction do
        if GameUpdater.update
            MetricUpdater.update
        end
    end
end

MetricUpdater.update do
    days_to_update = todays_date - date_last_updated
    days_to_update.each do |date|
        season.teams.each do |team|
            # for every team, get the day's entry or create one
            stat_entry = season.stats_class.find_or_create_by team: team, date: date

            # each metric pulls its third-party data for that day (and caches it so it's only loaded once)
            # then it updates the team's corresponding stat entry
            metrics.each do |metric|
                metric.update_entry date, stat_entry
            end

            # Now that each metric has updated the entry, the entry should have all the data we want in it
            # so let's save it
            stat_entry.save
        end
    end
end

Removing the multi-datasource concept and adding some data caching for each metric makes the algorithm simpler.
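The per-metric caching could look something like the sketch below. It is a minimal illustration under assumptions: `CachedMetric`, `PaceMetric`, and the injected `fetcher` lambda are hypothetical, and the lambda stands in for the real third-party call, whose shape isn't settled in this thread.

```ruby
require "date"

# Hypothetical metric base class: subclasses implement `fetch(date)` and `apply`.
class CachedMetric
  def initialize
    @cache = {}
  end

  # Loads the day's data on first use and reuses it for every later team.
  def data_for(date)
    @cache.fetch(date) { @cache[date] = fetch(date) }
  end

  def update_entry(date, entry)
    apply(entry, data_for(date))
  end
end

# Example metric backed by a lambda standing in for the third-party call.
class PaceMetric < CachedMetric
  attr_reader :calls

  def initialize(fetcher)
    super()
    @fetcher = fetcher
    @calls = 0
  end

  # Counts calls so the caching behaviour is easy to observe.
  def fetch(date)
    @calls += 1
    @fetcher.call(date)
  end

  def apply(entry, data)
    entry[:pace] = data[entry[:team]]
  end
end
```

With this shape, looping over every team for a given date triggers one fetch per metric per day, so the inner `metrics.each` loop stays cheap even though it runs once per team.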

Feeling better about this.

colbywhite commented 8 years ago

:sigh: The LastNGames part of the API, which should be giving me the stats over the last N games, doesn't really work as expected. When you combine it with a date, which I would expect to give me the stats over the last N games at a specific point in time, the stats get all funky. It seems to act more like a last N days type of param, though even that seems off by a few days. Annoying.

So I'll have to rethink how I do getting the L10 record. Hopefully I can find it somewhere else in the API.

colbywhite commented 8 years ago

pivoting and moving toward a serverless model.