ADI-Labs / density

wireless density API

Exploratory Data Analysis (Time Series) #197

Open alanhdu opened 8 years ago

alanhdu commented 8 years ago

Use .csv dump and get sense of data.
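
A possible starting point, as a sketch only: the column names `dump_time`, `group_name`, and `client_count` are assumptions about the dump's layout, and the inline sample rows are made up so the snippet runs on its own. It loads the dump and pivots it into one column per floor/group, which makes per-floor plots straightforward.

```python
import io

import pandas as pd

# Made-up sample rows standing in for the real .csv dump.
# The column names are assumptions, not the confirmed dump schema.
SAMPLE = io.StringIO(
    "dump_time,group_name,client_count\n"
    "2014-11-01 00:00:00,Avery 2,4\n"
    "2014-11-01 00:00:00,Butler 3,30\n"
    "2014-11-01 00:15:00,Avery 2,5\n"
    "2014-11-01 00:15:00,Butler 3,28\n"
)


def load_dump(source):
    """Read the dump and return a wide table: one row per timestamp, one column per group."""
    df = pd.read_csv(source, parse_dates=["dump_time"])
    return df.pivot_table(
        index="dump_time",
        columns="group_name",
        values="client_count",
        aggfunc="sum",  # sum in case a group appears more than once per timestamp
    )


wide = load_dump(SAMPLE)
# Quick per-group summary to get a first sense of the data.
print(wide.describe().T[["count", "mean", "min", "max"]])
```

With the real file, `load_dump("path/to/dump.csv")` would replace the sample, and `wide["Butler 3"].plot()` gives a first look at a single floor.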

Questions:

alexander-yu commented 8 years ago

From eyeballing plots of various floors/buildings (IPython notebook with plots linked below), I've made a few observations:

  1. There are some pretty notable seasonal effects in the time series, which implies that our time series aren't stationary.
  2. Specifically, we can see seasonality at the daily, weekly, and annual level (I haven't seen any notable patterns on a monthly basis).

    On the daily and weekly level, the capacities are pretty predictable in relative terms. That is, there are many more people during the day at around 3 p.m. than in the morning or after 9 p.m., and numbers tend to die down near closing hours for the buildings that do close. There are also usually more people in study spaces in the middle of the week than on weekends. For dining halls, it's again pretty predictable: many more people during normal eating periods like early afternoon or around 6 p.m. than in the early morning or near closing hours.

    On the annual level, the seasonal effects tend to correspond more closely with the academic calendar; capacities really die down during holidays and breaks, and there are certainly peaks (especially in libraries) as the year approaches midterms/finals.

  3. It also turns out that people tend not to stay in buildings much past closing time. For libraries that's expected anyway, since the libraries with actual closing hours tend to get cleared out by staff (personal experience).

    However, we can still see some people in buildings (Lerner for example) past closing time. For some of these plots, though, it's a bit difficult to tell whether it's a handful of people loitering past hours or it's other devices in the building (like printers/desktops); for example, as the last plot in the IPython notebook shows, on 11/01/14, Avery 2 constantly had some number of devices counted, which I'm guessing are printers or something.
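
To put rough numbers on the daily/weekly patterns rather than just eyeballing them, something like the following could work. It's a sketch on synthetic data: the series below is made up (a 3 p.m. peak, quieter weekends, and a constant floor of 5 always-on devices), and `seasonal_profiles` is a hypothetical helper, not something from the notebook.

```python
import numpy as np
import pandas as pd


def seasonal_profiles(counts):
    """Mean count by hour of day and by day of week for a timestamp-indexed Series."""
    by_hour = counts.groupby(counts.index.hour).mean()
    by_weekday = counts.groupby(counts.index.dayofweek).mean()  # 0 = Monday
    return by_hour, by_weekday


# Synthetic stand-in for one floor: four full weeks of hourly counts.
idx = pd.date_range("2014-09-01", periods=24 * 7 * 4, freq=pd.Timedelta(hours=1))
hour = idx.hour.to_numpy()
weekday_weight = np.where(idx.dayofweek.to_numpy() < 5, 1.0, 0.5)  # quieter weekends
daily_bump = 50 * np.exp(-((hour - 15) ** 2) / 18.0)  # peak around 3 p.m.
counts = pd.Series(5 + weekday_weight * daily_bump, index=idx)  # 5 = always-on devices

by_hour, by_weekday = seasonal_profiles(counts)

# The overnight minimum is a crude estimate of devices that never leave
# (printers, desktops), as opposed to people staying past closing.
night_floor = counts[counts.index.hour < 5].min()

print(by_hour.idxmax(), round(night_floor, 2))
```

On real data, a stable nonzero `night_floor` across many nights would point towards fixed devices rather than people loitering, though that's only a heuristic.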

Link to IPython Notebook: https://github.com/afy2103/Density-Data-Analysis/blob/master/density.ipynb

alanhdu commented 8 years ago

@afy2103 Nice work. I've made a pull request (https://github.com/afy2103/Density-Data-Analysis/pull/1) with some technical comments about the analysis. A couple of high-level comments:

alexander-yu commented 8 years ago

Got it; thanks for the edits/comments. Apologies about the code being hacked together as it was (and I should probably get more familiar with the pandas documentation). Should I incorporate the edits in your version of the IPython notebook into mine without merging the commits?

I'll get some autocorrelation plots together, and also address that hole in the data in a later post.

alexander-yu commented 8 years ago

Update:

  1. From the autocorrelation plots (notebook included in the already linked repo), there's definitely the sort of behavior on the hourly/daily level that we can see from eyeballing the plots, though in some cases it's not as statistically significant as we'd like (in particular on the daily level of seasonality). For example, with Butler 3, the autocorrelation plot shows local min/max points at the half-week/full-week points, which is what we would expect, but many of the points are below the threshold for statistical significance (Avery had better results there). The hourly level is better, since we can see significance at the 12-hour and 24-hour intervals, which is exactly what we would expect.
  2. We can see some of the seasonality we'd expect in the weekly autocorrelation plots, but it's even less significant. A lot of this is probably because we just don't have enough data for that (the hole in the data doesn't help, since pandas isn't able to compute the autocorrelations past that point).
  3. As for the hole in the data, it seems to be about a month where for some reason nothing was recorded. Also interesting: right after the hole ends, a new group of routers called Butler 301 starts recording, separately from Butler 3 (the updated IPython notebook, density.ipynb, shows this). Was this a new group of routers put in by Columbia? It looks like a separate count from the rest of Butler 3, since the capacity recorded for Butler 3 is now significantly lower than it has been historically. Does this also affect Density's current capacity estimations?
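
One possible way around the hole, sketched on synthetic data (not what the notebook currently does): locate the gap from the timestamps, reindex onto a regular hourly grid so the gap becomes NaN, and use `Series.autocorr`, which drops NaN pairs when correlating the series with its shifted copy. The ±1.96/√N band is the usual white-noise approximation for 95% significance. `find_gaps` and `acf_at_lags` are hypothetical helpers, and the series itself is made up (a 24-hour cycle plus noise, with four weeks removed).

```python
import numpy as np
import pandas as pd

STEP = pd.Timedelta(hours=1)  # assumed sampling interval


def find_gaps(index, step=STEP):
    """Return (last_seen, next_seen) pairs where consecutive samples are more than `step` apart."""
    stamps = pd.Series(index)
    mask = stamps.diff() > step
    return list(zip(stamps.shift(1)[mask], stamps[mask]))


def acf_at_lags(series, lags, step=STEP):
    """Autocorrelation at the given lags (in steps), plus the ~95% white-noise bound."""
    # Regular grid: a lag of k rows always means k * step, and the hole shows up as NaN.
    grid = pd.date_range(series.index.min(), series.index.max(), freq=step)
    regular = series.reindex(grid)
    acf = pd.Series({lag: regular.autocorr(lag=lag) for lag in lags})
    bound = 1.96 / np.sqrt(regular.count())  # count() ignores the NaN rows
    return acf, bound


# Synthetic stand-in: 12 weeks of hourly counts with a daily cycle and noise.
rng = np.random.default_rng(0)
idx = pd.date_range("2014-09-01", periods=24 * 7 * 12, freq=STEP)
values = 20 + 10 * np.sin(2 * np.pi * idx.hour.to_numpy() / 24) + rng.normal(0, 2, len(idx))
full = pd.Series(values, index=idx)

# Remove four weeks to mimic the month-long hole.
keep = (idx < pd.Timestamp("2014-09-29")) | (idx >= pd.Timestamp("2014-10-27"))
series = full[keep]

gaps = find_gaps(series.index)
acf, bound = acf_at_lags(series, lags=[12, 24, 168])
print(gaps)
print(acf, bound)
```

Note that `Series.autocorr` computes a Pearson correlation on the overlapping points, which is a slightly different estimator from the one behind `pandas.plotting.autocorrelation_plot`, so the values won't match those plots exactly.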