Reddit comments-per-minute stats normalized to time-of-day and day-of-week

lgommans commented 1 year ago

The fluctuations in usage on different times of the day is pretty huge, and there are regularly two days (weekends I presume, I haven't checked) that also see lower usage than the other days. The 'reduce noise' button helps a little, but doesn't filter this out.

original graph as on the site

I went ahead and bashed some Python onto the JSON that the API returned to see if I can figure out what happened to reddit usage amounts.

graph showing a drop during the protest, almost back to normal right after, and slowly climbing up to pre-protest levels until stabilising at the end of June. There are occasional spikes up and down, presumably outages or when the scraper didn't fire at the right second

Horizontal: day of month
Vertical: comments per minute (CPM), relative to the average CPM of that 5-minute window of that day-of-week

For example, if a typical Monday 00:00–00:05 sees 5500 CPM, then a Monday 00:00–00:05 with a CPM of 6000 will be drawn on the graph as +500. Five minutes was chosen as a middle ground after hourly was found to be way too coarse and minutely caused noise from being too fine-grained.

Visualisation [OC]™ done in LibreOffice Calc by just copying the printed data into the spreadsheet and formatting the graph a little.

Limitations:

Code is not well-tested; may have mistakes that skew the data
Baseline data available only for a few days before the protest, so the averaging had to also be taken from post-protest data. If this influenced the results greatly, then the before-protest CPM ought to show up as through the roof, which they're not. Excluding the protest days from averages counting made things also makes things look more sensible, but since there was no hard end day, it's not very precise.
Not sure if average is the right method. Trying the median, everything gets more spikey, though
Notice the small gap just before the downwards spike towards midnight on the 11th: there was an outage here which pushed the CPM off the charts (downwards). Since that's not representative of how many users were trying to engage with reddit (what I'm trying to establish), I deleted those handful of data points. Only lasted an hour anyway, not much lost on the scale of a month.

Conclusions and discussion:

The protest had an easily measurable impact on reddit usage, which is as much as I could have hoped for. (Note that the total/absolute CPM fluctuates between ~3k at night somewhere-on-earth and ~7k at peak. At its deepest, the protest reduced CPM by about 1.8k.) Having one's opinion heard is what a protest does, so mission accomplished there.
Not all subreddits coming back online on the 14th makes it hard to say whether users were boycotting reddit of their own volition (my camp), or if they simply couldn't post comments in the places they'd ordinarily frequent. Either way, there was a prolonged effect.
Around June 27–28, the CPM value is back to what it was before the protest started: just above relative zero. (That means the averages err slightly on the low side, which is expected per the limitations.) This is also when I decided to open reddit again, to show in the stats that I would still be an active user if I could use RIF.
Most notably and most worryingly, people being cut off on June 30th made no distinguishable difference whatsoever. The value is still hovering just above relative zero. (I've heard figures of around 1/3rd of reddit traffic being via third-party apps before.) I haven't opened reddit since RIF stopped working, but apparently:
This fuss has either drawn more attention and eyeballs to reddit; gotten people more engaged with reddit; made ~all third-party app users (besides a handful like myself) go first-party or redreader; or a combination thereof. It does not seem as though there are many newbie users, and I can't think of a reason why engagement figures should have changed much, so...
I'm afraid this data forces me to conclude that they were right to just quell the protest and stay their course. People don't care about having to use shitty software if that gets them their network effect back and spez can anger anyone on reddit without repercussions for reddit (see also the AMA which didn't do them any favors whatsoever). I wonder what would happen if they had to pay literally 1 cent (in a currency of your choosing) to keep using reddit, though!

Aaanyway, sorry this turned into a real monologue, I really actually just wanted to share my data, methods, and what conclusions could be drawn from it. Perhaps it's interesting to include the graph in the project readme, or to include a similar normalisation feature in the site itself?

My code, at least in its current state (not meant for publishing tbh, but if someone finds it easier to tweak this than to rewrite it from scratch, here it is)

```bash curl https://blackout.photon-reddit.com/api/all > all ``` ```python import json, datetime, math data = json.load(open('all', 'rt')) # {ppm: [{x:..., y:...}, ...], cpm: [{x:..., y:...}, ...], subs: ...} # posts per minute, comments per minute, subreddits # x is a unix timestamp, y the value dows = { 0: 'Sunday ', 1: 'Monday ', 2: 'Tuesday ', 3: 'Wednesday', 4: 'Thursday ', 5: 'Friday ', 6: 'Saturday ', } def tstokey(ts): k = datetime.datetime.fromtimestamp(ts).strftime('%w %H:') m = int(datetime.datetime.fromtimestamp(ts).strftime('%M')) k += str(math.floor(m / 5)) return k hoddow = {} lastts = None for obj in data['cpm']: ts = obj['x'] / 1000 if ts > 1686434400 and ts < 1686736800: # don't include the blackout in averages calculation continue k = tstokey(ts) if k not in hoddow: hoddow[k] = [] hoddow[k].append(obj['y']) if lastts is not None: #print('DIF', datetime.datetime.fromtimestamp(obj['x'] / 1000).strftime('%a'), obj['x'] / 1000 - lastts) pass lastts = obj['x'] / 1000 #print('Averages:') for k in sorted(hoddow.keys()): dow = dows[int(k.split(' ')[0])] #print(dow, k[2 : ], sum(hoddow[k]) / len(hoddow[k])) print('Deviations:') lastK = None nowavg = [] for obj in data['cpm']: ts = obj['x'] / 1000 k = tstokey(ts) now = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M') if lastK != k: if lastK is not None: try: print(str(now) + '\t' + str(round(sum(nowavg) / len(nowavg) - sum(hoddow[k]) / len(hoddow[k])))) except KeyError: pass lastK = k nowavg = [] nowavg.append(obj['y']) ```

ArthurHeitmann commented 1 year ago

Very nice work! I think adding something like that to the website would be a good idea, once I have the time for it.

It is technically possible to get per minute data from before I started recording, by checking in fixed ID intervals at what time a thing was posted. I did something similar in this script, to fill a half day gap in my data. Then the question is how far back can you go, before monthly or yearly fluctuations start impacting the data.

The code looks good to me. Small note though, for naming either use snake_case or camelCase, but not lowercase :) But the most important things is that it works, which it does.

The protests indeed only had a measurable impact for a couple of days. What I like in your data, is that you can see the activity easing to 0 up until the 18th or so.

Regarding the data starting from July 1st, as you said the data so far only suggests a very small if not non measurable impact. But so far, against expectation, most 3rd party apps are still working (except for Apollo). No one really knows why. Maybe reddit hasn't pushed out the new rate limiting changes. Hard to say. According to my calculations any app that has more than 1k or 2k daily users should be rate limited.

We also have to remember that is is just one statistic. Unfortunately the we don't have access more important statistics, like voting activity, daily active users, average time spent on the site, ad revenue, etc. At least not in real-time.

Anyways, thank you for your work, I'll try to incorporate it at some point in the not so distant future. Or if someone else wants to do it, just let me know beforehand.

lgommans commented 1 year ago

But so far, against expectation, most 3rd party apps are still working

I did not know this! I just know that my girlfriend said RIF is dead (around midnight on June 30th) and when I then checked mine, it also showed some http error status code on a blank background. Checking now, I have to login again (which I can't, as it didn't let me either set an email address or choose a password while being logged into RIF; on redreader, my lost-password account is still logged in), but so yeah it works I guess. That explains why there is not even a small drop in usage visible!

Looks like I ran into an error of assumption! Not sure if I want to go back to using reddit though, at least so long as they don't update their public statement to be clear on what course they're going to sail.

Thanks also for looking into the code and its correctness. (Yes, better variable names is my usual style, this script I made in /tmp and considered one-time use / write-only: didn't think at the time of writing that I'd be sharing it.)

Using that technique to fill the gaps is clever, we could indeed use that to fill in, say, two or three weeks before the first subreddits went dark to have some baseline usage figures and call it good enough. Though I don't know if it's worth doing further analyses so long as they aren't cutting anyone off in the first place.

ArthurHeitmann commented 9 months ago

It's been some time now. And with some more data, I don't think displaying a deviation from the mean is possible, since there's quite a bite of variation. But to close this, I've improved the visualization a bit when looking at the full time range, with reduced noise option enabled. It chart now shows some more long term developments.

ArthurHeitmann / reddit_site_stats

Reddit comments-per-minute stats normalized to time-of-day and day-of-week #6