ArthurHeitmann / reddit_site_stats

Tracking and visualizing the scale of the reddit blackout
https://blackout.photon-reddit.com

Reddit comments-per-minute stats normalized to time-of-day and day-of-week #6

Closed: lgommans closed this issue 9 months ago

lgommans commented 1 year ago

The fluctuations in usage at different times of the day are pretty huge, and there are regularly two days (weekends, I presume; I haven't checked) that also see lower usage than the other days. The 'reduce noise' button helps a little, but doesn't filter this out.

[image: original graph as shown on the site]

I went ahead and bashed some Python onto the JSON that the API returned, to see if I could figure out what happened to reddit usage amounts.

[image: graph showing a drop during the protest, almost back to normal right after, then slowly climbing back to pre-protest levels until stabilising at the end of June. Occasional spikes up and down are presumably outages, or the scraper not firing at the right second]

Horizontal: day of month
Vertical: comments per minute (CPM), relative to the average CPM of that 5-minute window of that day-of-week

For example, if a typical Monday 00:00–00:05 sees 5500 CPM, then a Monday 00:00–00:05 with a CPM of 6000 is drawn on the graph as +500. Five minutes was chosen as a middle ground: hourly turned out to be way too coarse, while per-minute was too fine-grained and noisy.
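Or, the same rule condensed into Python, using the numbers from the example above (`baseline` is just an illustrative name here, not something from the script further down):

```python
# baseline: mean CPM per (day-of-week, 5-minute slot), computed over all days
baseline = {('Monday', '00:00'): 5500}  # toy value from the example above

cpm = 6000                                       # observed Monday 00:00-00:05 value
deviation = cpm - baseline[('Monday', '00:00')]  # -> +500, the value that gets plotted
print(deviation)
```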

Visualisation [OC]™ done in LibreOffice Calc by just copying the printed data into the spreadsheet and formatting the graph a little.

Limitations:

Conclusions and discussion:

Aaanyway, sorry this turned into a real monologue; I really just wanted to share my data, my methods, and what conclusions can be drawn from them. Perhaps it's interesting to include the graph in the project readme, or to add a similar normalisation feature to the site itself?

My code, at least in its current state (not meant for publishing tbh, but if someone finds it easier to tweak this than to rewrite it from scratch, here it is):

```bash
curl https://blackout.photon-reddit.com/api/all > all
```

```python
import json, datetime, math

# {ppm: [{x:..., y:...}, ...], cpm: [{x:..., y:...}, ...], subs: ...}
# posts per minute, comments per minute, subreddits
# x is a unix timestamp in milliseconds, y the value
data = json.load(open('all', 'rt'))

dows = {
    0: 'Sunday   ',
    1: 'Monday   ',
    2: 'Tuesday  ',
    3: 'Wednesday',
    4: 'Thursday ',
    5: 'Friday   ',
    6: 'Saturday ',
}

def tstokey(ts):
    # key like "1 13:2" = Monday, 13:10-13:15 (day-of-week, hour, 5-minute slot)
    k = datetime.datetime.fromtimestamp(ts).strftime('%w %H:')
    m = int(datetime.datetime.fromtimestamp(ts).strftime('%M'))
    k += str(math.floor(m / 5))
    return k

# collect the baseline CPM samples per (day-of-week, 5-minute window)
hoddow = {}
for obj in data['cpm']:
    ts = obj['x'] / 1000
    if 1686434400 < ts < 1686736800:
        # don't include the blackout in the averages calculation
        continue
    k = tstokey(ts)
    hoddow.setdefault(k, []).append(obj['y'])

#print('Averages:')
#for k in sorted(hoddow.keys()):
#    dow = dows[int(k.split(' ')[0])]
#    print(dow, k[2:], sum(hoddow[k]) / len(hoddow[k]))

print('Deviations:')
lastK = None
nowavg = []
for obj in data['cpm']:
    ts = obj['x'] / 1000
    k = tstokey(ts)
    now = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M')
    if lastK != k:
        if lastK is not None:
            try:
                # average of the window that just ended, minus its baseline
                print(now + '\t' + str(round(sum(nowavg) / len(nowavg) - sum(hoddow[lastK]) / len(hoddow[lastK]))))
            except KeyError:
                pass
        lastK = k
        nowavg = []
    nowavg.append(obj['y'])
```
ArthurHeitmann commented 1 year ago

Very nice work! I think adding something like that to the website would be a good idea, once I have the time for it.

It is technically possible to get per-minute data from before I started recording, by checking at fixed ID intervals when a thing was posted. I did something similar in this script, to fill a half-day gap in my data. The question then is how far back you can go before monthly or yearly fluctuations start impacting the data.
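Roughly like this, in sketch form (not the actual script linked above; the starting ID, step size, and user agent below are made-up placeholders):

```python
# Sketch: estimate comments per minute from comment IDs alone.
# reddit fullnames like "t1_jabcde1" contain a base36 ID that increases
# roughly monotonically, so (ID difference) / (time difference) ~ comments/minute.
import time
import requests

HEADERS = {'User-Agent': 'cpm-backfill-sketch/0.1'}  # reddit rejects requests without a UA

def to_base36(n):
    digits = '0123456789abcdefghijklmnopqrstuvwxyz'
    out = ''
    while n:
        n, r = divmod(n, 36)
        out = digits[r] + out
    return out or '0'

def created_utc(comment_id):
    """Creation timestamp of the comment with this integer ID, or None if it's gone."""
    resp = requests.get('https://www.reddit.com/api/info.json',
                        params={'id': 't1_' + to_base36(comment_id)},
                        headers=HEADERS)
    children = resp.json()['data']['children']
    return children[0]['data']['created_utc'] if children else None

START_ID = int('jabcde1', 36)  # placeholder: some comment ID in the range of interest
STEP = 100_000                 # placeholder: sample every 100k IDs

prev = None  # (id, timestamp) of the last sample that still existed
for i in range(20):
    cid = START_ID + i * STEP
    ts = created_utc(cid)
    time.sleep(1)  # be gentle with the API
    if ts is None:
        continue  # deleted/removed comment, skip this sample
    if prev:
        prev_id, prev_ts = prev
        cpm = (cid - prev_id) / ((ts - prev_ts) / 60)
        print(f'{ts:.0f}\t~{cpm:.0f} comments/minute')
    prev = (cid, ts)
```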

The code looks good to me. Small note though: for naming, use either snake_case or camelCase, but not all-lowercase :) But the most important thing is that it works, which it does.

The protests indeed only had a measurable impact for a couple of days. What I like about your data is that you can see the deviation easing back to 0 up until the 18th or so.

Regarding the data from July 1st onward: as you said, so far it only suggests a very small, if not immeasurable, impact. But so far, against expectations, most 3rd-party apps are still working (except for Apollo). No one really knows why; maybe reddit hasn't pushed out the new rate-limiting changes yet. Hard to say. According to my calculations, any app with more than 1k or 2k daily users should be rate limited.
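For reference, the back-of-the-envelope version of that calculation (the 100 requests/minute free tier is what reddit announced; the per-user request count is just my rough guess):

```python
# How many daily users fit under the free API tier of one OAuth client ID?
free_tier_rpm = 100               # announced free-tier limit (requests/minute per client ID)
requests_per_user_per_day = 100   # rough guess for an average app user

requests_per_day = free_tier_rpm * 60 * 24         # 144,000 requests/day
max_users = requests_per_day / requests_per_user_per_day
print(max_users)                                   # 1440.0, i.e. in the 1k-2k ballpark
```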

We also have to remember that this is just one statistic. Unfortunately, we don't have access to more important statistics, like voting activity, daily active users, average time spent on the site, ad revenue, etc. At least not in real time.

Anyways, thank you for your work; I'll try to incorporate it at some point in the not-so-distant future. Or if someone else wants to do it, just let me know beforehand.

lgommans commented 1 year ago

> But so far, against expectation, most 3rd party apps are still working

I did not know this! I just knew that my girlfriend said RIF was dead (around midnight on June 30th), and when I then checked mine, it also showed some HTTP error status code on a blank background. Checking now, I have to log in again (which I can't, as it never let me set an email address or choose a password while I was logged into RIF; on redreader, my lost-password account is still logged in), but so yeah, it works, I guess. That explains why there isn't even a small drop in usage visible!

Looks like I ran into a faulty assumption! Not sure if I want to go back to using reddit though, at least as long as they don't update their public statement to be clear about what course they're going to sail.

Thanks also for looking into the code and its correctness. (Yes, better variable names are my usual style; I made this script in /tmp and considered it one-time-use / write-only. I didn't think at the time of writing that I'd be sharing it.)

Using that technique to fill the gaps is clever; we could indeed use it to fill in, say, two or three weeks before the first subreddits went dark, to get some baseline usage figures and call it good enough. Though I don't know if it's worth doing further analyses as long as they aren't cutting anyone off in the first place.

ArthurHeitmann commented 9 months ago

It's been some time now. And with some more data, I don't think displaying a deviation from the mean is feasible, since there's quite a bit of variation. But to close this, I've improved the visualization a bit when looking at the full time range with the reduced-noise option enabled. The chart now shows some more long-term developments.

[image: updated chart of the full time range with reduced noise]