Closed gavinwahl closed 1 week ago
Instead of storing all messages
storing all messages would be too much data.
we keep a count of messages per user per hour per channel
even this seems to be more than necessary for the particular reports.
defer the choice of timezone to later
Boost is used worldwide. There are users in all time zones. It would be expected to use UTC time.
Could this be reduced to a daily count (and then per user per channel)? Run the job once per day.
As Rob wrote:
and message count that gets run on a daily basis
Could this be reduced to a daily count
Yes.
Now that we have data, what reports do we want to do with it? Where on the website will they show up?
well, I think some of that may also be based on what kind of data is available to us. At least for a start, we want to do a message count for each person between releases. Those values would need to be stored somewhere so they could be imported into the release report that Brian is working on.
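As a rough sketch of the "message count for each person between releases" idea, stored per-day counts could be summed over a date range. The row layout, the user IDs, and the release dates below are placeholders for illustration, not the schema or dates used in this PR.

```python
from collections import Counter
from datetime import date

# Hypothetical stored rows: (UTC day, user id, channel, message count).
rows = [
    (date(2024, 4, 1), "U1", "general", 3),
    (date(2024, 4, 2), "U2", "general", 5),
    (date(2024, 4, 20), "U1", "boost-website", 4),
    (date(2024, 8, 15), "U1", "general", 7),
]


def counts_between(rows, start, end):
    """Sum message counts per user for days in the half-open range [start, end)."""
    totals = Counter()
    for day, user, _channel, count in rows:
        if start <= day < end:
            totals[user] += count
    return totals


# Placeholder release dates, not actual Boost release dates.
per_user = counts_between(rows, date(2024, 4, 1), date(2024, 8, 1))
```

Using a half-open range means a message posted on a release day is counted in exactly one release period.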
The Slack API provides an efficient way to fetch new messages directly in a channel, but not messages in threads. A separate API call for each thread must be done to fetch messages in that thread, and there's no way to know if there are new messages in that thread without doing the call.
This PR handles threads by storing a reference to every thread ever encountered, then checking if there are new messages in any thread on every update. The #general channel has approximately 20,000 threads, which at a rate limit of about 1 request/second, will take about 6 hours to retrieve.
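The estimate above can be checked with a few lines of arithmetic. This assumes exactly one request per thread; threads with enough replies to need pagination would add to the total.

```python
threads = 20_000          # approximate thread count in #general, from the comment above
requests_per_second = 1   # approximate rate limit, from the comment above

seconds = threads / requests_per_second
hours = seconds / 3600
print(f"{hours:.1f} hours")  # prints "5.6 hours"
```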
Here are the options I see possible:
@rbbeeston which approach would you like to pursue?
What do you think about the Slack-provided analytics? See screenshots:
The analytics report is not available programmatically on our Slack plan (only Business+, Select/Compliance, or Grid). Also, it only has data for the past 13 months.
Updated to store activity in daily (UTC) buckets instead of hourly, and to store the URL of users' avatars.
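For illustration, daily UTC bucketing could look roughly like this. Slack message timestamps (`ts`) are strings of epoch seconds with a fractional suffix; the message dicts and the `utc_day` helper are assumptions for the sketch, not code from this PR.

```python
from collections import Counter
from datetime import date, datetime, timezone


def utc_day(ts: str) -> date:
    """Map a Slack-style timestamp string to its UTC calendar day."""
    return datetime.fromtimestamp(float(ts), tz=timezone.utc).date()


# Hypothetical messages straddling midnight UTC.
messages = [
    {"user": "U1", "ts": "1704067199.000100"},  # 2023-12-31 23:59:59 UTC
    {"user": "U1", "ts": "1704067200.000200"},  # 2024-01-01 00:00:00 UTC
    {"user": "U2", "ts": "1704070800.000300"},  # 2024-01-01 01:00:00 UTC
]

# One counter entry per (day, user) bucket.
counts = Counter((utc_day(m["ts"]), m["user"]) for m in messages)
```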
Rob and I decided on option 1
Updated to only update recent threads and to allow resumption after being interrupted. Ready for merge.
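One way to limit the work to recent threads is to skip any thread whose last known reply is older than a cutoff. The sketch below is only an assumption about what "recent" could mean: the 30-day window, the `last_reply` field, and the `recent_threads` helper are hypothetical and are not taken from this PR.

```python
from datetime import datetime, timedelta, timezone


def recent_threads(threads, now, max_age=timedelta(days=30)):
    """Return only the threads whose last known reply is within max_age of now."""
    cutoff = now - max_age
    return [t for t in threads if t["last_reply"] >= cutoff]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
threads = [
    {"ts": "a", "last_reply": datetime(2024, 5, 20, tzinfo=timezone.utc)},
    {"ts": "b", "last_reply": datetime(2023, 1, 1, tzinfo=timezone.utc)},
]
to_check = recent_threads(threads, now)
```

The trade-off is that a late reply to an old thread would be missed.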
"fetch_slack_activity is resumable"
updated with @GregKaleka's suggestions
Instead of storing all messages, we keep a count of messages per user per hour per channel to allow further aggregation later. Incremental updates are supported, fetching only new messages since the last update. However, thread messages do not show up in the main message list so message history for every thread ever encountered has to be checked every time.
Hourly buckets are chosen to defer the choice of timezone to later. This will allow aggregation and display to be done in any timezone with a whole-hour UTC offset.
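A minimal sketch of why hourly UTC buckets allow this: each bucket can be shifted by a whole-hour offset and still land entirely inside one local day. The row layout and the `daily_totals` helper are assumptions for illustration, not the PR's model.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Hypothetical hourly rows: (bucket start in UTC, user id, message count).
hourly = [
    (datetime(2024, 1, 1, 23, tzinfo=timezone.utc), "U1", 2),
    (datetime(2024, 1, 2, 0, tzinfo=timezone.utc), "U1", 3),
    (datetime(2024, 1, 2, 6, tzinfo=timezone.utc), "U1", 1),
]


def daily_totals(hourly, utc_offset_hours):
    """Re-aggregate hourly UTC buckets into days of a fixed whole-hour offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    totals = Counter()
    for bucket_start, user, count in hourly:
        local_day = bucket_start.astimezone(tz).date()
        totals[(local_day, user)] += count
    return totals


utc_days = daily_totals(hourly, 0)      # Jan 1: 2 messages, Jan 2: 4 messages
minus8_days = daily_totals(hourly, -8)  # all 6 messages fall on Jan 1 at UTC-8
```

A zone with a half-hour offset, such as UTC+5:30, would split an hourly bucket across two local days, which is why the description limits this to whole-hour offsets.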
Automatically sleeps when encountering rate limiting, so while it may take a while, it will finish successfully. Initial run time for the #boost-website channel was 2 minutes 7 seconds.
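The sleep-on-rate-limit behaviour can be sketched as below. Slack signals rate limiting with HTTP 429 and a `Retry-After` header giving the wait in seconds; the `fetch` callable, the `call_with_backoff` helper, and the attempt cap are hypothetical and not the code in this PR.

```python
import time


def call_with_backoff(fetch, sleep=time.sleep, max_attempts=5):
    """Call fetch() until it is not rate limited.

    fetch is a hypothetical callable returning (status, headers, body).
    """
    for _ in range(max_attempts):
        status, headers, body = fetch()
        if status != 429:
            return body
        # Wait as long as the server asks; fall back to 1 second if unspecified.
        sleep(int(headers.get("Retry-After", 1)))
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")


# Demo with canned responses and a fake sleep that records the requested waits.
responses = iter([(429, {"Retry-After": "3"}, None), (200, {}, {"ok": True})])
slept = []
result = call_with_backoff(lambda: next(responses), sleep=slept.append)
```

Passing `sleep` as a parameter keeps the demo instant and makes the waiting behaviour easy to test.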
With the data collected, we can generate this overall activity report:
Or similar reports for any time range that ends on hour boundaries.
refs #1367