boostorg / website-v2

New Boost website
https://boost.io
Boost Software License 1.0

Download slack activity #1395

Closed gavinwahl closed 1 week ago

gavinwahl commented 3 weeks ago

Instead of storing all messages, we keep a count of messages per user per hour per channel to allow further aggregation later. Incremental updates are supported, fetching only new messages since the last update. However, thread messages do not show up in the main message list, so the message history of every thread ever encountered has to be re-checked on every update.

Hourly buckets are chosen to defer the choice of timezone to later. This will allow aggregation and display to be done in any timezone with a whole-hour UTC offset.
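The bucketing scheme described above can be sketched in plain Python (a sketch only; `hour_bucket` and `count_messages` are illustrative names, not the PR's actual code):

```python
from collections import Counter
from datetime import datetime, timezone

def hour_bucket(slack_ts: str) -> datetime:
    """Truncate a Slack message timestamp (e.g. "1714500000.000200")
    to the start of its UTC hour."""
    dt = datetime.fromtimestamp(float(slack_ts), tz=timezone.utc)
    return dt.replace(minute=0, second=0, microsecond=0)

def count_messages(messages):
    """Aggregate messages into (user, channel, hour) -> count buckets."""
    buckets = Counter()
    for msg in messages:
        buckets[(msg["user"], msg["channel"], hour_bucket(msg["ts"]))] += 1
    return buckets
```

Because buckets are keyed on whole UTC hours, they can later be re-summed into days for any timezone with a whole-hour UTC offset.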

Automatically sleeps when encountering rate limiting, so while it may take a while, it will finish successfully. Initial run time for the #boost-website channel was 2 minutes 7 seconds.
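The sleep-on-rate-limit behaviour can be approximated like this (a sketch; `fetch_page` and `RateLimited` are placeholder names, and the PR's actual retry logic may differ):

```python
import time

class RateLimited(Exception):
    """Raised when Slack returns HTTP 429; carries the Retry-After value."""
    def __init__(self, retry_after: float):
        self.retry_after = retry_after

def fetch_with_retry(fetch_page, sleep=time.sleep):
    """Call fetch_page(), sleeping and retrying whenever Slack rate-limits us.
    The loop always finishes eventually, which is why the initial import can
    be slow but still succeeds."""
    while True:
        try:
            return fetch_page()
        except RateLimited as e:
            sleep(e.retry_after)  # Slack's Retry-After header says how long to wait
```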

With the data collected, we can generate this overall activity report:

SELECT real_name, SUM(b.count)
FROM slack_slackactivitybucket b
JOIN slack_slackuser u ON b.user_id = u.id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
          real_name           | sum
------------------------------+------
 Vinnie Falco                 | 3076
 Rob Beeston                  |  990
 Joaquín M López Muñoz        |  619
 Sam Darwin                   |  534
 René Ferdinand Rivera Morell |  323
 Kenneth Reitz                |  226
 Alan de Freitas              |  179
 Spencer Strickland           |  143
 Julio C Estrada              |  136
 Peter Dimov                  |  119

Or similar reports for any time range that ends on hour boundaries.
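A ranged report like the one above can also be sketched in Python over the bucket rows, for any window that starts and ends on hour boundaries (bucket rows here are plain tuples for illustration, not the actual Django model):

```python
from collections import Counter

def activity_report(buckets, start, end, limit=10):
    """Sum bucket counts per user over [start, end) and return the top users.
    buckets: iterable of (user, channel, hour, count) rows."""
    totals = Counter()
    for user, channel, hour, count in buckets:
        if start <= hour < end:
            totals[user] += count
    return totals.most_common(limit)
```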

refs #1367

sdarwin commented 3 weeks ago

Instead of storing all messages

storing all messages would be too much data.

we keep a count of messages per user per hour per channel

even this seems to be more than necessary for the particular reports.

defer the choice of timezone to later

Boost is used worldwide. There are users in all time zones. It would be expected to use UTC time.

Could this be reduced to a daily count (and then per user per channel)? Run the job once per day.

As Rob wrote:

and message count that gets run on a daily basis

gavinwahl commented 2 weeks ago

Could this be reduced to a daily count

Yes.

Now that we have data, what reports do we want to do with it? Where on the website will they show up?

rbbeeston commented 2 weeks ago

well, I think some of that may also be based on what kind of data is available to us. At least for a start, we want to do a message count for each person between releases. Those values would need to be stored somewhere so they could be imported into the release report that Brian is working on.

gavinwahl commented 2 weeks ago

The Slack API provides an efficient way to fetch new messages directly in a channel, but not messages in threads. A separate API call for each thread must be done to fetch messages in that thread, and there's no way to know if there are new messages in that thread without doing the call.

This PR handles threads by storing a reference to every thread ever encountered, then checking if there are new messages in any thread on every update. The #general channel has approximately 20,000 threads, which at a rate limit of about 1 request/second, will take about 6 hours to retrieve.
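The thread-scanning approach can be sketched roughly as follows (assuming a client shaped like slack_sdk's `conversations_replies`; names and pagination are simplified):

```python
def count_new_thread_messages(client, channel, thread_tss, since_ts):
    """For every known thread, ask Slack for replies newer than since_ts.
    Each thread costs one API call even when nothing is new, which is why
    ~20,000 threads at ~1 request/second takes around 6 hours."""
    new_messages = []
    for thread_ts in thread_tss:
        resp = client.conversations_replies(
            channel=channel, ts=thread_ts, oldest=since_ts
        )
        # The parent message is included in the replies; skip it.
        new_messages.extend(
            m for m in resp["messages"] if m["ts"] != thread_ts
        )
    return new_messages
```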

Here are the options I see possible:

  1. This PR. The update will finish eventually. We might have to update less frequently than daily if the process across all channels takes longer than 1 day.
  2. Only consider threads newer than some threshold, assuming most threads are short-lived. New messages in threads older than the threshold would not be counted. The first import will still have to check all threads ever, but after that we could only check, for example, threads created in the last month (currently 75 in #general).
  3. Use the Slack "Export workspace data" functionality. Write a script to import the whole export without doing any API calls. Someone would have to do this manually every time there's a release. They would have to be a Slack admin to get the export, then either use a management command or web upload to process the file. This is probably the most robust approach as long as there's someone to do the export/import process.
  4. Import historical data once, either using this PR or the script from (2). Then, get notified about messages with the Slack "Events API", a webhook Slack calls whenever a new message is posted, including ones in threads. There would be a period of time between importing the export and installing the webhook during which messages wouldn't be counted (if the webhook is installed after exporting) or double counted (if the webhook is installed before exporting). This approach has the disadvantage that webhooks make testing difficult. There would also be multiple code paths to maintain, one for historical data and one for the webhook.
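Option 2's recency filter could look like this (a sketch; the one-month threshold and all names are illustrative):

```python
import time

MONTH_SECONDS = 30 * 24 * 3600

def recent_threads(thread_tss, now=None, threshold=MONTH_SECONDS):
    """Keep only threads created within the threshold window. Older threads
    are skipped entirely, so new replies in them would go uncounted."""
    now = time.time() if now is None else now
    return [ts for ts in thread_tss if now - float(ts) <= threshold]
```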

@rbbeeston which approach would you like to pursue?

sdarwin commented 2 weeks ago

what do you think about slack provided analytics? See screenshots:

(two screenshots of Slack's built-in analytics page, taken 2024-11-05)

gavinwahl commented 2 weeks ago

The analytics report is not available programmatically on our Slack plan (only Business+, Select/Compliance, or Grid). Also, it only has data for the past 13 months.

gavinwahl commented 2 weeks ago

Updated to store activity in daily (UTC) buckets instead of hourly, and to store the URL of users' avatars.
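Moving from hourly to daily buckets just means truncating the Slack timestamp to a UTC calendar day instead of a UTC hour (again a sketch, not the PR's code):

```python
from datetime import date, datetime, timezone

def day_bucket(slack_ts: str) -> date:
    """Truncate a Slack message timestamp to its UTC calendar day."""
    return datetime.fromtimestamp(float(slack_ts), tz=timezone.utc).date()
```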

gavinwahl commented 1 week ago

Rob and I decided on option 1.

gavinwahl commented 1 week ago

Updated to only update recent threads and to allow resumption after being interrupted. Ready for merge.

vinniefalco commented 1 week ago

"fetch_slack_activity is resumable"

gavinwahl commented 1 week ago

updated with @GregKaleka's suggestions