cloudquery / cloudquery

The open source high performance ELT framework powered by Apache Arrow
https://cloudquery.io

Slack source plugin: Syncing channel histories for many messages can take a long time #5809

Open · hermanschaaf opened this issue 1 year ago

hermanschaaf commented 1 year ago

Background

The Slack API enforces rate limits, so there is an upper bound on how many messages can be fetched in a given time period.

For a back-of-the-napkin calculation, let's say a channel has N threads, and assume each thread has fewer than 1,000 messages so that no pagination is required. Every thread requires (at least) one request, and this endpoint is rate limited at Tier 3: 50+ requests per minute. So a channel with N threads will take at least N / 50 minutes to sync. An active channel with around 10 people can easily generate around 3,000 threads per year (10 threads per day), so every year of history will optimistically take around 1 hour to sync. The rate limits apply across all channels, so if there are hundreds of channels like this, it would not be possible to sync them all within 24 hours.
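For concreteness, here is a minimal sketch of the arithmetic above in Go (the language the plugin is written in). `estimateSyncTime` is a hypothetical helper for illustration only, and 50 requests/minute is the Tier 3 floor assumed above:

```go
package main

import (
	"fmt"
	"time"
)

// estimateSyncTime: with (at least) one conversations.replies call per thread
// and a floor of requestsPerMinute requests per minute, syncing threadCount
// threads takes at least threadCount / requestsPerMinute minutes.
func estimateSyncTime(threadCount, requestsPerMinute int) time.Duration {
	minutes := float64(threadCount) / float64(requestsPerMinute)
	return time.Duration(minutes * float64(time.Minute))
}

func main() {
	threads := 10 * 365 // ~10 new threads per day for one year in one channel
	fmt.Println(estimateSyncTime(threads, 50)) // prints 1h13m0s, per channel-year of history
}
```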

Potential Solutions

I see two potential solutions, not mutually exclusive:

  1. Add support for cursor-based syncing (Link to Plugin-SDK issue). This would let us sync the full history of a channel once and then fetch only new messages in subsequent syncs (see the sketch after this list). Even if the first sync takes several hours or days, it would be fast thereafter, which might be acceptable to most users.
  2. Add support for syncing from a Slack data dump. Slack allows exporting workspace data as a series of JSON files. These seem to mirror the API responses fairly closely, so it should be possible to load them into a CloudQuery destination using methods similar to those we use for syncing from the API today. I see two issues here:
    • exports are admin-only, and (as far as I can tell) user API tokens have been deprecated by Slack. Bots are not permitted to perform an export, and admin users cannot generate an API token (short of extracting it from their cookies, which I don't want to recommend), so this would necessarily be something we ask admin users to do manually via the UI.
    • once imported into a CloudQuery database, future syncs would still overwrite the export unless cursor-based syncing is introduced (as proposed in point 1 above).
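To make option 1 more concrete, here is a rough, hypothetical sketch of incremental (cursor-based) syncing against the slack-go client: persist the newest message timestamp seen per channel and pass it as `oldest` on the next sync, so only new messages are fetched. `fetchNewMessages` and the way the high-water mark is persisted are assumptions for illustration, not existing plugin code.

```go
package slackplugin

import (
	"context"

	"github.com/slack-go/slack"
)

// fetchNewMessages fetches only messages newer than lastSeenTS (the value
// persisted by the previous sync) and returns the updated high-water mark.
// Passing an empty lastSeenTS performs a full-history sync.
func fetchNewMessages(ctx context.Context, client *slack.Client, channelID, lastSeenTS string) (string, []slack.Message, error) {
	newest := lastSeenTS
	var msgs []slack.Message
	cursor := ""
	for {
		resp, err := client.GetConversationHistoryContext(ctx, &slack.GetConversationHistoryParameters{
			ChannelID: channelID,
			Oldest:    lastSeenTS, // only return messages after the stored timestamp
			Cursor:    cursor,
			Limit:     200,
		})
		if err != nil {
			return newest, msgs, err
		}
		for _, m := range resp.Messages {
			msgs = append(msgs, m)
			// Slack "ts" values are fixed-width strings like "1700000000.123456",
			// so string comparison is good enough for this sketch; a real
			// implementation should parse them.
			if m.Timestamp > newest {
				newest = m.Timestamp
			}
		}
		if !resp.HasMore {
			return newest, msgs, nil
		}
		cursor = resp.ResponseMetaData.NextCursor
	}
}
```

Threads would still need one `conversations.replies` call each, but with a persisted per-channel (or per-thread) timestamp only threads with new activity would need to be re-fetched, which is what turns the first slow sync into a one-time cost.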

I think we should be able to do number 1 first, then add number 2 later as a potential optimization.
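For reference, a hypothetical sketch of what loading an export dump (number 2) might look like, assuming the usual export layout of one directory per channel containing one JSON file per day, each holding an array of message objects that mirror the API shape closely enough to decode into the same slack.Message type:

```go
package slackplugin

import (
	"encoding/json"
	"os"
	"path/filepath"

	"github.com/slack-go/slack"
)

// loadExportedMessages reads every daily JSON file for one channel from an
// unzipped Slack workspace export and decodes the messages it contains.
func loadExportedMessages(exportDir, channelName string) ([]slack.Message, error) {
	files, err := filepath.Glob(filepath.Join(exportDir, channelName, "*.json"))
	if err != nil {
		return nil, err
	}
	var all []slack.Message
	for _, f := range files {
		data, err := os.ReadFile(f)
		if err != nil {
			return nil, err
		}
		var msgs []slack.Message
		if err := json.Unmarshal(data, &msgs); err != nil {
			return nil, err
		}
		all = append(all, msgs...)
	}
	return all, nil
}
```

As noted above, without cursor-based syncing a later API sync would still overwrite whatever was imported this way, so number 1 remains the prerequisite.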

Note

I'm raising this issue mostly for awareness and feedback, and to gauge how much interest there is from the community in fixing this. Please 👍 this issue if you are interested!

bbernays commented 1 year ago

Do we need to sync messages?