dcppc / centillion

Centillion is the Data Commons search engine. One centillion is 3.03 log-times better than a googol.
https://dcppc.github.io/centillion
MIT License

Feature request: search of Slack channel history #145

Open charlesreid1 opened 5 years ago

charlesreid1 commented 5 years ago

Note that this is different from #144.

This feature request is for the ability to search the history of public Slack channels in the NIH-DCPPC workspace.

We have two options:

OPTION 1: FULL HISTORY: centillion scrapes Slack history and stores a local copy, which it can index, host, and provide permalinks to. If messages become unavailable in Slack (i.e., once they fall outside the most recent 10,000 messages), they will not disappear from centillion.

OPTION 2: 10K LIMIT: centillion only indexes the Slack messages that can be accessed through Slack - namely, the last 10,000 messages. If messages are unavailable in Slack, they are unavailable in centillion.
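A rough sketch of what Option 2 could look like on the centillion side: each message still reachable in Slack becomes a small search document that links back to Slack rather than hosting the text itself. Everything named here is an assumption for illustration: the `to_index_docs` and `slack_permalink` helpers, the document field names, and the message dicts with `ts`, `user`, and `text` keys. The permalink shape is the commonly seen archive URL pattern; Slack's `chat.getPermalink` API method would be the authoritative way to get one.

```python
from typing import Dict, List


def slack_permalink(workspace: str, channel_id: str, ts: str) -> str:
    """Build an archive-style link to a message (assumed URL shape).

    The message timestamp, e.g. "1540000000.000100", is used with the
    dot removed and a leading "p".
    """
    return f"https://{workspace}.slack.com/archives/{channel_id}/p{ts.replace('.', '')}"


def to_index_docs(workspace: str, channel_id: str,
                  messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Turn Slack-style message dicts into flat documents for a search index.

    Messages with no text (hypothetically: joins, file-only posts) are skipped.
    """
    docs = []
    for msg in messages:
        text = msg.get("text", "")
        if not text:
            continue
        docs.append({
            "id": f"{channel_id}-{msg['ts']}",
            "kind": "slack_message",
            "author": msg.get("user", ""),
            "content": text,
            "url": slack_permalink(workspace, channel_id, msg["ts"]),
        })
    return docs
```

Because the documents only carry a link back to Slack, a message that ages out of Slack would leave a dead link, which matches the Option 2 trade-off described above.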

While Option 1 is obviously attractive, it brings a tangle of complications with it - security concerns (large volume of sensitive data), infrastructure needs (backend database, hosting, URLs, domains), frontend work (what interface to use to present Slack conversations to the user, if we can't just link to the messages in Slack), etc.

I would vote for Option 2, unless the need for Option 1 is compelling.

If we do need to go with Option 1, it would probably be better to write a separate (new) software package, which would integrate with Slack and constantly run in the background to update its local database of Slack messages, and call that service (rather than calling Slack directly) from centillion. But again, my preference would be for Option 2.
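For comparison, the core of the separate service described for Option 1 might be a local store that keeps every message it has ever seen. This is a minimal sketch, not a design: the SQLite schema, the `open_store`, `upsert_messages`, and `search` names, and the message dict keys are all hypothetical, and the part that actually fetches messages from Slack is left out entirely.

```python
import sqlite3
from typing import Dict, Iterable, List, Tuple

# Hypothetical schema: one row per message, keyed by channel and timestamp.
SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    channel TEXT NOT NULL,
    ts      TEXT NOT NULL,
    author  TEXT,
    text    TEXT,
    PRIMARY KEY (channel, ts)
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the local message store."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def upsert_messages(conn: sqlite3.Connection, channel: str,
                    messages: Iterable[Dict[str, str]]) -> int:
    """Insert or overwrite messages, so repeated background syncs are idempotent."""
    rows = [(channel, m["ts"], m.get("user"), m.get("text", "")) for m in messages]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO messages (channel, ts, author, text) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)


def search(conn: sqlite3.Connection, term: str) -> List[Tuple[str, str, str]]:
    """Naive substring search; a real service would use a proper index."""
    cur = conn.execute(
        "SELECT channel, ts, text FROM messages WHERE text LIKE ? ORDER BY ts",
        (f"%{term}%",),
    )
    return cur.fetchall()
```

In this arrangement centillion would call something like `search` on the service instead of calling Slack, and rows already stored would stay searchable after Slack stops serving them. The security and hosting concerns raised above apply to wherever this database lives.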

ctb commented 5 years ago

we pay for slack, so we have access to all the messages.