Problem
In my naive earlier attempts at deduplication, I stored Reddit posts grouped under the author's username as the key. This will eventually cause a problem: a single author's record can grow large enough that a large portion of the stored submissions gets evicted from Redis. There's no good way to query or scan for records in Redis by value, so using the submission ID by itself as the key doesn't make much sense either, since it gives no context for narrowing a search and keeping it performant.
Proposed change
Change it so that the key becomes a combination of author + post ID:
user = foo, id = abc123
# becomes
key = foo/abc123
The deduplication logic would then change from fetching a set and iterating over its members to using Redis's SCAN command with a pattern that matches on the author, like SCAN 0 MATCH foo/*
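A minimal sketch of what the SCAN-based lookup could look like, assuming a Python client. The helper names (make_key, seen_post_ids, is_duplicate) and the FakeRedis stand-in are hypothetical and only here for illustration. The one client call relied on is scan_iter(match=..., count=...), which redis-py provides and which follows the SCAN cursor until it returns to 0.

```python
from fnmatch import fnmatchcase


def make_key(author, post_id):
    """Build the proposed composite key, e.g. 'foo/abc123'."""
    return f"{author}/{post_id}"


def seen_post_ids(client, author, count=100):
    """Collect the post IDs already stored for one author.

    `client` is assumed to expose scan_iter(match=..., count=...),
    as redis-py's Redis class does.
    """
    prefix = f"{author}/"
    ids = set()
    for key in client.scan_iter(match=f"{prefix}*", count=count):
        # redis-py returns bytes unless decode_responses=True is set.
        if isinstance(key, bytes):
            key = key.decode()
        ids.add(key[len(prefix):])
    return ids


def is_duplicate(client, author, post_id):
    """True if this author/post combination has already been stored."""
    return post_id in seen_post_ids(client, author)


class FakeRedis:
    """Hypothetical in-memory stand-in so the sketch runs without a server."""

    def __init__(self, keys):
        self.keys = list(keys)

    def scan_iter(self, match="*", count=None):
        # Glob-style matching, roughly what SCAN's MATCH option does.
        return (k for k in self.keys if fnmatchcase(k, match))


client = FakeRedis(["foo/abc123", "foo/def456", "foobar/zzz999"])
print(sorted(seen_post_ids(client, "foo")))
print(is_duplicate(client, "foo", "abc123"))
```

Two things to keep in mind with this approach: a single SCAN 0 call may return only part of the matches, so the cursor has to be followed until it comes back as 0 (scan_iter does this), and MATCH is applied as a filter after keys are retrieved, so a full iteration still walks the whole keyspace.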