getsince / test3


maybe use cuckoo/bloom filter or fst+trie for seen_profiles #218

Closed ruslandoga closed 3 years ago

ruslandoga commented 3 years ago

nice and somewhat related (bloom):

nice and definitely related (cuckoo):

ruslandoga commented 3 years ago

When should a uuid be deleted from a filter? If deletions are scheduled in a job queue, there isn't much of a win over storing all seen uuids in the seen_profiles table.

One way to do scheduled deletes from filters is to create one filter per period: for example, start a new filter each day and keep only the last 7. To check whether a uuid has been seen, check the filter for each period. Since each filter can be small (it only needs to fit one period's worth of uuids, which is day's feed length * days in period), storage space and lookup time (* # of periods) shouldn't be a problem.

Periods can also be approximate: we can always keep 3 periods and vary in-app how often they are rotated. A period can be a day, several days, a week, or even a month. On each rotation the oldest period is deleted and a new filter is created in its place.
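A rough sketch of the per-period rotation described above, in Python for illustration only. The `BloomFilter` and `RotatingSeenFilter` classes, their method names, and the sizing defaults are all hypothetical, not code from this repo, and a cuckoo filter could be swapped in for the Bloom filter without changing the rotation logic.

```python
import hashlib
from collections import deque


class BloomFilter:
    """Small fixed-size Bloom filter (hypothetical sizing, not tuned)."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd, so never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


class RotatingSeenFilter:
    """Keeps one filter per period and drops the oldest on rotation."""

    def __init__(self, periods=3, num_bits=8192, num_hashes=4):
        self._new_filter = lambda: BloomFilter(num_bits, num_hashes)
        # deque(maxlen=...) discards the oldest filter when a new one is appended.
        self.filters = deque([self._new_filter()], maxlen=periods)

    def mark_seen(self, uuid):
        # Writes always go to the current (newest) period.
        self.filters[-1].add(uuid)

    def maybe_seen(self, uuid):
        # False means definitely not seen; True may be a false positive.
        return any(uuid in f for f in self.filters)

    def rotate(self):
        # Called by the app whenever it decides a period has ended.
        self.filters.append(self._new_filter())
```

With `periods=3`, a uuid marked in the current period stays visible through two rotations and is forgotten on the third, so no per-uuid delete job is needed.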

ruslandoga commented 3 years ago

Another approach is storing 'seens' in an FST, but it would need to be periodically rebuilt from scratch (the current period can be a trie; when the current period ends, it can be rebuilt as an FST). And scheduling deletions is a problem again.
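A minimal sketch of the trie-then-freeze idea, in Python for illustration. It does not build a real FST: `FrozenKeys` is a hypothetical stand-in that just keeps the sorted keys in a tuple, to show the mutable-current-period and immutable-past-period split. A real FST would be built from the same sorted key stream and would also share prefixes and suffixes.

```python
from bisect import bisect_left


class Trie:
    """Mutable trie for the current period (nested dicts, "" marks end of key)."""

    def __init__(self):
        self.root = {}

    def insert(self, key):
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node[""] = True  # end-of-key marker

    def __contains__(self, key):
        node = self.root
        for ch in key:
            node = node.get(ch)
            if node is None:
                return False
        return "" in node

    def keys(self, node=None, prefix=""):
        """Yield stored keys in lexicographic order."""
        node = self.root if node is None else node
        for ch in sorted(node):
            if ch == "":
                yield prefix
            else:
                yield from self.keys(node[ch], prefix + ch)


class FrozenKeys:
    """Immutable stand-in for an FST: sorted keys plus binary search."""

    def __init__(self, sorted_keys):
        self._keys = tuple(sorted_keys)

    def __contains__(self, key):
        i = bisect_left(self._keys, key)
        return i < len(self._keys) and self._keys[i] == key


def freeze(trie):
    # At period end, rebuild the immutable structure from the sorted key stream.
    return FrozenKeys(trie.keys())
```

Since the frozen structure cannot be edited, removing old 'seens' here means dropping a whole frozen period rather than deleting individual uuids.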

ruslandoga commented 3 years ago

How can 'seens' be filtered out of feed queries in the DB with this approach without copying all the data into the app? Not clear.