SaxyPandaBear / food-pics

Scheduled webhook that scrapes Reddit and posts to Discord
MIT License

Deduplicate cross-posts for Reddit food pics #2

Closed SaxyPandaBear closed 2 years ago

SaxyPandaBear commented 2 years ago

The Redis cache helps avoid reposting the exact same submission, because it stores the Reddit post ID. What it doesn't help with is cross-posts to other subreddits. For example, if user A submits post B to subreddit 1, and submits post C to subreddit 2, the existing implementation considers the two posts different, and can potentially post them adjacently.

Need a way to deduplicate across cross posts.

Currently unsure of how to properly do this, as the titles are likely to be different. The only shared things would be the username of the poster and the image contents. The image URL would be different (in some cases, because you can upload an image directly to Reddit), so I'm not sure if there is a way to check the actual image contents and do a diff on them.

SaxyPandaBear commented 2 years ago

Calculating the RMS difference for the images might be a way to do it, but I think that doesn't account for sizing differences. Then there's the question of how to check this in practice. Will have to keep looking, but this seems like a significant rework of what gets stored in Redis, which reduces the number of entries that can be stored there for free - a fine tradeoff.

SaxyPandaBear commented 2 years ago

Testing and observations

Tested this out with a specific observed cross-post by calculating the RMS difference between the two images pulled directly from Reddit, as well as some other random ones. Also tried calculating the RMSD for the same image resized. I think I'll have to set a threshold for what difference is allowable, and this might get tweaked over time. I think if the difference is > 50, that's a good indication that the images are different.

Now, a broader problem is, how do I know what has already been posted? This requires a significant redesign of the usage of Redis.

Redis data store design

In Redis, we can store sets of strings. We can leverage this along with JSON strings in order to store sets of composite/complex data.

What information is necessary in order to perform the calculation for deduplication?

  1. The username
  2. The submission ID
  3. A link to the image URL for the post

We can model this in a JSON struct like so:

"Foo": {
    "id": "abc-123",
    "url": "https://i.reddit.com/foobarbaz"
}

Breaking this down, we have the key for the object as the username, Foo. The object itself contains the submission ID, abc-123, and the image URL for the submission. Note that long URLs will eat up the storage we have on the cluster, but this is fine because we aren't paying for it. It just means cached entries will be evicted sooner.

With this JSON model, we can then store a set of JSON objects with the same username key.

"Foo": [
  {...},
  {...}
]
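
As a rough illustration, storing and reading this model with redis-py could look like the following sketch (the client setup and key names here are assumptions, not the project's actual code):

import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Each set member is a JSON string describing one submission by this user.
entry = json.dumps({"id": "abc-123", "url": "https://i.reddit.com/foobarbaz"})
r.sadd("Foo", entry)

# Deserialize every cached submission for the user.
posts = [json.loads(member) for member in r.smembers("Foo")]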

Assuming we have this in place, when a new submission needs to be checked, we can perform the following logic for validation (a code sketch follows this list):

  1. Check if the key exists in the data store - if the username is not cached, then we don't have a post from them and it's okay to post this submission.
  2. If the key exists, check if the submission ID is already in the set - if so, this is a repost and we skip it.
  3. If the submission ID is not already in use, check the image RMS difference against each cached entry - if it falls below a specified threshold, treat it as a duplicate cross-post and skip it.
  4. Otherwise, post it.
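
A minimal sketch of that flow, assuming a redis-py client and a hypothetical rms_diff(url_a, url_b) helper that downloads both images and returns their RMS difference (e.g. built on the rmsd function shown later in this thread):

import json

THRESHOLD = 50  # RMS difference below this suggests the same image

def should_post(r, username: str, submission_id: str, image_url: str) -> bool:
    # 1. No cached entries for this user: safe to post.
    if not r.exists(username):
        return True
    cached = [json.loads(member) for member in r.smembers(username)]
    # 2. Exact submission ID match: a straight repost.
    if any(entry["id"] == submission_id for entry in cached):
        return False
    # 3. Near-identical image contents against a cached entry: a cross-post.
    if any(rms_diff(entry["url"], image_url) < THRESHOLD for entry in cached):
        return False
    # 4. Otherwise, post it.
    return True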

Considerations

  1. Because the Redis cluster provisioned by Heroku is memory constrained, if we happen to keep getting cache hits for the same user, then their set of posts could grow large enough to evict all of the other keys. This is an odd edge case, and I don't want to try to figure out how to account for this for such a small dinky project.
  2. LRU cache eviction is fine because, with the username as the key, a hot user stays hot in the cache. There shouldn't be any concern with eviction beyond the scenario above, where one user's set of submissions overflows the Redis cluster memory constraint.
  3. If the same image gets cross-posted, but with different dimensions, there's a chance that it'll slip through depending on how different the image dimensions are. I'm fine with this.

SaxyPandaBear commented 2 years ago

Rather than doing RMSD, considering hashing the image instead, because the comparison is symmetric.

print(rmsd('image1.jpg', 'image1.jpg'))
print(rmsd('image1.jpg', 'image7.jpg'))
print(rmsd('image7.jpg', 'image1.jpg'))
print(rmsd('image8.jpg', 'image7.jpg'))

print(hashdiff('image1.jpg', 'image1.jpg'))
print(hashdiff('image1.jpg', 'image7.jpg'))
print(hashdiff('image7.jpg', 'image1.jpg'))
print(hashdiff('image8.jpg', 'image7.jpg'))

prints the following result:

0.0
128.94635148412115
114.61897909699658
0.0
0
146521182184558493
-146521182184558493
0

Specifically between image1 and image7 (just a resizing), it is apparent that swapping the order of the RMSD inputs changes the result, but swapping the inputs for the hash solution doesn't (we can just take the absolute value of the hash difference).

SaxyPandaBear commented 2 years ago

Code written for this

import math

from PIL import Image, ImageChops

def rmsd(image1: str, image2: str) -> float:
    # Root-mean-square difference between two images on disk.
    with Image.open(image1) as i1, Image.open(image2) as i2:
        diff = ImageChops.difference(i1, i2)
        h = diff.histogram()
        # Weight each histogram bucket by the square of its pixel value.
        sq = (value * ((idx % 256) ** 2) for idx, value in enumerate(h))
        sum_of_squares = sum(sq)
        # Normalizing by the first image's pixel count is what makes the
        # result depend on argument order for differently sized inputs.
        rms = math.sqrt(sum_of_squares / float(i1.size[0] * i1.size[1]))
        return rms

def hashdiff(image1: str, image2: str) -> int:
    with Image.open(image1) as i1, Image.open(image2) as i2:
        # Python's built-in hash() of bytes is randomized per process
        # (PYTHONHASHSEED), so a stable digest (e.g. hashlib) would be
        # needed to persist these values across runs.
        return hash(i1.tobytes()) - hash(i2.tobytes())

SaxyPandaBear commented 2 years ago

It makes sense that the RMSD differs based on the input order, since the normalization in the calculation divides by the pixel count of the first input. If the images are different sizes, this produces a different output depending on which one comes first.

SaxyPandaBear commented 2 years ago

By comparing the hash, I don't need to redownload the image for checking every time and can just persist the hash of the image bytes.
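
If the hash is persisted alongside each cached submission, the JSON model above might grow a field like this (a sketch; note that Python's built-in hash() is randomized per process, so a stable digest would be needed for values to survive restarts):

"Foo": {
    "id": "abc-123",
    "url": "https://i.reddit.com/foobarbaz",
    "hash": 146521182184558493
}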

SaxyPandaBear commented 2 years ago

For downloading the image, it will need to be written to disk first.

>>> url
'https://i.redd.it/j5aamv6w2lz71.jpg'
>>> url.split('/')
['https:', '', 'i.redd.it', 'j5aamv6w2lz71.jpg']
>>> img = url.split('/')[-1]
>>> img
'j5aamv6w2lz71.jpg'
>>> data = requests.get(url).content
>>> with open(img, 'wb') as f:
...         f.write(data)
... 
222917
>>> exit()

Did the above in interactive mode; it should be sufficient for just downloading the image, but the downloaded file will need to be cleaned up after the hash gets computed.
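
Wrapped into a helper with cleanup, that might look something like this sketch (the function name is hypothetical):

import os

import requests
from PIL import Image

def hash_image_at(url: str) -> int:
    # Derive a local filename from the last segment of the URL.
    filename = url.split('/')[-1]
    with open(filename, 'wb') as f:
        f.write(requests.get(url).content)
    try:
        with Image.open(filename) as img:
            return hash(img.tobytes())
    finally:
        # Clean up the downloaded file even if hashing fails.
        os.remove(filename)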

SaxyPandaBear commented 2 years ago

This doesn't really account for galleries (#4), because there's no guarantee that the image IDs in the individual galleries would sort the same way. The only thing I can think of to disambiguate them is to compute the hash of each image in the gallery, and then sort them by the image hashes (see the sketch below). But honestly, that's such an edge case that I don't care. Further, if the image posted from one subreddit is different from the image posted from another subreddit with the same gallery contents, that's still technically consistent with what I want to do here - deduplication by the actual image posted - so I think it's fine.
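
For reference, sorting the per-image hashes would give an order-independent fingerprint for a gallery, roughly like this sketch (the function is hypothetical):

from PIL import Image

def gallery_fingerprint(image_files: list) -> tuple:
    # Hash each image, then sort so the ordering of images within the
    # gallery doesn't affect the fingerprint.
    hashes = []
    for filename in image_files:
        with Image.open(filename) as img:
            hashes.append(hash(img.tobytes()))
    return tuple(sorted(hashes))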

SaxyPandaBear commented 2 years ago

Commits d7ac14dac0e20e717746e63abd81776930706267 and 0dc91a4c17a7ea50754389dd44d776d6d2dde9a0 were added to address bugs in this implementation.