Tyrrrz / DiscordChatExporter

Exports Discord chat logs to a file
MIT License
7.29k stars 668 forks source link

`--reuse-media` creates an unacceptably high chance of a collision #1231

Open Lymia opened 2 months ago

Lymia commented 2 months ago

Version

v2.42.8

Flavor

GUI (Graphical User Interface), CLI (Command-Line Interface)

Platform

Linux

Export format

HTML, TXT, JSON, CSV

Steps to reproduce

Export servers where large numbers of videos, images and especially screenshots are often posted. Observe that files named image.png or unknown.png are extremely common due to copy-paste of images directly into Discord, often leaving no file name available.

Do math to confirm the birthday problem is at play and will be causing you problems.

Details

The algorithm for making file names for downloaded media only stores 20 bytes of the hash of the URL. This is insufficient.

Despite there being 1048576 possible hashes, it only takes 1200 images for there to be a 50% chance of a collision and 500 images before there is a 10% chance of a collision due to the birthday problem. The original reasoning in the issue where this code was added (#395) is incorrect due to the commonness of unknown.png, image.png, and other filenames.

A list of all filenames I've seen that could reach that 10% chance of a collision in just my own exports of servers I'm in:

    508 hqdefault.jpg
    541 image0.png
    681 sddefault.jpg
   1269 image0.jpg
   1696 tenor.png
   5798 maxresdefault.jpg
   6637 image.png
  11314 unknown.png

As can clearly be seen, file names are not evenly distributed. Notably, maxresdefault, sddefault and hqdefault are not being posted intentionally by users at all, and are instead the result of youtube thumbnails. Similarly, tenor happens because the tenor gif button is pressed in Discord's own UI.

There is a near 100% chance that there are collisions here between unknown.png, image.png, and maxresdefault.jpg. You can check with this calculator: https://www.bdayprob.com/. Solve P(D,N) with D = 1048576 and N = 6637.

10 characters provides a better margin of safety: 15000 images with the same name before there is even a 0.01% chance, which would have successfully handled my use case.

10 characters of base-32 rather than base-16 would provide 600000 images before there is the same chance of collision. 8 characters of base-32 would provide the same margin as the 10 characters of base-16.

Checklist

96-LB commented 2 months ago

An alternative suggestion to lengthening the hash would be to just put the attachment ID in the filename. I feel like at a certain point, a bunch of random base-32 characters doesn't really provide much utility over 19 digits (which would be guaranteed to never collide).

Twi-Hard commented 12 hours ago

Pasted from https://github.com/Tyrrrz/DiscordChatExporter/pull/1232

I have a single channel with 8774 conflicts. This command counts how many times each filename occurred:

❯ cat 98609319519453184.json | jq -r '.messages[].attachments[].url' | sort | uniq -c | sort -nr > discord-98609319519453184-dupes.txt

This command removes the lines with only 1 instance of a filename then sums up all of the file counts for files with conflicts:

❯ cat discord-98609319519453184-dupes.txt | awk '{print $1}' | grep -vE '^1$' | paste -sd+ - | bc
17453

Then count the number of filenames that occur more than once:

❯ cat discord-98609319519453184-dupes.txt | awk '{print $1}' | grep -vE '^1$' | wc -l
8679

17453-8679 = 8774

I manually checked some of the messages that had duplicate filenames in discord and they were indeed unique images in discord but only one image was saved locally. I've checked a different channel too and the same thing happened. I can't remember if it's the default or not, but I store each channels images in its own folder

❯ ls -1U | head -10
670866322619498511-media
670866322619498511.json
704107851421057114-media
704107851421057114.json
704226060178292846-media
704226060178292846.json
862221868621758502-media
862221868621758502.json
862732083050971146-media
862732083050971146.json

It would be really great if you could implement an option to make each name unique. People that want to stick with the old method don't need to do anything if it's done this way.

Here's a set of images that end up with the same file names: unknown-F682B.png:

https://cdn.discordapp.com/attachments/98609319519453184/760954915362832414/unknown.png?ex=66aa987d&is=66a946fd&hm=3c3c51658e86c5f906cb4e22c78258836c4e1f3144086c5a7f1a261ced04c8e5&
https://cdn.discordapp.com/attachments/98609319519453184/1001287339462627408/unknown.png?ex=66aa3040&is=66a8dec0&hm=dfdf80af98d7fdb373cdb3591e1d8e106312200f1f93d2548f56fd7e185d3226&

unknown-A4DB4.png:

https://cdn.discordapp.com/attachments/98609319519453184/352621773524697090/unknown.png?ex=66aa385e&is=66a8e6de&hm=2eb1fd389f5522e18ec85a6a52e32fe3c2cfb364ac4daa1d4979405109c8a547&
https://cdn.discordapp.com/attachments/98609319519453184/826940484042620938/unknown.png?ex=66aab557&is=66a963d7&hm=0d7a7c51108596c84207a29b9d5b0b5f19503b7a43d92d83306c37da58825a5a&

unknown-F1191.png:

https://cdn.discordapp.com/attachments/98609319519453184/342358582458712064/unknown.png?ex=66aa74c6&is=66a92346&hm=dc2d387a9da0597b3a2668491b2b4a42032aafde351268c883d498ac562cfbaf&
https://cdn.discordapp.com/attachments/98609319519453184/532705112649629707/unknown.png?ex=66aa228c&is=66a8d10c&hm=a2e4e9cb280a8978aad52611b1f59276748815d6163e78ddf88c4cc8ae463fcd&
https://cdn.discordapp.com/attachments/98609319519453184/1024319357066682440/unknown.png?ex=66aa433d&is=66a8f1bd&hm=c51fc893d150ef8946d843c416c217816beed1cea4c7e30a7d92ceca86e7b53d&

unknown-7DBD3.png:

https://cdn.discordapp.com/attachments/98609319519453184/187127175315456002/unknown.png?ex=66aaa503&is=66a95383&hm=0dca9934bad285ecba5e5499b73399093056dffdf3ac43db3e3799c18f12250d&
https://cdn.discordapp.com/attachments/98609319519453184/418214803568328713/unknown.png?ex=66aa3923&is=66a8e7a3&hm=51ea7853501a6263624fe5830bf329b9aa9e72f5fb3ad584788e249233250552&
ofifoto commented 10 hours ago

hmm that's concerning, considering I use this on bigger discord as well where collisions would probably be likely