Open Lymia opened 2 months ago
An alternative suggestion to lengthening the hash would be to just put the attachment ID in the filename. I feel like at a certain point, a bunch of random base-32 characters doesn't really provide much utility over 19 digits (which would be guaranteed to never collide).
Pasted from https://github.com/Tyrrrz/DiscordChatExporter/pull/1232
I have a single channel with 8774 conflicts. This command counts how many times each filename occurred:
❯ cat 98609319519453184.json | jq -r '.messages[].attachments[].url' | sort | uniq -c | sort -nr > discord-98609319519453184-dupes.txt
This command removes the lines with only 1 instance of a filename then sums up all of the file counts for files with conflicts:
❯ cat discord-98609319519453184-dupes.txt | awk '{print $1}' | grep -vE '^1$' | paste -sd+ - | bc 17453
Then count the number of filenames that occur more than once:
❯ cat discord-98609319519453184-dupes.txt | awk '{print $1}' | grep -vE '^1$' | wc -l 8679
17453-8679 = 8774
I manually checked some of the messages that had duplicate filenames in discord and they were indeed unique images in discord but only one image was saved locally. I've checked a different channel too and the same thing happened. I can't remember if it's the default or not, but I store each channels images in its own folder
❯ ls -1U | head -10 670866322619498511-media 670866322619498511.json 704107851421057114-media 704107851421057114.json 704226060178292846-media 704226060178292846.json 862221868621758502-media 862221868621758502.json 862732083050971146-media 862732083050971146.json
It would be really great if you could implement an option to make each name unique. People that want to stick with the old method don't need to do anything if it's done this way.
Here's a set of images that end up with the same file names: unknown-F682B.png:
https://cdn.discordapp.com/attachments/98609319519453184/760954915362832414/unknown.png?ex=66aa987d&is=66a946fd&hm=3c3c51658e86c5f906cb4e22c78258836c4e1f3144086c5a7f1a261ced04c8e5&
https://cdn.discordapp.com/attachments/98609319519453184/1001287339462627408/unknown.png?ex=66aa3040&is=66a8dec0&hm=dfdf80af98d7fdb373cdb3591e1d8e106312200f1f93d2548f56fd7e185d3226&
unknown-A4DB4.png:
https://cdn.discordapp.com/attachments/98609319519453184/352621773524697090/unknown.png?ex=66aa385e&is=66a8e6de&hm=2eb1fd389f5522e18ec85a6a52e32fe3c2cfb364ac4daa1d4979405109c8a547&
https://cdn.discordapp.com/attachments/98609319519453184/826940484042620938/unknown.png?ex=66aab557&is=66a963d7&hm=0d7a7c51108596c84207a29b9d5b0b5f19503b7a43d92d83306c37da58825a5a&
unknown-F1191.png:
https://cdn.discordapp.com/attachments/98609319519453184/342358582458712064/unknown.png?ex=66aa74c6&is=66a92346&hm=dc2d387a9da0597b3a2668491b2b4a42032aafde351268c883d498ac562cfbaf&
https://cdn.discordapp.com/attachments/98609319519453184/532705112649629707/unknown.png?ex=66aa228c&is=66a8d10c&hm=a2e4e9cb280a8978aad52611b1f59276748815d6163e78ddf88c4cc8ae463fcd&
https://cdn.discordapp.com/attachments/98609319519453184/1024319357066682440/unknown.png?ex=66aa433d&is=66a8f1bd&hm=c51fc893d150ef8946d843c416c217816beed1cea4c7e30a7d92ceca86e7b53d&
unknown-7DBD3.png:
https://cdn.discordapp.com/attachments/98609319519453184/187127175315456002/unknown.png?ex=66aaa503&is=66a95383&hm=0dca9934bad285ecba5e5499b73399093056dffdf3ac43db3e3799c18f12250d&
https://cdn.discordapp.com/attachments/98609319519453184/418214803568328713/unknown.png?ex=66aa3923&is=66a8e7a3&hm=51ea7853501a6263624fe5830bf329b9aa9e72f5fb3ad584788e249233250552&
hmm that's concerning, considering I use this on bigger discord as well where collisions would probably be likely
Version
v2.42.8
Flavor
GUI (Graphical User Interface), CLI (Command-Line Interface)
Platform
Linux
Export format
HTML, TXT, JSON, CSV
Steps to reproduce
Export servers where large numbers of videos, images and especially screenshots are often posted. Observe that files named
image.png
orunknown.png
are extremely common due to copy-paste of images directly into Discord, often leaving no file name available.Do math to confirm the birthday problem is at play and will be causing you problems.
Details
The algorithm for making file names for downloaded media only stores 20 bytes of the hash of the URL. This is insufficient.
Despite there being 1048576 possible hashes, it only takes 1200 images for there to be a 50% chance of a collision and 500 images before there is a 10% chance of a collision due to the birthday problem. The original reasoning in the issue where this code was added (#395) is incorrect due to the commonness of
unknown.png
,image.png
, and other filenames.A list of all filenames I've seen that could reach that 10% chance of a collision in just my own exports of servers I'm in:
As can clearly be seen, file names are not evenly distributed. Notably,
maxresdefault
,sddefault
andhqdefault
are not being posted intentionally by users at all, and are instead the result of youtube thumbnails. Similarly,tenor
happens because the tenor gif button is pressed in Discord's own UI.There is a near 100% chance that there are collisions here between
unknown.png
,image.png
, andmaxresdefault.jpg
. You can check with this calculator: https://www.bdayprob.com/. SolveP(D,N)
withD = 1048576
andN = 6637
.10 characters provides a better margin of safety: 15000 images with the same name before there is even a 0.01% chance, which would have successfully handled my use case.
10 characters of base-32 rather than base-16 would provide 600000 images before there is the same chance of collision. 8 characters of base-32 would provide the same margin as the 10 characters of base-16.
Checklist