Open Lymia opened 4 months ago
Any update on this? I personally use this patch to back up a few servers that trigger this bug.
Can't really change the naming algorithm as it will break "skip downloaded assets" for all existing exports.
That is, unfortunately, the point. "Skip Downloaded Assets" as it currently works has a severe risk of incorrectly skipping non-duplicated assets, and there's no good way to fix that without invalidating past downloads. Do you think it'd work to put this behind a command line flag, instead of changing the default behavior?
I think your concern is statistically valid, but there haven't yet been any bug reports regarding asset filename collisions. If/when that happens, it will make sense to introduce the breaking change to fix the issue. I want to make sure we're fixing real problems when we add new ones (by breaking compatibility) 🙂
Maybe down the line there will be other related breaking changes, then it will make sense to group them together. Right now I don't see it happening any time soon.
"haven't been any bug reports in regard to asset filename collision"
Would there be an error, or would the file just be silently skipped? In a bigger Discord server, that would be very easy to miss if it was just silently skipped.
There wouldn't be an error now, but that can be changed.
I think that's a nice compromise :)
I have a single channel with 8774 conflicts. This command counts how many times each filename occurred:
❯ cat 98609319519453184.json | jq -r '.messages[].attachments[].url' | sort | uniq -c | sort -nr > discord-98609319519453184-dupes.txt
This command removes the lines with only 1 instance of a filename then sums up all of the file counts for files with conflicts:
❯ cat discord-98609319519453184-dupes.txt | awk '{print $1}' | grep -vE '^1$' | paste -sd+ - | bc
17453
Then count the number of filenames that occur more than once:
❯ cat discord-98609319519453184-dupes.txt | awk '{print $1}' | grep -vE '^1$' | wc -l
8679
17453 - 8679 = 8774, i.e. the number of files that were lost because another file had already taken their name.
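For anyone who would rather not chain shell tools, the same arithmetic can be sketched in Python. This is only an illustration: `count_conflicts` is a name made up here, and the JSON layout (`messages[].attachments[].url`) is assumed from the `jq` filter above rather than from any documented schema.

```python
import json
import sys
from collections import Counter


def count_conflicts(urls):
    """Return (total occurrences of repeated names, distinct repeated names, lost files)."""
    counts = Counter(urls)
    # Keep only names that occur more than once, like `grep -vE '^1$'` above.
    repeated = {name: n for name, n in counts.items() if n > 1}
    total = sum(repeated.values())   # 17453 in the example above
    distinct = len(repeated)         # 8679 in the example above
    # Every repeated name keeps exactly one file on disk; the rest are lost.
    return total, distinct, total - distinct


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumed layout: {"messages": [{"attachments": [{"url": ...}, ...]}, ...]}
    with open(sys.argv[1], encoding="utf-8") as f:
        export = json.load(f)
    urls = [a["url"] for m in export["messages"] for a in m["attachments"]]
    print(count_conflicts(urls))
```

Run it with the path to a channel's JSON export as the only argument; the third number printed should match the subtraction above.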
I manually checked some of the messages that had duplicate filenames in Discord, and they were indeed unique images in Discord, but only one image was saved locally. I've checked a different channel too, and the same thing happened. I can't remember if it's the default or not, but I store each channel's images in its own folder:
❯ ls -1U | head -10
670866322619498511-media
670866322619498511.json
704107851421057114-media
704107851421057114.json
704226060178292846-media
704226060178292846.json
862221868621758502-media
862221868621758502.json
862732083050971146-media
862732083050971146.json
It would be really great if you could implement an option to make each name unique. People that want to stick with the old method don't need to do anything if it's done this way.
@Twi-Hard please add this info to the linked issue (#1231). If you can, please also provide a few different URLs that resolve to the same hash/filename with the current algorithm.
This changes the filenames output by DiscordChatExporter to be more meaningful (in the case of emojis and avatars) and more resilient against collisions.
The hash is now 12 characters of base32, rather than 5 characters of hexadecimal. This allows for nearly 5 million downloads with the same name before there is even a 0.001% chance of a single collision. This should be enough, even for problematic filenames like those associated with YouTube thumbnails.
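The figure above can be sanity-checked with the usual birthday-bound approximation. The sketch below assumes the hash output is uniformly distributed and that 12 base32 characters carry 12 × 5 = 60 bits; `collision_probability` is an illustrative name, not something from the patch.

```python
import math


def collision_probability(n, bits):
    """Approximate chance of at least one collision among n uniform hashes of `bits` bits."""
    pairs = n * (n - 1) / 2
    # 1 - exp(-pairs / 2**bits), computed via expm1 to stay accurate for tiny values.
    return -math.expm1(-pairs / 2 ** bits)


if __name__ == "__main__":
    new_bits = 12 * 5  # 12 base32 characters
    old_bits = 5 * 4   # 5 hexadecimal characters
    print(f"new, 5,000,000 files: {collision_probability(5_000_000, new_bits):.6%}")
    print(f"old, 1,000 files:     {collision_probability(1_000, old_bits):.1%}")
```

Under these assumptions the 60-bit hash stays at roughly 0.001% for 5 million same-named files, while the old 20-bit hash already has a substantial chance of a collision after about a thousand.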
Emojis and Discord attachments, by contrast, are guaranteed not to collide, since their filenames contain the unique snowflake of the attachment or emoji. The 19-digit ID is significantly longer than the old 5-character hash; however, re-encoding it in base32 would only save 6 characters, so it's better to use the more recognizable numeric form, IMO.
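The "only 6 characters" estimate is easy to check. The sketch below uses a made-up 19-digit snowflake and a plain `0-9a-v` base32 alphabet; both are assumptions for illustration, and the alphabet actually used by the patch may differ (which would not change the length).

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuv"  # assumed alphabet, 32 symbols


def to_base32(value):
    """Encode a non-negative integer in base32, most significant digit first."""
    if value == 0:
        return ALPHABET[0]
    digits = []
    while value > 0:
        value, remainder = divmod(value, 32)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


if __name__ == "__main__":
    snowflake = 1234567890123456789  # hypothetical 19-digit ID
    decimal = str(snowflake)
    encoded = to_base32(snowflake)
    print(decimal, len(decimal))
    print(encoded, len(encoded))
    print("characters saved:", len(decimal) - len(encoded))
```

Any 19-digit value needs 13 base32 characters, so the saving is 6 characters at the cost of an ID that can no longer be pasted straight into Discord.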
Closes #1231