Improve filenames for downloaded assets

Lymia commented 4 months ago

This changes the filenames output by DiscordChatExporter to be more meaningful (in the case of emojis and avatars) and more resilient against collisions.

The hash is 12 characters of base32 now, rather than 5 characters of hexadecimal. This allows for nearly 5 million downloads with the same name before there is even a 0.001% of a single collision. This should actually be enough, even for problematic filenames like those associated with Youtube thumbnails.

Emojis and Discord attachments are instead guaranteed to not collide as they instead contain the unique snowflake of the attachment or emoji. The 19 digit id is significantly longer than the old 5 digit ids - however, reencoding it in base32 would only save 6 characters, so better to use the more recognizable numeric form IMO.

Closes #1231

Lymia commented 2 months ago

Any update on this? I personally do use this patch to backup a few servers that trigger this bug.

Tyrrrz commented 2 months ago

Can't really change the naming algorithm as it will break "skip downloaded assets" for all existing exports.

Lymia commented 2 months ago

That is, unfortunately, the point. "Skip Downloaded Assets" as it currently works has a severe risk of incorrectly skipping non-duplicated assets, and there's no good way to fix that without invalidating past downloads. Do you think it'd work to put this behind a command line flag, instead of changing the default behavior?

Tyrrrz commented 2 months ago

I think your concern is statistically valid, but there yet haven't been any bug reports in regard to asset filename collision. If/when that happens, then it makes sense to introduce the breaking change to fix such an issue. I want to make sure we're fixing real problems when we add new ones (by breaking compatibility) 🙂

Maybe down the line there will be other related breaking changes, then it will make sense to group them together. Right now I don't see it happening any time soon.

ofifoto commented 2 months ago

haven't been any bug reports in regard to asset filename collision

would there be an error? or would the file just silently be skipped? in a bigger discord that would be very easy to miss if it was just silently skipped

Tyrrrz commented 2 months ago

would there be an error? or would the file just silently be skipped? in a bigger discord that would be very easy to miss if it was just silently skipped

There wouldn't be an error now, but that can be changed.

ofifoto commented 2 months ago

There wouldn't be an error now, but that can be changed.

I think that's a nice compromise :)

Twi-Hard commented 2 months ago

I have a single channel with 8774 conflicts. This command counts how many times each filename occurred:

❯ cat 98609319519453184.json | jq -r '.messages[].attachments[].url' | sort | uniq -c | sort -nr > discord-98609319519453184-dupes.txt

This command removes the lines with only 1 instance of a filename then sums up all of the file counts for files with conflicts:

❯ cat discord-98609319519453184-dupes.txt | awk '{print $1}' | grep -vE '^1$' | paste -sd+ - | bc
17453

Then count the number of filenames that occur more than once:

❯ cat discord-98609319519453184-dupes.txt | awk '{print $1}' | grep -vE '^1$' | wc -l
8679

17453-8679 = 8774

I manually checked some of the messages that had duplicate filenames in discord and they were indeed unique images in discord but only one image was saved locally. I've checked a different channel too and the same thing happened. I can't remember if it's the default or not, but I store each channels images in its own folder

❯ ls -1U | head -10
670866322619498511-media
670866322619498511.json
704107851421057114-media
704107851421057114.json
704226060178292846-media
704226060178292846.json
862221868621758502-media
862221868621758502.json
862732083050971146-media
862732083050971146.json

It would be really great if you could implement an option to make each name unique. People that want to stick with the old method don't need to do anything if it's done this way.

Tyrrrz commented 2 months ago

@Twi-Hard please add this info to the linked issue (#1231). If you can, please also provide a few different URLs that resolve to the same hash/filename with the current algorithm.

Tyrrrz / DiscordChatExporter

Improve filenames for downloaded assets #1232