Truncated filenames that are the same after 47 characters

andyjohnson0 / TakeoutExtractor

Extract content files from a Google Takeout archive, rename files with a consistent naming scheme and add missing metadata. Runs on Windows and macOS.

Truncated filenames that are the same after 47 characters #23

Open mholt opened 3 months ago

mholt commented 3 months ago

Hey Andy, thanks for making this great utility. I've been using it as a reference for some of my own code.

One of my important albums (wedding!) has photos whose names are all longer than 50 chars, and the first 48 characters are all the same. I've found that the filename truncation applied by Google Photos Takeout is... problematic: because it truncates after (at?) 46 characters, I don't know if it's possible to reconstitute the original filename. For example, abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUdifferentStuffHere.jpg would be truncated at about abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTU(N).jpg and thus any unique part of the filename is lost.

Do you know if there's a way to correlate media files with their sidecar JSON files in this case?

This isn't a bug with your tool, more of just a question based on your experience.
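
To make the collision concrete, here is a minimal Go sketch of the truncation described above. The helper truncateName and the constant maxBaseLen are hypothetical names, and the cut-off of 47 characters is an assumption taken from the issue title (the exact limit is not confirmed anywhere in this thread). It also assumes ASCII filenames, since it slices by byte.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// maxBaseLen is an ASSUMED truncation length. The thread is unsure whether
// the cut falls at 46, 47, or 48 characters, so adjust as needed.
const maxBaseLen = 47

// truncateName keeps at most maxBaseLen bytes of the base name and
// re-attaches the extension. It assumes ASCII-only names.
func truncateName(name string) string {
	ext := filepath.Ext(name)
	base := strings.TrimSuffix(name, ext)
	if len(base) > maxBaseLen {
		base = base[:maxBaseLen]
	}
	return base + ext
}

func main() {
	a := truncateName("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUdifferentStuffHere.jpg")
	b := truncateName("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUotherStuffHere.jpg")
	fmt.Println(a)
	// Both names collapse to the same truncated name, so the unique part is gone.
	fmt.Println(a == b)
}
```

Two different originals map to one truncated name, which is why the (N) suffix alone has to carry the distinction.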

mholt commented 3 months ago

I think I figured this out.

Google assigns the "uniqueness suffix" (N) by incrementing it in the order it goes through the files. My program used lexical iteration by default (that's what the Go std lib does), but Google seems to use some sort of natural sort, where [a2 a20 a1 a10] is sorted as [a1 a2 a10 a20] instead of [a1 a10 a2 a20]. I also had to account for length first, then natural sort within filenames of the same length. With that, I was able to successfully correlate the media files with their sidecar files in this case. It just took counting how many of each truncated filename I saw as I iterated the list of files in the same order Google did when it generated the archive.

Of course, that's totally undocumented and Google could change it at any time, but this does seem to work for me.
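
The ordering and counting described above could be sketched roughly like this in Go. This is not the actual code from either project. The names naturalLess, takeoutLess, and assignTruncatedNames are hypothetical, the truncation length is passed in as a parameter because it is unconfirmed, and the sketch assumes that the first file with a given truncated name gets no suffix while later ones get (1), (2), and so on. It also assumes ASCII filenames.

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
	"strings"
)

func isDigit(c byte) bool { return c >= '0' && c <= '9' }

// digitRun returns the index just past the run of digits starting at i.
func digitRun(s string, i int) int {
	for i < len(s) && isDigit(s[i]) {
		i++
	}
	return i
}

// naturalLess compares digit runs numerically and everything else bytewise,
// so "a2" sorts before "a10".
func naturalLess(a, b string) bool {
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		if isDigit(a[i]) && isDigit(b[j]) {
			ei, ej := digitRun(a, i), digitRun(b, j)
			na := strings.TrimLeft(a[i:ei], "0")
			nb := strings.TrimLeft(b[j:ej], "0")
			if len(na) != len(nb) {
				return len(na) < len(nb) // fewer digits means a smaller number
			}
			if na != nb {
				return na < nb
			}
			i, j = ei, ej
			continue
		}
		if a[i] != b[j] {
			return a[i] < b[j]
		}
		i++
		j++
	}
	return len(a)-i < len(b)-j // the string that ends first sorts first
}

// takeoutLess orders by length first, then by natural sort within a length.
func takeoutLess(a, b string) bool {
	if len(a) != len(b) {
		return len(a) < len(b)
	}
	return naturalLess(a, b)
}

// assignTruncatedNames maps each original name to the name it is ASSUMED to
// have in the archive, by counting truncated names in takeoutLess order.
func assignTruncatedNames(names []string, maxBase int) map[string]string {
	sorted := append([]string(nil), names...)
	sort.SliceStable(sorted, func(i, j int) bool { return takeoutLess(sorted[i], sorted[j]) })
	seen := map[string]int{}
	out := make(map[string]string, len(names))
	for _, name := range sorted {
		ext := filepath.Ext(name)
		base := strings.TrimSuffix(name, ext)
		if len(base) > maxBase {
			base = base[:maxBase]
		}
		n := seen[base+ext]
		seen[base+ext]++
		if n == 0 {
			out[name] = base + ext // assumption: first occurrence has no suffix
		} else {
			out[name] = fmt.Sprintf("%s(%d)%s", base, n, ext)
		}
	}
	return out
}

func main() {
	// A tiny limit of 5 keeps the example readable; the real limit is longer.
	m := assignTruncatedNames([]string{"photoA2.jpg", "photoA10.jpg", "photoA1.jpg"}, 5)
	for _, name := range []string{"photoA1.jpg", "photoA2.jpg", "photoA10.jpg"} {
		fmt.Println(name, "->", m[name])
	}
}
```

Sorting by length before the natural comparison matters here: with plain lexical order, photoA10.jpg would come before photoA2.jpg and the suffixes would be assigned to the wrong files.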

andyjohnson0 commented 3 months ago

Hi Matt,

It's a tricky problem, and I think your solution is probably the best approach. But, as always with this data, it's hard to be confident that there aren't edge cases that would create a false match. It's quite difficult to imagine a more poorly designed export format: it's almost as if the photos team resented the takeout team and decided to deliberately mess with them.

Looking at my unit tests, I'm fairly sure I need to improve the ability to match sidecars in the scenarios you've described. For that reason, I'll re-open the ticket as a reminder.

Thanks for the kind words - I'm glad what I came up with has been useful to you.