cff29546 / pzmap2dzi

A command-line tool to convert Project Zomboid map data into Deep Zoom format
MIT License

Deduplication to save space/time? #8

Open shughes-uk opened 9 months ago

shughes-uk commented 9 months ago

A quick check with rdfind suggests that a decent chunk of the final map files are duplicates of each other.

A post-processing step that replaces the duplicates with symlinks (I'm not sure what the Windows equivalent is) would be neat; a rough sketch is below.
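
Something along these lines, perhaps (a minimal sketch in Python, assuming a tree of rendered .png tiles; the directory walk, the MD5 choice, and the hard-link fallback are my own assumptions, not anything pzmap2dzi does today):

import hashlib
import os
from pathlib import Path

def dedupe_tiles(root):
    # Replace duplicate tile files under `root` with links to the first copy seen.
    # Post-processing only: it hashes finished files, so it saves disk, not render time.
    seen = {}  # content hash -> first path holding that content
    for path in Path(root).rglob('*.png'):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        original = seen.setdefault(digest, path)
        if original is path:
            continue  # first time we see this content, keep the file
        path.unlink()
        try:
            # Relative symlink so the tree can be moved or served as-is.
            path.symlink_to(os.path.relpath(original, path.parent))
        except OSError:
            # e.g. Windows without symlink privilege: fall back to a hard link.
            os.link(original, path)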

I haven't looked much at the internals, but you could possibly save even more time with a pre-processing step that avoids rendering duplicates in the first place. Hashing (e.g. MD5) the component parts/positions of a tile and storing it in a cache seems like it would work well: on a cache hit you skip rendering the tile and output a symlink instead, roughly like the sketch below.
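
A minimal sketch of that idea, assuming the renderer can produce some JSON-serialisable description of a tile's composition (the tile_parts argument and the render callback here are hypothetical, not the project's actual API):

import hashlib
import json
import os

def render_or_link(tile_parts, out_path, cache, render):
    # Render a tile only if this exact composition has not been rendered before.
    # `tile_parts` is an assumed description of the sprites/positions in the tile;
    # `render(out_path)` stands in for whatever function actually draws it.
    key = hashlib.md5(json.dumps(tile_parts, sort_keys=True).encode()).hexdigest()
    if key in cache:
        # Cache hit: emit a symlink to the previously rendered tile instead of drawing.
        target = os.path.relpath(cache[key], os.path.dirname(out_path))
        os.symlink(target, out_path)
    else:
        render(out_path)
        cache[key] = out_path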

Thoughts?

shughes-uk commented 9 months ago

One further thought: you might even be able to speed up map loading in the browser by having the server return an HTTP redirect when it detects a symlink, instead of sending the file again. That way browser caching will kick in!

Not sure how that would work for the statically hosted version.

cff29546 commented 9 months ago

Windows supports both symbolic links and hard links, so that could work there too. Can you share the results of rdfind? I'm curious how much space it could save.

shughes-uk commented 9 months ago

It was about 15%; I'll have to run it again for the exact output. I was also investigating swapping certain near-identical tiles for truly identical ones in the hope of getting even better results, but I ran out of time.

I think if you swapped some of the grass tiles for something uniform, it would look less nice but you'd achieve really good "compression".

shughes-uk commented 9 months ago

I did modify the Flask server to return a redirect when it encounters a symbolic link, and it worked great.
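
For reference, roughly what that route can look like (a minimal plain-Flask sketch; the /tiles URL prefix and TILE_ROOT path are made up, and this is not the project's actual server code):

import os
from flask import Flask, redirect, send_from_directory, url_for

app = Flask(__name__)
TILE_ROOT = '/srv/pzmap/html/base'  # assumed tile directory

@app.route('/tiles/<path:name>')
def tile(name):
    full = os.path.join(TILE_ROOT, name)
    if os.path.islink(full):
        # Redirect to the canonical tile so the browser can reuse its cached copy.
        target = os.path.relpath(os.path.realpath(full), TILE_ROOT)
        return redirect(url_for('tile', name=target.replace(os.sep, '/')), code=301)
    return send_from_directory(TILE_ROOT, name)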

cff29546 commented 9 months ago

It is possible to remove all ground cover and use a single type of tree with a small change.

However, if space is the concern, dropping the lowest level of the image pyramid saves about 75% of the space: each level holds roughly four times as many tiles as the level above it, so the bottom level accounts for about three quarters of the total. That is already supported with the following config:

omit_levels: 1
enable_cache: true
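
A quick back-of-the-envelope check of that 75% figure (plain Python; the pyramid depth is arbitrary, just for illustration):

# A Deep Zoom pyramid roughly quadruples the tile count at each level,
# so the bottom (highest-resolution) level dominates the total size.
levels = [4 ** i for i in range(8)]    # tiles per level, top to bottom
print(sum(levels[:-1]) / sum(levels))  # fraction kept after omitting the lowest level, ~0.25
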
cff29546 commented 9 months ago

I checked with rdfind; it's actually a 20% (76/379 GiB) saving on the base map.

G:\pzmap\html\base>debian run rdfind .
Now scanning ".", found 1432429 files.
Now have 1432429 files in total.
Removed 0 files due to nonunique device and inode.
Total size is 406939651011 bytes or 379 GiB
Removed 88006 files due to unique sizes from list.1344423 files left.
Now eliminating candidates based on first bytes:removed 29471 files from list.1314952 files left.
Now eliminating candidates based on last bytes:removed 949759 files from list.365193 files left.
Now eliminating candidates based on sha1 checksum:removed 3455 files from list.361738 files left.
It seems like you have 361738 files that are not unique
Totally, 76 GiB can be reduced.
Now making results file results.txt

shughes-uk commented 9 months ago

Nice! I was only running it on a partial map to preserve space on my laptop.

Honestly, I think the biggest win from this would be reducing bandwidth usage for a deployed version; I'd be sad to lose all the zoom levels even if it cuts costs.

Removing all ground cover sounds like it might increase the percentage a lot, but I'm not sure a single tree type would save much, if anything.