JuliaIO / Tar.jl

TAR files: create, list, extract them in pure Julia
MIT License
80 stars 19 forks source link

Add skip_symlink keyword to tree_hash #165

Closed mkitti closed 6 months ago

mkitti commented 6 months ago

This adds a skip_symlink keyword to tree_hash.

This allows for the calculation of a tree hash without symlinks. This allows for the detection of a failure condition when the operating system or file system does not support symlinks.

julia> using CodecZlib, Tar

julia> open(GzipDecompressorStream, "P4est.v2.8.1.x86_64-w64-mingw32-mpi+microsoftmpi.tar.gz") do io
           tree_hash = Tar.tree_hash(io)
       end
"89a337ea6f60a4fd58999ab73dea099e41032138"

julia> open(GzipDecompressorStream, "P4est.v2.8.1.x86_64-w64-mingw32-mpi+microsoftmpi.tar.gz") do io
           tree_hash = Tar.tree_hash(io; skip_symlink=true)
       end
"ed75b82e0dd9b53c4ac4e70376f3e6f330c72767"

xref: https://github.com/JuliaLang/Pkg.jl/issues/3643#issuecomment-1880017100

codecov[bot] commented 6 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (6269b5b) 97.28% compared to head (9f3fd9a) 97.53%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #165 +/- ## ========================================== + Coverage 97.28% 97.53% +0.24% ========================================== Files 4 4 Lines 810 811 +1 ========================================== + Hits 788 791 +3 + Misses 22 20 -2 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

giordano commented 6 months ago

My understanding of the problem was that the error reported was when computing the tree hash of the unpacked archive, not inside a tarball.

mkitti commented 6 months ago

Yes, I understand. I'm responding to Elliott Saba's comment:

the condition hash_mismatch && cannot_create_symlinks is insufficient, you need to know that the reason the hash check failed is because you could not create a symlink

In order to verify that the reason for the mismatch is the symlink issue, we need something to compare the calculated hash of the unpacked archive to. With this option, we can now compare the calculated hash to tree hash of the tar ball without symlinks. If they match, then we can pretty certain that the issue is symlinks.

Compare "ed75b82e0dd9b53c4ac4e70376f3e6f330c72767" with the reported error message.

  Expected git-tree-sha1:   89a337ea6f60a4fd58999ab73dea099e41032138
  Calculated git-tree-sha1: ed75b82e0dd9b53c4ac4e70376f3e6f330c72767

From the above Tar.tree_hash(io; skip_symlink=true) I can now verify that the calculated tree hash is due to a lack of symlinks.

nhz2 commented 6 months ago

I think the keyword should be copy_symlinks to match the keyword used in Tar.extract.

mkitti commented 6 months ago

Well we are not copying anything, but that may depend on the circumstance. If the symlink is not resolvable, then copy can be made.

Does 7zip or other software copy the contents of files when they are internally resolvable? Is copy_symlinks actually used with extract somewhere?

nhz2 commented 6 months ago

Yes, copy_symlinks is used when extracting the iso_codes artifact: https://github.com/JuliaBinaryWrappers/iso_codes_jll.jl/releases/download/iso_codes-v4.15.0+0/iso_codes.v4.15.0.any.tar.gz on Windows when symlinks can't be created.

Typically, Pkg calls Tar.extract with copy_symlinks=nothing, and in that case Tar does a check to see if symlinks can be made here: https://github.com/JuliaIO/Tar.jl/blob/6269b5b8c4c863b1fe2beef53666273e3bbc021b/src/Tar.jl#L244-L246

I tried extracting the iso_codes artifact with "Developer Mode" off on Windows using 7zip and the file explorer built-in tar extractor and both gave error messages about not having permission to make symlinks.

nhz2 commented 6 months ago

Interestingly even with "Developer Mode" on, 7zip has an error when extracting the P4est artifact.

Dangerous link path was ignored : bin\libmsmpi.dll : ..\..\artifacts\454b73c5b1914776f73d2eae1b5c16c9a8bfc082\bin\msmpi.dll

image
giordano commented 6 months ago

If I'm reading the message correctly, that looks like an outside-of-tree symlink, and a bad one because it could easily be also broken

giordano commented 6 months ago

This line is fishy: https://github.com/JuliaPackaging/Yggdrasil/blob/4d82df7a64f9535fdba24119de35cb4c1eecc999/P/P4est/build_tarballs.jl#L51

mkitti commented 6 months ago

By the way, did you know that Windows comes with a BSD tar command line utility.

What does that do? https://learn.microsoft.com/en-us/virtualization/community/team-blog/2017/20171219-tar-and-curl-come-to-windows

nhz2 commented 6 months ago

BSD tar creates the symlink bin/libmsmpi.dll -> ../../artifacts/454b73c5b1914776f73d2eae1b5c16c9a8bfc082/bin/msmpi.dll

In this case, I think 7zip is just being overly cautious to avoid writing files outside the tree. Though the symlink is pretty fishy.

mkitti commented 6 months ago

The symlinks are not always skipped. We need to follow them to get the simulated windows hash.