alexcrichton / tar-rs

Tar file reading/writing for Rust
https://docs.rs/tar
Apache License 2.0
625 stars 184 forks source link

header: set entry_size() to 0 for hardlinks #314

Open npajkovsky opened 1 year ago

npajkovsky commented 1 year ago

Fix the error: "numeric field was not a number: t 'Regard' when getting cksum for 'a text ...'. The error is caused by wrongly setting the self.next position due to accounting harlinks size.

According to man 5 tar, the size of the hard-links should be set to 0.

size Size of file, as octal number in ASCII. For regular files only, this indicates the amount of data that follows the header. In particular, this field was ignored by early tar implementations when extracting hardlinks. Modern writers should always store a zero length for hardlink entries.

But since the writer wasn't modern, the entry_size is 64700 which causes miscalculation of the self.next.

[tar-rs/src/archive.rs:372] &entry.header() = UstarHeader { entry_size: 64700, size: 64700, path: "some/path", link_name: Some( "some/link", ), mode: 0o640, uid: 1058, gid: 1061, mtime: 1673424346, username: Some( "example", ), groupname: Some( "example", ), device_major: Some( 9, ), device_minor: Some( 2, ), cksum: 24700, cksum_valid: true, } [tar-rs/src/archive.rs:373] entry.header().entry_type() = Link

Closes: #313

alexcrichton commented 1 year ago

If the source of this is the man page then I would interpret that as the archive in question being an invalid archive rather than one that should be silently handled to work instead. Does the tar CLI, though, handle this case? Would you be able to find in its source code where it handles this case?

npajkovsky commented 1 year ago

Both bsdtar and tar handles this tarball correctly.

npajkovsky commented 1 year ago

libarchive does this

libarchive/libarchive/archive_read_support_format_tar.c

        /*
          * The following may seem odd, but: Technically, tar
          * does not store the file type for a "hard link"
          * entry, only the fact that it is a hard link.  So, I
          * leave the type zero normally.  But, pax interchange
          * format allows hard links to have data, which
          * implies that the underlying entry is a regular
          * file.
          */
         if (archive_entry_size(entry) > 0)
             archive_entry_set_filetype(entry, AE_IFREG);

         /*
          * A tricky point: Traditionally, tar readers have
          * ignored the size field when reading hardlink
          * entries, and some writers put non-zero sizes even
          * though the body is empty.  POSIX blessed this
          * convention in the 1988 standard, but broke with
          * this tradition in 2001 by permitting hardlink
          * entries to store valid bodies in pax interchange
          * format, but not in ustar format.  Since there is no
          * hard and fast way to distinguish pax interchange
          * from earlier archives (the 'x' and 'g' entries are
          * optional, after all), we need a heuristic.
          */
         if (archive_entry_size(entry) == 0) {
             /* If the size is already zero, we're done. */
         }  else if (a->archive.archive_format
             == ARCHIVE_FORMAT_TAR_PAX_INTERCHANGE) {
             /* Definitely pax extended; must obey hardlink size. */
         } else if (a->archive.archive_format == ARCHIVE_FORMAT_TAR
             || a->archive.archive_format == ARCHIVE_FORMAT_TAR_GNUTAR)
         {
             /* Old-style or GNU tar: we must ignore the size. */
             archive_entry_set_size(entry, 0);
             tar->entry_bytes_remaining = 0;
         } else if (archive_read_format_tar_bid(a, 50) > 50) {
             /*
              * We don't know if it's pax: If the bid
              * function sees a valid ustar header
              * immediately following, then let's ignore
              * the hardlink size.
              */
             archive_entry_set_size(entry, 0);
             tar->entry_bytes_remaining = 0;
         }
         /*
          * TODO: There are still two cases I'd like to handle:
          *   = a ustar non-pax archive with a hardlink entry at
          *     end-of-archive.  (Look for block of nulls following?)
          *   = a pax archive that has not seen any pax headers
          *     and has an entry which is a hardlink entry storing
          *     a body containing an uncompressed tar archive.
          * The first is worth addressing; I don't see any reliable
          * way to deal with the second possibility.
          */
         break;
npajkovsky commented 1 year ago

And here https://git.savannah.gnu.org/cgit/tar.git/tree/src/list.c#n442

alexcrichton commented 1 year ago

Thanks! I'm not sure if those line up though? Your second link unconditionally sets the size to zero, and the first code you copy/pasted only sets the size to zero for non-ustar archives. The PR, however, unconditionally sets it to zero. Is the copy/pasted code outdated or otherwise no longer in use?

npajkovsky commented 1 year ago

The copy/pasted was copied directly from the libarchive git repo HEAD. I think the interesting part is

} else if (archive_read_format_tar_bid(a, 50) > 50) {
             /*
              * We don't know if it's pax: If the bid
              * function sees a valid ustar header
              * immediately following, then let's ignore
              * the hardlink size.
              */
             archive_entry_set_size(entry, 0);
             tar->entry_bytes_remaining = 0;
         }