dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.94k stars 4.64k forks source link

Tar APIs pending feedback to address #68230

Open carlossanlop opened 2 years ago

carlossanlop commented 2 years ago

Initial PR changes were introduced here: https://github.com/dotnet/runtime/pull/67883

The following is pending feedback I need to address (not urgent, for 8.0, mostly just additional testing coverage):

Perf improvements:

Additional test coverage:

Done:

dotnet-issue-labeler[bot] commented 2 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-io See info in area-owners.md if you want to be subscribed.

Issue Details
I received feedback in my first Tar PR https://github.com/dotnet/runtime/pull/67883 that I would like to address in a follow-up PR. - [ ] Address chmod suggestions by @tmds . - [ ] Address stackalloc and stream-related suggestions by @stephentoub . - [ ] Implement UName and GName from the current process' gid and uid, maybe use GetNameFromUid and (new) GetNameFromGid. - [ ] Make sure I have full tests for individual extraction of all data types. - [ ] Add a symlink test that has a link with an absolute path to a target. - [ ] Make sure that when archiving an executable, then extracting it, the executable mode bit gets properly preserved. - [ ] Add test that reads an archive containing an unsupported entry type (no writing). - [ ] Add test that advances DataStream a bit before extracting into filesystem, verify data was written starting from that position. - [ ] Add test that ensures a hidden file can be used to create an entry from file. - [ ] Verify these GNU fields are written in the data stream: AllGnuUnused = Offset + LongNames + Unused + Sparse + IsExtended + RealSize. - [ ] Add test that ensures that a GNU archive (generated with tar tool) containing unused GNU bytes (sparse, etc) get preserved when written to another GNU archive. - [ ] Add simple tests to verify uncompressing a tar.gz works. - [ ] Add test to extract entries to disk one by one. - [ ] Verify extended attributes in all PAX tests. - [ ] Add Ustar checksum test. - [ ] Add Pax checksum test. - [ ] Add Gnu checksum test. - [ ] Find all the stream.Read calls and make sure we read all that was expected (use similar logic to what's being done in ReadOrThrow). - [ ] Investigate why retrieving the devmajor and devminor with syscalls are not matching the expected values. - [ ] Add to the runtime-assets script an unarchived test with both a longlink and a longpath. - [ ] Add test that reads an archive from a file with both a longlink and a longpath. - [ ] Address documentation suggestions, make sure all exceptions and remarks are properly filled out. - [ ] Remove src csproj nullable enable after @eerhardt merges his PR. - [ ] Find out how (if possible) to add a file as a hardlink, because otherwise, it can only be created directly as an entry, not by reading it from the filesystem. - [ ] Think of a better way to verify DateTimeOffsets in tests. - [ ] Implement async TarFile APIs. - [ ] Implement async TarEntry APIs. - [ ] Implement async TarWriter APIs. - [ ] Implement async TarReader APIs. - [ ] Share WrappedStream in Common as suggested by @danmoseley . - [ ] Discuss if the TarFile extraction methods should extract something and throw at the end. - [ ] Discuss if we want to merge DevMajor and DevMinor into a single public API.
Author: carlossanlop
Assignees: carlossanlop
Labels: `area-System.IO`, `untriaged`
Milestone: -
Bio2hazard commented 2 years ago

On .NET 7 Preview 4, I've been testing the Tar APIs on some tar.gz files created on our teamcity agents (in ustar format) and ran into 2 issues:

Hope that helps.

carlossanlop commented 2 years ago

Thank you for your report, @Bio2hazard ! Any chance you can share a sample tar for each one of your experiments?

Bio2hazard commented 2 years ago

@carlossanlop yes I was able to reproduce it without sensitive information, thankfully. Here it is: testarchive.tar.gz

It has 3 files in it:

I was trying to benchmark it, so I'll just drop the benchmarking code in here. Goal is just to get the content byte[] for every file in the archive.

using System.Formats.Tar;
using System.IO.Compression;
using BenchmarkDotNet.Attributes;

namespace Benchmarks7;

[GcServer(true)]
[GcForce(true)]
[MemoryDiagnoser]
public class ReadFromTarGz
{
    private static byte[] fileData;

    public ReadFromTarGz()
    {
        fileData = File.ReadAllBytes(@"testarchive.tar.gz");
    }

    [Benchmark]
    public List<byte[]> WithSharpCompress()
    {
        var ret = new List<byte[]>();
        using (var fileStream = new MemoryStream(fileData))
        using (var stream = new System.IO.Compression.GZipStream(fileStream, CompressionMode.Decompress))
        using (var tarReader = SharpCompress.Readers.Tar.TarReader.Open(stream))
        {
            while (tarReader.MoveToNextEntry())
            {
                var entry = tarReader.Entry;
                if (entry.IsDirectory)
                {
                    continue;
                }

                byte[] entryData = new byte[entry.Size];
                using (var fileMemoryStream = new MemoryStream(entryData))
                using (var entryStream = tarReader.OpenEntryStream())
                {
                    entryStream.CopyTo(fileMemoryStream);
                }
                ret.Add(entryData);
            }
        }

        return ret;
    }

    [Benchmark]
    public List<byte[]> WithSystemsFormatTar_PreDecompressed()
    {
        var ret = new List<byte[]>();
        using var decompressedStream = new MemoryStream();

        using (var fileStream = new MemoryStream(fileData))
        using (var stream = new System.IO.Compression.GZipStream(fileStream, CompressionMode.Decompress))
        {
            stream.CopyTo(decompressedStream);
        }

        decompressedStream.Position = 0;

        using (var tarReader = new System.Formats.Tar.TarReader(decompressedStream))
        {
            System.Formats.Tar.TarEntry entry;
            while ((entry = tarReader.GetNextEntry()) != null)
            {
                if (entry.EntryType == TarEntryType.Directory)
                {
                    continue;
                }

                byte[] entryData = new byte[entry.Length];
                using (var fileMemoryStream = new MemoryStream(entryData))
                {
                    entry.DataStream.CopyTo(fileMemoryStream);
                }
                ret.Add(entryData);
            }
        }

        return ret;
    }

    [Benchmark]
    public List<byte[]> WithSystemsFormatTar_StreamedDecompression()
    {
        var ret = new List<byte[]>();
        using (var fileStream = new MemoryStream(fileData))
        using (var stream = new System.IO.Compression.GZipStream(fileStream, CompressionMode.Decompress))
        using (var tarReader = new System.Formats.Tar.TarReader(stream))
        {
            System.Formats.Tar.TarEntry entry;
            while ((entry = tarReader.GetNextEntry()) != null)
            {
                if (entry.EntryType == TarEntryType.Directory)
                {
                    continue;
                }

                byte[] entryData = new byte[entry.Length];
                using (var fileMemoryStream = new MemoryStream(entryData))
                {
                    entry.DataStream.CopyTo(fileMemoryStream);
                }
                ret.Add(entryData);
            }
        }

        return ret;
    }

    [Benchmark]
    public List<byte[]> WithSystemsFormatTar_StreamedDecompression_WithCopyData()
    {
        var ret = new List<byte[]>();
        using (var fileStream = new MemoryStream(fileData))
        using (var stream = new System.IO.Compression.GZipStream(fileStream, CompressionMode.Decompress))
        using (var tarReader = new System.Formats.Tar.TarReader(stream))
        {
            System.Formats.Tar.TarEntry entry;
            while ((entry = tarReader.GetNextEntry(true)) != null)
            {
                if (entry.EntryType == TarEntryType.Directory)
                {
                    continue;
                }

                byte[] entryData = new byte[entry.Length];
                using (var fileMemoryStream = new MemoryStream(entryData))
                {
                    entry.DataStream.CopyTo(fileMemoryStream);
                }
                ret.Add(entryData);
            }
        }

        return ret;
    }
}

Exception:

System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation.
 ---> System.FormatException: Entry 'testing-a-verylong.filename-that-cant-fit-in-a-single.tar-header-and-will-probably-fail-to-load-cor' was expected to be in the GNU format, but did not have the expected version data.
   at System.Formats.Tar.TarHeader.ReadVersionAttribute(Span`1 buffer)
   at System.Formats.Tar.TarHeader.TryGetNextHeader(Stream archiveStream, Boolean copyData)
   at System.Formats.Tar.TarReader.TryProcessGnuMetadataHeader(TarHeader header, Boolean copyData, TarHeader& finalHeader)
   at System.Formats.Tar.TarReader.TryGetNextEntryHeader(TarHeader& header, Boolean copyData)
   at System.Formats.Tar.TarReader.GetNextEntry(Boolean copyData)
   at Benchmarks7.ReadFromTarGz.WithSystemsFormatTar_PreDecompressed() in C:\Development\Benchmarks7\ReadFromTarGz.cs:line 65

I hope that helps!

carlossanlop commented 2 years ago

Hi again @Bio2hazard ,

What tool did you use to generate the tar file inside your gz file?

I inspected the internal tar file with the Hex Editor HxD. The magic and version metadata fields of the first 2 entries indicate the archive is in ustar format, but then it contains an unexpected GNU format entry with the TarType LongLink in the 3rd position (and then a 4th entry containing the actual data, because the LongLink entry is a metadata entry). Hence why you're getting the FormatException thrown: the reader detects that the archive is malformed for containing mixed format entries, and it is expected that the exception is thrown when attempting to read the 3rd one.

Here is what the spec has to say about the magic and version fields depending on the archive format:

For POSIX archives (ustar and pax)

     magic   Contains the magic value "ustar" followed by a NUL byte to indi-
         cate that this is a POSIX standard archive.  Full compliance re-
         quires the uname and gname fields be properly set.

     version
         Version.  This should be "00" (two copies of the ASCII digit
         zero) for POSIX standard archives.

And then a few lines below it says:

     Field termination is specified slightly differently by POSIX than by pre-
     vious implementations.  The magic, uname, and gname fields must have a
     trailing NUL. 

For GNU

 magic   The magic field holds the five characters "ustar" followed by a
         space.  Note that POSIX ustar archives have a trailing null.
version
         The version field holds a space character followed by a null.
         Note that POSIX ustar archives use two copies of the ASCII digit
         "0".

In other words:

I'm surprised that SharpCompress and 7-zip are able to open this malformed archive. This tells me that they are flexible when it comes to mixed format entries. The spec does not explicitly indicate the entries should all be in the same format, but I interpreted it this way considering there are metadata entries like L, K (GNU) or x, g (PAX) which precede other entries. Also, the Unix tar CLI tool does not mix format entries, and fails to write entries with long paths and long entries if you create an archive in v7 or ustar, because those formats do not support that.

Based on the above, I see one public API structure change we should make: If archives are allowed to have mixed entries as we are seeing in your example (or in other words, should not expect all entries to be of the same format), then we would have to remove the public Format property from TarReader, and we would have to move it to the base TarEntry class. That way, the user will have to check this property from each individual entry so they know to which format they should cast the returned entry.

Note: The TarWriter receives a Format in the constructor, and this is important, particularly when creating a PAX archive, because the user should be able to add global extended attributes which only show up at the beginning of the archive. So the constructor's format argument should stay. But if the user creates a V7 or a Ustar formatted archive with the writer, and suddenly decides to write an unsupported entry like a GnuTarEntry or a PaxTarEntry with a long path or a long link, we should still throw (current behavior), to prevent the user from intermixing formats, exactly like the Unix tar CLI tool does.

carlossanlop commented 2 years ago

@Bio2hazard I'm really curious how you generated your archive, because even if I address the above by allowing intermixing formats, the magic and version of your entries are malformed in the 3rd entry: they indicate ustar instead of in gnu format, even though the entry type is L which is exclusive to the gnu format. And once again, I'm very surprised that SharpCompress and 7-zip were kind enough to allow it 😄 .

Bio2hazard commented 2 years ago

@carlossanlop

Yikes, I was unaware that the archive is in a malformed format.

So the original file that I first encountered this problem with was created by TeamCity Enterprise. A build agent designated a folder as the output artifact, and TeamCity automatically turned the folder into a .tar.gz. I think these are the relevant docs in particular the target_archive part.

These are the relevant build logs:

[10:46:02] Publishing artifacts (13s)
[10:46:02] [Publishing artifacts] Collecting files to publish: [+:project/dist/dev=>project-dev-12345-67890-abcdefg.tar.gz]
[10:46:02] [Publishing artifacts] Creating archive project-dev-12345-67890-abcdefg.tar.gz (8s)
[10:46:02] [Creating archive project-dev-12345-67890-abcdefg.tar.gz] Creating /opt/jetbrains/TeamCity_agent/temp/buildTmp/TarPreprocessor1234567890123456789/project-dev-12345-67890-abcdefg.tar.gz

Of course, I can't exactly share our internal build artifacts, so what I did was:

image

My hope here was that modifying the file in this manner would cause 7z to use the same tar format/style as the original archive produced by TeamCity.

While I understand that that is a fairly ... unorthodox way of creating an archive, I am actually fairly confident it is a faithful reproduction. Debugging the actual TeamCity archive and the supplied test archive shows that they both fail in the same way with the same stack traces.

Furthermore, I had actually as a proof of concept created a very rudimentary manual tar extractor based on the GNU spec and it also got tripped up and failed at the long link part 😄

TeamCity is a big player in the CI/CD space, and the artifact publisher that produced this artifact is built-in and native. Given that I could not find any notable user discussions around problems with the artifact archives produced by TeamCity, I have to assume that it is very widely supported despite being malformed. Otherwise I am sure there would be plenty of discussions about it.

Please let me know if you need anything else. If you are concerned that the reproduction archive isn't accurate enough I can provide the header bytes of the original archive.

Edit: To further make sure the reproduction is accurate:

carlossanlop commented 2 years ago

Thanks for the information, @Bio2hazard . A few more questions:

So the original file that I first encountered this problem with was created by TeamCity Enterprise

Do you happen to know what was the original format of the archive? I couldn't find it in the documentation you shared. The reason I ask is because you then mentioned:

Through the 7z GUI I modified the contents of the tar by adding the 3 test files, and deleting all the proprietary build artifacts

Which I interpret as: you deleted all the original files that the TeamCity tool generated, then added new files using 7-zip. So it's 7-zip who's adding the (seemingly) malformed files, not TeamCity.

Furthermore, I had actually as a proof of concept created a very rudimentary manual tar extractor based on the GNU spec and it also got tripped up and failed at the long link part

Do you mean you tried reading the entries using the new TarReader API, or did you use some other tools? If you used TarReader and it failed, it was most likely because of the malformed magic and version. I should make it more flexible and avoid failing if the magic is not well formed, see if that helps with your case.

Given that I could not find any notable user discussions around problems with the artifact archives produced by TeamCity, I have to assume that it is very widely supported despite being malformed.

I don't think the archive created with the TeamCity tool is generating malformed entries. I think it was 7-zip, since you did not preserve any of the original artifacts.

  • opened the copy with the 7z GUI, removed a file and saved it
  • opened the copy with the 7z GUI a second time, re-added the file and saved it
  • compared all 512-byte tar headers between the original and the 7z modified copy: They all matched

These 3 steps are confusing: Are you saying that some of the files in the .tar.gz file you shared are the original ones that were added by the TeamCity tool? If yes, which one was not removed?

If all the files were replaced using 7-zip, then 7-zip is making a bad job at detecting the format of existing entries, and adding entries in whatever format it wants.

Still, if you're saying 7-zip and SharpCompress both support intermixed format entries in a single archive, then we should support them as well. So I'll start working on that.

Bio2hazard commented 2 years ago

Sorry, you may be right. I'll try to clarify.

I created a LINQPad Script (gist) to read the headers of the files in the tar archive and compare the headers between 2 archives.

I was trying to determine if 7z changes the tar headers when files are added through the 7z GUI, as you are suspecting.

I extracted and untar'd the original archive as produced by teamcity, then made a copy of it and through the 7z GUI deleted, and then re-added a single file (from the extracted original archive). In that test, the header did not show up as changed by 7z, but I'm not convinced it was working right.

I performed the experiment again today, this time removing and re-adding every file, and now the headers do show up differently, so 7z is modifying the tar headers.

At a glance, I can tell that some differences are that user / group are removed and that the file modes are different.

So for comparison, here is the Base64 of a tar header as created directly by TeamCity: YXNzZXRzLzAtNjA4ZWNhOWJiMWRlMjE4ZjFiZTUucG5nAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAxMDA2NDQgMDAzNTIzMCAwMDM1MjMwIDAwMDAwMTMzMzEwIDE0MjI1MDY1MzYwIDAyMDM0MwAgMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB1c3RhcgAwMHRlYW1jaXR5AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAdGVhbWNpdHkAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAwMDAwMDAwIDAwMDAwMDAgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=

And the Base64 of the tar header for the same file as created by 7z: YXNzZXRzLzAtNjA4ZWNhOWJiMWRlMjE4ZjFiZTUucG5nAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAxMDA3NzcAMDAwMDAwMAAwMDAwMDAwADAwMDAwMTMzMzEwADE0MjI1MDY1MzYwADAxMzIyMAAgMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB1c3RhcgAwMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=


Here is the Base64 of a long link tar header as created directly by TeamCity: Li8uL0BMb25nTGluawAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAxMDA2NDQgMDAwMDAwMCAwMDAwMDAwIDAwMDAwMDAwMTY0IDE0MjI1MDY1MzYwIDAxMTYzNwAgTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB1c3RhciAgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAwMDAwMDAwIDAwMDAwMDAgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=

And the corresponding Base64 of a long link tar header as created by 7z: Li8uL0BMb25nTGluawAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAxMDA3NzcAMDAwMDAwMAAwMDAwMDAwADAwMDAwMDAwMTY0ADE0MjI1MDY1MzYwADAxMDEwNgAgTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB1c3RhcgAwMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=


Finally, here is the Base64 of a long file name entry as created directly by TeamCity: YXNzZXRzL2ZlYXR1cmVzLmNoYW5uZWwtaGVhZGVyLmNvbXBvbmVudHMuc29jaWFsLWJ1dHRvbnMuY29tcG9uZW50cy5mb2xsb3dlci1lbW90ZXMtcHJvbXB0LTRlNzZjZmYyNjAxMDA2NDQgMDAzNTIzMCAwMDM1MjMwIDAwMDAwMDEzMzQwIDE0MjI1MDY1MzYwIDAzNjMzMwAgMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB1c3RhcgAwMHRlYW1jaXR5AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAdGVhbWNpdHkAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAwMDAwMDAwIDAwMDAwMDAgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=

And here is the corresponding Base64 of a long file name entry as created by 7z: YXNzZXRzL2ZlYXR1cmVzLmNoYW5uZWwtaGVhZGVyLmNvbXBvbmVudHMuc29jaWFsLWJ1dHRvbnMuY29tcG9uZW50cy5mb2xsb3dlci1lbW90ZXMtcHJvbXB0LTRlNzZjZmYyADAxMDA3NzcAMDAwMDAwMAAwMDAwMDAwADAwMDAwMDEzMzQwADE0MjI1MDY1MzYwADAzMTEyMgAgMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB1c3RhcgAwMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=

I think that should cover all cases as far as header differences goes.

It bears repeating that the original, completely unmodified, downloaded straight from teamcity .tar.gz archive failed to process with the new TarReader in the same manner as the test file I provided here. So even if the header of the teamcity entry is not malformed, it still failed in the same way.

Hope that helps and clears things up. I'm an SDE, not DevOps so I didn't have the option of producing a sharable/sanitized archive directly in teamcity.

carlossanlop commented 2 years ago

Thanks for spending time helping with this, @Bio2Hazard. I appreciate it a lot.

I performed the experiment again today, this time removing and re-adding every file, and now the headers do show up differently, so 7z is modifying the tar headers.

Ah good to know, my suspicion was correct.

At a glance, I can tell that some differences are that user / group are removed and that the file modes are different.

This might be because you were doing this on a Windows machine. The concept of mode, uname and gname are from Unix. This is ok.

So even if the header of the teamcity entry is not malformed, it still failed in the same way.

This is interesting, and I think it's the only thing left for us to clarify, if that's ok. Do you mind sharing the exception message/callstack when using TarReader with the original TeamCity archive? I think the exception messages you shared in your previous comments came from the modified file, which contained entries that were added by 7-zip, not TeamCity. Feel free to obfuscate any sensitive information if needed.

Actually, if you have time and the chance to help a bit more: if you can download HxD Hex Editor and open the .tar file generated with TeamCity (not the .tar.gz file) you can see the raw bytes of the archive. The metadata is always located in the first 512 bytes of each entry. So for example, in the archive you shared that was modified with 7-zip, the magic and version fields of the first entry can be located in the blocks 257-264. If you scroll down, the magic and version fields of the LongLink entry are located in the blocks 9473-9480.

Can you share the magic and version fields of the LongLink entry in the TeamCity archive? This would help us determine if it's malformed or not.

Here's what I get from the first entry:

Here are the magic and version of the LongLink that 7-zip modified, they are malformed - Notice that it the value is ustar\000, which is the posix format, instead of ustar \0, which is the GNU format expected for a LongLink entry:

And again, thank you so much for testing this and for your patience.

Bio2hazard commented 2 years ago

@carlossanlop I'll get those stack traces for you

Can you share the magic and version fields of the LongLink entry in the TeamCity archive?

I wanted to quickly add that you should be able to just base64 decode the headers in my last message. They should have exactly what you are looking for. It's the first 512 bytes of the entries, from both teamcity and 7z.

carlossanlop commented 2 years ago

I wanted to quickly add that you should be able to just base64 decode the headers in my last message.

Oh cool, never done that. Allow me to figure it out then, and will come back to you if I don't find the info I wanted.

EDIT: this is AWESOME! https://www.base64decode.org/

Bio2hazard commented 2 years ago

@carlossanlop nice looks like it worked out! Base64 is just a quick and convenient way to share byte[]s through a text format. You can use Convert.FromBase64String("<base 64 header from before>") to get the raw byte[] in C# if that helps.

I got the stack traces for you:

WithSystemsFormatTar_PreDecompressed()

In this variant, the .tar.gz is first decompressed via GZipStream into a seekable MemoryStream, which is passed into TarReader

TeamCity Archive:

System.FormatException: Entry 'assets/features.channel-header.components.social-buttons.components.follower-emotes-prompt-4e76cff26' was expected to be in the GNU format, but did not have the expected version data.
   at System.Formats.Tar.TarHeader.ReadVersionAttribute(Span`1 buffer)
   at System.Formats.Tar.TarHeader.TryGetNextHeader(Stream archiveStream, Boolean copyData)
   at System.Formats.Tar.TarReader.TryProcessGnuMetadataHeader(TarHeader header, Boolean copyData, TarHeader& finalHeader)
   at System.Formats.Tar.TarReader.TryGetNextEntryHeader(TarHeader& header, Boolean copyData)
   at System.Formats.Tar.TarReader.GetNextEntry(Boolean copyData)
   at Program.<<Main>$>g__WithSystemsFormatTar_PreDecompressed|0_0(<>c__DisplayClass0_0&) in C:\Development\Experimentation\net7testing\Program.cs:line 27
   at Program.<Main>$(String[] args) in C:\Development\Experimentation\net7testing\Program.cs:line 8

Test Archive:

System.FormatException: Entry 'testing-a-verylong.filename-that-cant-fit-in-a-single.tar-header-and-will-probably-fail-to-load-cor' was expected to be in the GNU format, but did not have the expected version data.
   at System.Formats.Tar.TarHeader.ReadVersionAttribute(Span`1 buffer)
   at System.Formats.Tar.TarHeader.TryGetNextHeader(Stream archiveStream, Boolean copyData)
   at System.Formats.Tar.TarReader.TryProcessGnuMetadataHeader(TarHeader header, Boolean copyData, TarHeader& finalHeader)
   at System.Formats.Tar.TarReader.TryGetNextEntryHeader(TarHeader& header, Boolean copyData)
   at System.Formats.Tar.TarReader.GetNextEntry(Boolean copyData)
   at Program.<<Main>$>g__WithSystemsFormatTar_PreDecompressed|0_0(<>c__DisplayClass0_0&) in C:\Development\Experimentation\net7testing\Program.cs:line 27
   at Program.<Main>$(String[] args) in C:\Development\Experimentation\net7testing\Program.cs:line 8

Stack Traces are identical


WithSystemsFormatTar_StreamedDecompression()

In this variant, the TarReader reads directly from the nonseekable GZipStream. No exception gets thrown here, but only a single entry is returned. The stack trace represents the part where it failed when processing the second file.

TeamCity Archive:

TarHelpers.IsAllNullBytes()at TarHelpers.cs:line 106
TarHeader.TryReadCommonAttributes()at TarHeader.Read.cs:line 309
TarHeader.TryGetNextHeader()
TarReader.TryGetNextEntryHeader()
TarReader.GetNextEntry()
Program.<<Main>$>g__WithSystemsFormatTar_StreamedDecompression|0_1()
Program.<Main>$()

Test Archive:

TarHelpers.IsAllNullBytes()at TarHelpers.cs:line 106
TarHeader.TryReadCommonAttributes()at TarHeader.Read.cs:line 309
TarHeader.TryGetNextHeader()
TarReader.TryGetNextEntryHeader()
TarReader.GetNextEntry()
Program.<<Main>$>g__WithSystemsFormatTar_StreamedDecompression|0_1()
Program.<Main>$()

The stack traces are again identical. This is the relevant failure point in TryReadCommonAttributes

// Empty checksum means this is an invalid (all blank) entry, finish early
Span<byte> spanChecksum = buffer.Slice(FieldLocations.Checksum, FieldLengths.Checksum);
if (TarHelpers.IsAllNullBytes(spanChecksum))
{
    return false;
}

The debugger showed all bytes of the checksum being null, which bubbles up and causes GetNextEntry to return false.


WithSystemsFormatTar_StreamedDecompression_WithCopyData()

In this variant, the TarReader reads directly from the nonseekable GZipStream just like the previous one. The only difference is that this uses tarReader.GetNextEntry(true) when reading entries.

TeamCity Archive:

System.IO.EndOfStreamException: Attempted to read past the end of the stream.
   at System.Formats.Tar.TarHelpers.CopyBytes(Stream origin, Stream destination, Int64 bytesToCopy)
   at System.Formats.Tar.TarHeader.GetDataStream(Stream archiveStream, Boolean copyData)
   at System.Formats.Tar.TarHeader.ProcessDataBlock(Stream archiveStream, Boolean copyData)
   at System.Formats.Tar.TarHeader.TryGetNextHeader(Stream archiveStream, Boolean copyData)
   at System.Formats.Tar.TarReader.TryGetNextEntryHeader(TarHeader& header, Boolean copyData)
   at System.Formats.Tar.TarReader.GetNextEntry(Boolean copyData)
   at Program.<<Main>$>g__WithSystemsFormatTar_StreamedDecompression_WithCopyData|0_2(<>c__DisplayClass0_0&) in C:\Development\Experimentation\net7testing\Program.cs:line 81
   at Program.<Main>$(String[] args) in C:\Development\Experimentation\net7testing\Program.cs:line 8

Test Archive:

System.FormatException: Entry 'testing-a-verylong.filename-that-cant-fit-in-a-single.tar-header-and-will-probably-fail-to-load-cor' was expected to be in the GNU format, but did not have the expected version data.
   at System.Formats.Tar.TarHeader.ReadVersionAttribute(Span`1 buffer)
   at System.Formats.Tar.TarHeader.TryGetNextHeader(Stream archiveStream, Boolean copyData)
   at System.Formats.Tar.TarReader.TryProcessGnuMetadataHeader(TarHeader header, Boolean copyData, TarHeader& finalHeader)
   at System.Formats.Tar.TarReader.TryGetNextEntryHeader(TarHeader& header, Boolean copyData)
   at System.Formats.Tar.TarReader.GetNextEntry(Boolean copyData)
   at Program.<<Main>$>g__WithSystemsFormatTar_StreamedDecompression_WithCopyData|0_2(<>c__DisplayClass0_0&) in C:\Development\Experimentation\net7testing\Program.cs:line 81
   at Program.<Main>$(String[] args) in C:\Development\Experimentation\net7testing\Program.cs:line 8

This is the only test case where the result was different between the two. I hope that helps!