dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.94k stars 4.64k forks source link

ZipArchive.GetEntry() doesn't work properly after adding 2 entries with the same path, and removing one of them #51051

Closed dovisutu closed 2 years ago

dovisutu commented 3 years ago

Description

After creating 2 entries with the same path ~(which is allowed by the standard, while not recommended)~ in a ZipArchive and deleting one of the entries, ZipArchive.GetEntry() w/ the path cannot retrieve the remaining entry, despite being already written into the archive, being able to be retrieved if the archive was readed afterwards.

Minimal reproduce:

using System;
using System.IO;
using System.IO.Compression;
public class Program
{
    static void Main(string[] args)
    {
        {
            using var stream = File.Create("..\\..\\..\\testZip.zip") ;
            using var archive = new ZipArchive(stream, ZipArchiveMode.Update);
            var text1 = "1234567890987654321";
            var text2 = "0987654321234567890";
            var previousEntry = archive.CreateEntry("test.txt");
            using (var writer = new StreamWriter(previousEntry.Open()))
            {
                writer.Write(text1);
                writer.Flush();
            }
            using (var writer = new StreamWriter(archive.CreateEntry("test.txt").Open()))
            {
                writer.Write(text2);
                writer.Flush();
            }
            previousEntry.Delete();
            var firstResult = archive.GetEntry("test.txt");
            Console.WriteLine("Looking for test.txt the first time...");
            ValidateResult(firstResult);
        }
        {
            using var stream = File.OpenRead("..\\..\\..\\testZip.zip");
            using var archive = new ZipArchive(stream);
            var secondResult = archive.GetEntry("test.txt");
            Console.WriteLine("Looking for test.txt the second time...");
            ValidateResult(secondResult);
        }
    }
    public static void ValidateResult(ZipArchiveEntry searchResult)
    {
        if (searchResult == null)
        {
            Console.WriteLine("Did not find test.txt in the archive.");
        }
        else
        {
            using var reader = new StreamReader(searchResult.Open());
            Console.WriteLine("Found test.txt in the archive. Content: {0}", reader.ReadToEnd());
        }
    }
}

Expected output:

Looking for test.txt the first time...
Found test.txt in the archive. Content: 0987654321234567890
Looking for test.txt the second time...
Found test.txt in the archive. Content: 0987654321234567890

Actual output:

Looking for test.txt the first time...
Did not find test.txt in the archive.
Looking for test.txt the second time...
Found test.txt in the archive. Content: 0987654321234567890

Configuration

dotnet version: 5.0.2 (runtime) OS & version: Windows 20H2 19042.867 architecture: x64

Other information

Possible where the issue occured: When adding an entry, ZipArchive.CreateEntry() calls ZipArchive.DoCreateEntry(), which then calls ZipArchive.AddEntry(), which looks like this: https://github.com/dotnet/runtime/blob/0af5228e8fbf24f4fecfe292c0df732e64ae13c2/src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs#L421-L430 Here, the attempted addition to _entriesDictionary is discarded when multiple entries with the same name occur. This, while prevented an exception from being thrown, caused a loss of information of the second entry, although it's still recorded in _entries. When deleting an entry, ZipArchiveEntry.Delete() calls ZipArchive.RemoveEntry(), which looks like this: https://github.com/dotnet/runtime/blob/0af5228e8fbf24f4fecfe292c0df732e64ae13c2/src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchiveEntry.cs#L263-L279 which removes the record of an entry with such name, regardless of it being present multiple times. When searching for an entry: https://github.com/dotnet/runtime/blob/0af5228e8fbf24f4fecfe292c0df732e64ae13c2/src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs#L329-L340 Here, only the attempt to search for the entry in _entriesDictionary happens, therefore unable to retrieve the entry while being present in the archive and _entries.

ghost commented 3 years ago

Tagging subscribers to this area: @carlossanlop See info in area-owners.md if you want to be subscribed.

Issue Details
### Description After creating 2 entries with the same path (which is allowed by the standard, while not recommended) in a `ZipArchive` and deleting one of the entries, `ZipArchive.GetEntry()` w/ the path cannot retrieve the remaining entry, despite being already written into the archive, being able to be retrieved if the archive was readed afterwards. #### Minimal reproduce: ```csharp using System; using System.IO; using System.IO.Compression; public class Program { static void Main(string[] args) { { using var stream = File.Create("..\\..\\..\\testZip.zip") ; using var archive = new ZipArchive(stream, ZipArchiveMode.Update); var text1 = "1234567890987654321"; var text2 = "0987654321234567890"; var previousEntry = archive.CreateEntry("test.txt"); using (var writer = new StreamWriter(previousEntry.Open())) { writer.Write(text1); writer.Flush(); } using (var writer = new StreamWriter(archive.CreateEntry("test.txt").Open())) { writer.Write(text2); writer.Flush(); } previousEntry.Delete(); var firstResult = archive.GetEntry("test.txt"); Console.WriteLine("Looking for test.txt the first time..."); ValidateResult(firstResult); } { using var stream = File.OpenRead("..\\..\\..\\testZip.zip"); using var archive = new ZipArchive(stream); var secondResult = archive.GetEntry("test.txt"); Console.WriteLine("Looking for test.txt the second time..."); ValidateResult(secondResult); } } public static void ValidateResult(ZipArchiveEntry searchResult) { if (searchResult == null) { Console.WriteLine("Did not find test.txt in the archive."); } else { using var reader = new StreamReader(searchResult.Open()); Console.WriteLine("Found test.txt in the archive. Content: {0}", reader.ReadToEnd()); } } } ``` Expected output: ``` Looking for test.txt the first time... Found test.txt in the archive. Content: 0987654321234567890 Looking for test.txt the second time... Found test.txt in the archive. Content: 0987654321234567890 ``` Actual output: ``` Looking for test.txt the first time... Did not find test.txt in the archive. Looking for test.txt the second time... Found test.txt in the archive. Content: 0987654321234567890 ``` ### Configuration dotnet version: 5.0.2 (runtime) OS & version: Windows 20H2 19042.867 architecture: x64 ### Other information Possible where the issue occured: When adding an entry, `ZipArchive.CreateEntry()` calls `ZipArchive.DoCreateEntry()`, which then calls `ZipArchive.AddEntry()`, which looks like this: https://github.com/dotnet/runtime/blob/0af5228e8fbf24f4fecfe292c0df732e64ae13c2/src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs#L421-L430 Here, the attempted addition to `_entriesDictionary` is discarded when multiple entries with the same name occur. This, while prevented an exception from being thrown, caused a loss of information of the second entry, although it's still recorded in `_entries`. When deleting an entry, `ZipArchiveEntry.Delete()` calls `ZipArchive.RemoveEntry()`, which looks like this: https://github.com/dotnet/runtime/blob/0af5228e8fbf24f4fecfe292c0df732e64ae13c2/src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchiveEntry.cs#L263-L279 which removes the record of an entry with such name, regardless of it being present multiple times. When searching for an entry: https://github.com/dotnet/runtime/blob/0af5228e8fbf24f4fecfe292c0df732e64ae13c2/src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs#L329-L340 Here, only the attempt to search for the entry in `_entriesDictionary` happens, therefore unable to retrieve the entry while being present in the archive and `_entries`.
Author: dovisutu
Assignees: -
Labels: `area-System.IO.Compression`, `untriaged`
Milestone: -
danmoseley commented 3 years ago

2 entries with the same path (which is allowed by the standard, while not recommended)

@dovisutu do you have a reason to do this? I'm wondering whether this bug matters in practice.

Also do you know whether there are issues reading archives with such entries?

dovisutu commented 3 years ago

Sorry for the late response :) It may occur that one attempts to avoid files with the same name, but fails to write the proper code (like I did long ago...) by adding an entry before deleting the existing one. The expected result should be either directly throwing an exception (if it's not supposed to do so), or dealing with it properly, instead of this weird behavior, which may confuse some programmers.

Also tested with reading archives with such entries (with the same name), it appears that only the first file that was written can be retrieved via GetEntry. Didn't test other methods though, like property ZipArchive.Entries.

carlossanlop commented 2 years ago

After creating 2 entries with the same path (which is allowed by the standard, while not recommended)

@dovisutu can you point out where in the spec is this behavior described as allowed? I can't find anything related to duplicate names/paths: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT

Maybe the problem you are reported could be related to case sensitivity on Windows? https://github.com/dotnet/runtime/issues/48267

dovisutu commented 2 years ago

I didn't find explicit statement that allow name duplication either, yet neither did it state that name has to be unique (although that doesn't necessarily indicate an allowance for duplicated name). I'll edit the issue description. Interestingly, programs like WinRAR are capable of reading and recognizing duplicated file names, yet upon opening specific one it seems to decompress all such files, asking which to preserve. (But this is not part of this issue, just for reference.)

No, it's not case sensitivity, for the files created are with exactly the same name in the zip file (but obviously not outside the file),

As for now, the library is in a state of accepting duplicated names and producing unexpected results than one would anticipate. My suggestion is to either (if deciding to avoid name duplication) throw when attempting to create such entries (but that can be done in another issue), or (if deciding to keep name duplication) properly handle such entries.

carlossanlop commented 2 years ago

programs like WinRAR are capable of reading and recognizing duplicated file names, yet upon opening specific one it seems to decompress all such files, asking which to preserve

That is the advantage of having a UI. The user can be prompted with such decisions.

I think this is a rare scenario, but if we were to propose a fix, I would lean towards throwing if an existing item has the same name.

pedrobsaila commented 2 years ago

Hi. I made a PR #60973 for this issue

TonyValenti commented 2 years ago

Hi all, Is it possible to create a flag that allows adding and reading duplicate entries with the same name?

I do work with digital forensics and this is something we specifically train trainees to look for. It is a very sneaky tactic to hide files inside a zip and obfuscate their existence in that way.

carlossanlop commented 2 years ago

Is it possible to create a flag that allows adding and reading duplicate entries with the same name?

@TonyValenti You could open an API proposal for that (here's the template), but I'm not sure if that's a feature that we would take, since it's not specifically supported in the spec.