System.IO.Compression handles extended characters incorrectly

vsfeedback commented 3 years ago

This issue has been moved from a ticket on Developer Community.

[severity:It's more difficult to complete my work] create a text file name "tämä.txt" send to compressed folder Extracting this file it is read by library as "t„m„.txt"

Windows explorer expands the zip fine, as other compression software's

using System.IO;
using System.IO.Compression;

namespace ZIPtest
{
    class Program
    {
        static void Main(string[] args)
        {
            string zipFile = @"C:\temp\tämä.zip";

using (ZipArchive paketti = ZipFile.OpenRead(zipFile))
            {
                foreach (ZipArchiveEntry entry in paketti. Entries)
                {
                    string nimi = Path.Combine(@"c:\temp", entry. Name);
                    entry. ExtractToFile(nimi);
                }

}
        }
    }
}

Original Comments

Feedback Bot on 9/20/2020, 11:00 PM:

We have directed your feedback to the appropriate engineering team for further evaluation. The team will review the feedback and notify you about the next steps.

Original Solutions

Tarek Mahmoud Sayed [MSFT] solved on 9/30/2020, 01:44 PM, 0 votes:

Thanks for sending the issue. I have tried it the same thing and I was not able to reproduce the issue. My guess here is the problem is not really the ZipArchiveEntry.Name content but it could be the Visual Studio locals Window displaying the string differently and this can be depending on the configuration of your machine. Usually I am seeing this maybe depending on the default codepage on your system.

Here is the code I tried:

        using (ZipArchive paketti = ZipFile.OpenRead(@"F:\temp\release.zip"))
        {
            foreach (ZipArchiveEntry entry in paketti.Entries)
            {
                Console.WriteLine($"{entry.Name} ... {DumpString(entry.Name)}");
            }
        }

And this printed the output:

  tämä.txt ... \u0074\u00e4\u006d\u00e4\u002e\u0074\u0078\u0074

which is correct.

I suggest you can try the same either manually printing the characters ordinal values and check your system configuration like default locale and codepage in the system.

on 10/8/2020, 05:05 AM:

(private comment, text removed)

Tarek Mahmoud Sayed [MSFT] on 10/8/2020, 10:10 AM:

Thanks for your reply. Unfortunately I cannot reach the files that you have attached. Could you please try to upload them again and let me know.

Also, please attach the code that you used to create the zip file again to see how did you include the file tämä.txt too.

on 10/9/2020, 02:29 AM:

(private comment, text removed)

Tarek Mahmoud Sayed [MSFT] on 10/9/2020, 11:01 AM:

Thanks for the details. I'll take a look.

tarekgh commented 3 years ago

I have looked at this issue and I am seeing the Zip archive entry names encoding is not handled correctly. When creating a new archive file there will be a generic flag telling if the archive is encoded using UTF-8 or not. if this flag is off, means the archive is not encoded using UTF-8, we don't handle the right encoding at that time. the following comment has more details:

https://github.com/dotnet/runtime/blob/b75f8d9fd1b3bcac5d82469bf39f1aea2c3a1652/src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs#L356

Usually when UTF-8 encoding is not used, we should consider using the 437 encoding instead. Or try to investigate more the details as described in https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT in Appendix D.

This issue is kind of obvious to the users because when using Windows shell to create a new Archive, it doesn't create the archive using UTF-8 and I am seeing it is using encoding 437.

jyrijh commented 3 years ago

Thank you. Now I know what is happening and this will be fixed.

But in the mean time I could do this, and force codepage 437 to get correct value from _storedEntryNameBytes. entry.Name stays incorrect.

I have to test if this works correctly in all cases

string zipFile = @"C:\temp\tämä.zip";            
var ibm437 = Encoding.GetEncoding("IBM437");

using (ZipArchive paketti = ZipFile.OpenRead(zipFile))
{
    var entryEncoding = paketti.GetType().
        GetProperties(BindingFlags.NonPublic | BindingFlags.Instance).
        Single(pi => pi.Name == "EntryNameEncoding");

    entryEncoding.SetValue(paketti, ibm437);

    foreach (ZipArchiveEntry entry in paketti.Entries)
    {
        var name = entry.GetType().GetFields(BindingFlags.NonPublic | BindingFlags.Instance).Single(pi => pi.Name == "_storedEntryNameBytes");
        var nameValue = name.GetValue(entry);
        var result = ibm437.GetString((byte[])nameValue);

        Console.WriteLine($"{entry.Name} ... {DumpString(entry.Name)}");
        Console.WriteLine($"{result} ... {DumpString(result)}");
    }
}

jyrijh commented 3 years ago

ZipFile.Open(String, ZipArchiveMode, Encoding)

Now I just need to figure out what encoding has been used

jyrijh commented 3 years ago

With this I came up. Minimal detection for encoding and after quick testing it works.

using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Runtime.InteropServices;
using System.Text;

namespace ZIPtest
{
    class Program
    {
        [Flags]
        public enum GeneralPurpose : ushort
        {
            None = 0,
            bit00 = 1 << 0,
            bit01 = 1 << 1,
            bit02 = 1 << 2,
            bit03 = 1 << 3,
            bit04 = 1 << 4,
            bit05 = 1 << 5,
            bit06 = 1 << 6,
            bit07 = 1 << 7,
            bit08 = 1 << 8,
            bit09 = 1 << 9,
            bit10 = 1 << 10,
            languageEncoding = 1 << 11,
            bit12 = 1 << 12,
            bit13 = 1 << 13,
            bit14 = 1 << 14,
            bit15 = 1 << 15
        }

        [StructLayout(LayoutKind.Sequential, Pack = 1)]
        struct Header
        {
            [MarshalAs(UnmanagedType.ByValArray, SizeConst = 4)]
            public byte[] signature;
            public ushort version;
            public GeneralPurpose generalPurposeBit;
        }

        static void Main()
        {
            string filename = @"C:\temp\tämä.zip";
            var encoding = DetectEncoding(filename);
            CheckNames(filename, encoding);

            filename = @"C:\temp\漢字.zip";
            encoding = DetectEncoding(filename);
            CheckNames(filename, encoding);

            void CheckNames(string zipName, Encoding zipEncoding)
            {
                using (ZipArchive paketti = ZipFile.Open(zipName, ZipArchiveMode.Read, zipEncoding))
                {
                    foreach (ZipArchiveEntry entry in paketti.Entries)
                    {
                        Console.WriteLine($"{entry.Name} ... {DumpString(entry.Name)}");
                    }
                }
            }

            string DumpString(string name) => string.Join(", ", name.ToArray().Select(x => $"{(int)x:X4}"));
        }

        static Encoding DetectEncoding(string file)
        {
            var data = new byte[8];
            using (FileStream sr = new FileStream(file, FileMode.Open))
                sr.Read(data, 0, 8);

            GCHandle gch = GCHandle.Alloc(data, GCHandleType.Pinned);
            Header header = (Header)Marshal.PtrToStructure(gch.AddrOfPinnedObject(), typeof(Header));
            gch.Free();

            if (header.generalPurposeBit.HasFlag(GeneralPurpose.languageEncoding))
                return Encoding.UTF8;
            else
                return Encoding.GetEncoding("IBM437");
        }
    }
}

jyrijh commented 3 years ago

And no it didn't work as intended.

if zip file has these in it, they will appear in this order, and my code will get encoding from first entry.

aa 漢字.txt
äÄöÖåÅüÜéÉ.txt
漢字.txt

And then the second filename is actually 437 encoded. IO.Compression will handle everything with same encoding when it should handle it per entry.

tarekgh commented 3 years ago

ZipFile.Open(String, ZipArchiveMode, Encoding) can be used to choose the specific encoding and should help if you know the encoding.

For https://github.com/dotnet/runtime/issues/43231#issuecomment-707643592, this may work in some cases but not in general cases as you indicated later. if you look at pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT it says

D.3 Applications MAY choose to supplement this file name storage through the use 
of the 0x0008 Extra Field.  Storage for this optional field is currently 
undefined, however it will be used to allow storing extended information 
on source or target encoding that MAY further assist applications with file 
name, or file content encoding tasks.  Please contact PKWARE with any
requirements on how this field SHOULD be used.

jyrijh commented 2 years ago

After giving this some thought, exposing the header for entries would be suitable solution for this. Then it would be up to programmer to check those values if they are interested in them.

cremor commented 2 years ago

I also had ZIP entry encoding problems recently and this is what I figured out:

Archives created by .NET itself can be extracted without specifying an encoding just fine, so the automatically detected encoding is correct for those.
Archives created by the Windows shell extension or by 7-Zip (on Windows) can not correctly be extracted with .NET without specifying an encoding manually. More information here: https://stackoverflow.com/a/32443735/631802
Archives created on macOS also can not be correctly extracted with .NET. This is also true if the Windows workaround is applied. This means that it is not possible to extract both Windows and macOS archives with the same encoding workaround.

I think the runtime should handle all the encoding detection work.

jyrijh commented 2 years ago

This is exactly why the language encoding bit should be exposed. if it is other that UTF-8 then runtime has no way of knowing what the codepage should be. Me as developer can "guess" what codepage to use by knowing where the file is coming from.

cremor commented 2 years ago

if it is other that UTF-8 then runtime has no way of knowing what the codepage should be

How are standalone archiving tools like 7-Zip are handling that case then? 7-Zip can extract all archives that I tested with correct encodings. If 7-Zip can do it then the .NET runtime can do it for sure too?

jyrijh commented 2 years ago

If you create an archive that has any other encoding that UTF-8 or IBM437 it doesn't.

7-zip will extract this ")=.txt" as "]~Kúºú"

So I need to know where the file came from and have knowledge what codepage they used. Usually it should be UTF-8 or IBM437, but even then I need to know then per entry what the language encoding bit is

var zipFile = @"c:\temp\test.zip";
var name = ")=.txt";
var encoding = Encoding.GetEncoding("IBM297");

using (FileStream stream = new FileStream(zipFile, FileMode.OpenOrCreate))
{
    using (ZipArchive archive = new ZipArchive(stream, ZipArchiveMode.Update, false, encoding))
    {
        var entry = archive.CreateEntry(name);
        using (StreamWriter writer = new StreamWriter(entry.Open()))
        {
            writer.WriteLine("text");
        }
    }
}

cremor commented 2 years ago

Yeah ok, but I meant archives that were created with "standard" tools. .NET should be able to handle those automatically.

cremor commented 2 years ago

I just found this comment in the .NET source code: https://source.dot.net/#System.IO.Compression/System/IO/Compression/ZipArchive.cs,356

What's weird about that is that it seems to be wrong. Specifically the part where it says

For instance, the Windows Shell Zip tool takes "something else" to mean "the local system codepage". We default to the same behaviour, [...]

In my experience this is not the case, see https://github.com/dotnet/runtime/issues/43231#issuecomment-1008834276. The Windows Shell Zip tool seems to use "IBM850" (not "IBM437"!) encoding (on Windows 10). And because .NET does not default to that (but instead defaults to "the current system default code page"), those Windows zip files can not be correctly extracted by .NET.

Maybe .NET just understands "the current system default code page" different than Windows. As written, Windows seems to use "IBM850" encoding, which is available in .NET as CultureInfo.CurrentCulture.TextInfo.OEMCodePage. But .NET seems to think that the default is "Windows-1252" (available in Encoding.Default).

Anyway, even if it would work by default for zip files created on Windows, we would still have problems with zip files created on macOS.

tarekgh commented 2 years ago

Maybe .NET just understands "the current system default code page" different than Windows. As written, Windows seems to use "IBM850" encoding, which is available in .NET as CultureInfo.CurrentCulture.TextInfo.OEMCodePage. But .NET seems to think that the default is "Windows-1252" (available in Encoding.Default).

If you are running with full framework, default code page would be picked up from Windows API GetACP. If running with .NET or .NET Core, the default codepage would be UTF-8. to get the default system codepage in .NET, need to register the codepages encoding as described in https://docs.microsoft.com/en-us/dotnet/api/system.text.codepagesencodingprovider?view=net-6.0#remarks and then call Encoding.GetEncoding(0).

cremor commented 2 years ago

I tested on .NET Framework 4.8.

But as far as I understand it .NET Core/5/6 will have the same problem, right? Because neither uses the same ZIP entry name encoding by default that Windows uses.

tarekgh commented 2 years ago

But as far as I understand it .NET Core/5/6 will have the same problem, right?

Yes. I was just explaining how Encoding picks the default.

caverna commented 1 year ago

I tested on .NET Framework 4.8.

But as far as I understand it .NET Core/5/6 will have the same problem, right? Because neither uses the same ZIP entry name encoding by default that Windows uses.

yep! I reached this page having the same issue and it fails for .NET Core 7

Stallind commented 7 months ago

Commeting here since I found the thread googling about the same issue.

For .net framework 4.8 it works for me using Encoding.GetEncoding(System.Globalization.CultureInfo.CurrentCulture.TextInfo.OEMCodePage).

For .net core I used a somewhat ugly hack where I replace the "illegal characters". In this case swedish åäö.

var replacements = new Dictionary<string,string> { 
    { "\u0086", "å" },
    { "\u008f", "Å"},
    { "\u0084", "ä"},
    { "\u008e", "Ä"},
    { "\u0094", "ö"},
    { "\u0099", "Ö"} 
};
using (ZipArchive source = System.IO.Compression.ZipFile.Open(@"C:\New Folder\testing.zip", ZipArchiveMode.Read, Encoding.GetEncoding("ISO-8859-1")))
{
    foreach (ZipArchiveEntry entry in source.Entries)
    {
        var name = entry.Name;
        foreach (var replacement in replacements)
        {
            name = name.Replace(replacement.Key, replacement.Value);
        }
        Console.WriteLine(name);
    }
}

I can't say this is the way to go and that I recommend it, but it solved the issue for me.

dotnet / runtime