MicrosoftDocs / PowerShell-Docs

The official PowerShell documentation sources
https://learn.microsoft.com/powershell

UTF-8 filename handling for Archive cmdlets #5450

Closed lbruun closed 4 years ago

lbruun commented 4 years ago

Please document how this cmdlet handles a ZIP produced with the Windows Explorer "Compressed Folders" feature when entry names in the ZIP contain non-ASCII characters, for example an entry named "Plankalkül.dat". The cmdlet cannot correctly expand such an archive. Bottom line: the Expand-Archive cmdlet doesn't seem to be compatible with the majority of ZIP implementations out there in this respect, including "Compressed Folders", the 7-Zip file manager, etc. Perhaps an additional switch on the cmdlet would solve the problem?


sdwheeler commented 4 years ago

@lbruun Thanks for the feedback. We will get the documentation updated. Please file a feature request in the source code repository at: https://github.com/PowerShell/PowerShell/issues/new/choose

sdwheeler commented 4 years ago

@lbruun I did some research into these cmdlets and the ZIP specification. The cmdlets are using the .NET ZipArchive class. So any change would have to happen there.

Compress-Archive stores the file names using UTF-8 encoding, and Expand-Archive extracts those files with the proper characters. 7-Zip stores the file name using Code Page 437 encoding, which encodes the "ü" character as 0x81. Expand-Archive just writes the value stored. The problem is that there is no official standard for encoding characters in filenames. See Section D.4 in https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT.
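
To make the byte-level difference described above concrete, here is a minimal sketch. It is only an illustration, not how the cmdlets are implemented. It assumes code page 437 is available in the session, and it builds the "ü" from its code point so the result does not depend on how the script file itself is encoded.

```powershell
# The sample name from this thread, with U+00FC (ü) built from its code point.
$name = "Plankalk$([char]0x00FC)l.dat"

$utf8  = [System.Text.Encoding]::UTF8
$cp437 = [System.Text.Encoding]::GetEncoding(437)   # assumption: code page 437 is available

# The same character becomes different bytes depending on the encoding.
$utf8Bytes  = $utf8.GetBytes([string][char]0x00FC)    # 0xC3 0xBC
$cp437Bytes = $cp437.GetBytes([string][char]0x00FC)   # 0x81

'UTF-8 : ' + (($utf8Bytes  | ForEach-Object { '0x{0:X2}' -f $_ }) -join ' ')
'CP437 : ' + (($cp437Bytes | ForEach-Object { '0x{0:X2}' -f $_ }) -join ' ')

# A name stored as code page 437 bytes does not survive being decoded as UTF-8.
$stored  = $cp437.GetBytes($name)
$misread = $utf8.GetString($stored)
[string]::Equals($misread, $name)   # False
```

The last line shows why a reader that assumes UTF-8 produces a wrong file name for an archive whose names were written with an OEM code page.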

lbruun commented 4 years ago

Thanks @sdwheeler for very good comments. Appreciate the reaction time here.

Yes, I (now) understand the shortcomings of the ZIP spec.

Yes, Compress-Archive is consistent with Expand-Archive, as they both use UTF-8. So Expand-Archive can predictably unpack what was created with Compress-Archive. Check!

However, the point here is that Expand-Archive cannot predictably unzip a file created with native Windows (the File Explorer "Compressed Folders" feature), and that is of course not what a user would expect. It should be documented.

I've opened Feature Request 11901 on the matter. It proposes what I believe would be the most natural way to let users run Expand-Archive on any zip file, no matter where it was created.

Btw: I think you are slightly wrong when you say that 7-Zip encodes file names as Code Page 437. More accurately, it encodes file names using the host's OEM code page, which may or may not be 437. On my system it is Code Page 850. The Windows Compressed Folders feature, as far as I can tell, does the same. So it is not really 437 that is at play.
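
A small sketch of why the choice between 437 and 850 matters, assuming both code pages are available in the session. The byte for "ü" (0x81) happens to decode the same way in both, so that character alone would not reveal a wrong guess, but other bytes such as 0x9B do not.

```powershell
$cp437 = [System.Text.Encoding]::GetEncoding(437)   # assumption: available in this session
$cp850 = [System.Text.Encoding]::GetEncoding(850)   # assumption: available in this session

# 0x81 is U+00FC (ü) in both code pages.
$u437 = $cp437.GetString([byte[]]@(0x81))
$u850 = $cp850.GetString([byte[]]@(0x81))
'0x81 -> U+{0:X4} (437), U+{1:X4} (850)' -f [int][char]$u437, [int][char]$u850

# 0x9B is U+00A2 (cent sign) in 437 but U+00F8 (ø) in 850.
$c437 = $cp437.GetString([byte[]]@(0x9B))
$c850 = $cp850.GetString([byte[]]@(0x9B))
'0x9B -> U+{0:X4} (437), U+{1:X4} (850)' -f [int][char]$c437, [int][char]$c850
```

So an archive whose names were written with code page 850 can look correct for some names when decoded as 437, and wrong for others.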

Therefore, my workaround at the moment is to use:

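# Load the ZipFile type (needed in Windows PowerShell 5.1).
Add-Type -AssemblyName System.IO.Compression.FileSystem
$enc = [System.Text.Encoding]::GetEncoding((Get-WinSystemLocale).TextInfo.OEMCodePage)
[System.IO.Compression.ZipFile]::ExtractToDirectory("myarchive.zip", ".", $enc)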

Because of the shortcomings of the ZIP spec, there's no way to tell which file name encoding the archive is using. In my case I know it wasn't created by PowerShell itself, so I think (Get-WinSystemLocale).TextInfo.OEMCodePage is the best guess, or at least a much better guess than 437.
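
The workaround above could be wrapped in a small helper along these lines. This is only a sketch: the function name Expand-ZipWithEncoding and its parameters are hypothetical and not part of any module. The default code page relies on Get-WinSystemLocale, which is only available on Windows, and the helper assumes the requested code page is available in the session. It passes full paths to .NET because .NET resolves relative paths against the process working directory, which can differ from the current PowerShell location.

```powershell
function Expand-ZipWithEncoding {   # hypothetical helper, not an existing cmdlet
    [CmdletBinding()]
    param(
        [Parameter(Mandatory)] [string] $Path,
        [string] $DestinationPath = '.',
        # Best guess from the comment above: the host's OEM code page (Windows-only default).
        [int] $CodePage = (Get-WinSystemLocale).TextInfo.OEMCodePage
    )

    # Load the ZipFile type (needed in Windows PowerShell 5.1).
    Add-Type -AssemblyName System.IO.Compression.FileSystem

    # Convert PowerShell paths to full file system paths before handing them to .NET.
    $zip  = (Resolve-Path -LiteralPath $Path).ProviderPath
    $dest = $PSCmdlet.GetUnresolvedProviderPathFromPSPath($DestinationPath)

    # Decode entry names with the chosen code page instead of the default.
    $enc = [System.Text.Encoding]::GetEncoding($CodePage)
    [System.IO.Compression.ZipFile]::ExtractToDirectory($zip, $dest, $enc)
}
```

For example, `Expand-ZipWithEncoding -Path .\myarchive.zip -DestinationPath .\out -CodePage 850` would decode entry names as code page 850. The code page remains a guess, since the archive itself does not record which one was used.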

sdwheeler commented 4 years ago

@lbruun Yes, we can add a note to the documentation about the behavior of Expand-Archive. That's a good point about your code page. I said code page 437 because that's what is listed in the Zip APPNOTE.TXT, but it seems to be implementation dependent. It would make sense to me to use the host's OEM code page.