Aloxaf / zip-rs

Zip implementation in Rust
MIT License

GBK support #1

Open Aloxaf opened 6 years ago

Aloxaf commented 6 years ago

https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT

APPENDIX D - Language Encoding (EFS)

D.1 The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437. This limits storing file name characters to only those within the original MS-DOS range of values and does not properly support file names in other character encodings, or languages. To address this limitation, this specification will support the following change.

D.2 If general purpose bit 11 is unset, the file name and comment should conform to the original ZIP character encoding. If general purpose bit 11 is set, the filename and comment must support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification. The Unicode Standard is published by the The Unicode Consortium (www.unicode.org). UTF-8 encoded data stored within ZIP files is expected to not include a byte order mark (BOM).
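For reference, a minimal sketch of the D.2 rule in Rust. The names `FLAG_UTF8`, `NameEncoding`, `name_encoding` and `decode_name` are hypothetical and not part of zip-rs. When bit 11 is unset, this sketch only decodes pure-ASCII names (ASCII bytes mean the same thing in CP437, GBK and UTF-8) and otherwise leaves the choice of code page to the caller.

```rust
/// Bit 11 of the general purpose bit flag (language encoding flag, EFS).
const FLAG_UTF8: u16 = 1 << 11;

/// How the raw file name / comment bytes of an entry should be interpreted.
#[derive(Debug, PartialEq)]
enum NameEncoding {
    /// Bit 11 is set: the bytes are UTF-8.
    Utf8,
    /// Bit 11 is unset: nominally IBM Code Page 437, in practice often a
    /// local code page such as GBK.
    Legacy,
}

/// Classify an entry by its general purpose bit flag.
fn name_encoding(general_purpose_flags: u16) -> NameEncoding {
    if general_purpose_flags & FLAG_UTF8 != 0 {
        NameEncoding::Utf8
    } else {
        NameEncoding::Legacy
    }
}

/// Decode a file name where this is possible without a code page table.
/// Returns `None` when the caller has to pick a legacy decoder
/// (CP437, GBK, ...) or when a UTF-8 flagged name is not valid UTF-8.
fn decode_name(general_purpose_flags: u16, raw: &[u8]) -> Option<String> {
    match name_encoding(general_purpose_flags) {
        NameEncoding::Utf8 => String::from_utf8(raw.to_vec()).ok(),
        // Assumption of this sketch: ASCII-only names are safe to decode
        // directly; anything else needs an explicit code page.
        NameEncoding::Legacy if raw.is_ascii() => String::from_utf8(raw.to_vec()).ok(),
        NameEncoding::Legacy => None,
    }
}
```

A `None` for a legacy entry is the point where GBK decoding would plug in.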

D.3 Applications may choose to supplement this file name storage through the use of the 0x0008 Extra Field. Storage for this optional field is currently undefined, however it will be used to allow storing extended information on source or target encoding that may further assist applications with file name, or file content encoding tasks. Please contact PKWARE with any requirements on how this field should be used.

D.4 The 0x0008 Extra Field storage may be used with either setting for general purpose bit 11. Examples of the intended usage for this field is to store whether "modified-UTF-8" (JAVA) is used, or UTF-8-MAC. Similarly, other commonly used character encoding (code page) designations can be indicated through this field. Formalized values for use of the 0x0008 record remain undefined at this time. The definition for the layout of the 0x0008 field will be published when available. Use of the 0x0008 Extra Field provides for storing data within a ZIP file in an encoding other than IBM Code Page 437 or UTF-8.

D.5 General purpose bit 11 will not imply any encoding of file content or password. Values defining character encoding for file content or password must be stored within the 0x0008 Extended Language Encoding Extra Field.

D.6 Ed Gordon of the Info-ZIP group has defined a pair of "extra field" records that can be used to store UTF-8 file name and file comment fields. These records can be used for cases when the general purpose bit 11 method for storing UTF-8 data in the standard file name and comment fields is not desirable. A common case for this alternate method is if backward compatibility with older programs is required.

D.7 Definitions for the record structure of these fields are included above in the section on 3rd party mappings for "extra field" records. These records are identified by Header ID's 0x6375 (Info-ZIP Unicode Comment Extra Field) and 0x7075 (Info-ZIP Unicode Path Extra Field).

D.8 The choice of which storage method to use when writing a ZIP file is left to the implementation. Developers should expect that a ZIP file may contain either method and should provide support for reading data in either format. Use of general purpose bit 11 reduces storage requirements for file name data by not requiring additional "extra field" data for each file, but can result in older ZIP programs not being able to extract files. Use of the 0x6375 and 0x7075 records will result in a ZIP file that should always be readable by older ZIP programs, but requires more storage per file to write file name and/or file comment fields.
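Since readers are expected to handle both methods, here is a rough sketch of decoding the data block of a 0x7075 (Info-ZIP Unicode Path) record. It assumes the layout given in the third-party mappings section of APPNOTE (a 1-byte version, currently 1, then a 4-byte CRC-32 of the standard file name field, then the UTF-8 name); please check that against the spec before relying on it. `UnicodePath` and `parse_unicode_path` are hypothetical names, and the CRC comparison is left to the caller.

```rust
/// Decoded body of an Info-ZIP Unicode Path Extra Field (Header ID 0x7075).
#[derive(Debug, PartialEq)]
struct UnicodePath {
    /// CRC-32 of the standard (non-Unicode) file name field. The caller
    /// should compare it against the header's name and ignore this record
    /// on mismatch.
    name_crc32: u32,
    /// The UTF-8 file name.
    name: String,
}

/// Parse the data block of a 0x7075 record (the bytes after Header ID and
/// Data Size). Returns `None` if the block is too short, has an unknown
/// version, or the name is not valid UTF-8.
fn parse_unicode_path(data: &[u8]) -> Option<UnicodePath> {
    // 1 byte version + 4 bytes CRC-32 is the minimum length.
    if data.len() < 5 || data[0] != 1 {
        return None;
    }
    // All multi-byte fields are little-endian.
    let name_crc32 = u32::from_le_bytes([data[1], data[2], data[3], data[4]]);
    let name = std::str::from_utf8(&data[5..]).ok()?.to_owned();
    Some(UnicodePath { name_crc32, name })
}
```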

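= = What a mess...

Judging from the description, the encoding is presumably meant to be read from the 0x0008 Extra Field, and this is what that part of the spec says: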

4.5 Extensible data fields

4.5.1 In order to allow different programs and different types of information to be stored in the 'extra' field in .ZIP files, the following structure MUST be used for all programs storing data in this field:

   header1+data1 + header2+data2 . . .

Each header should consist of:

   Header ID - 2 bytes
   Data Size - 2 bytes

Note: all fields stored in Intel low-byte/high-byte order.

The Header ID field indicates the type of data that is in the following data block.

Header IDs of 0 thru 31 are reserved for use by PKWARE. The remaining IDs can be used by third party vendors for proprietary usage.
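The `header + data` layout described in 4.5.1 can be walked roughly as follows. This is only a sketch using the standard library; `ExtraRecord`, `parse_extra_field` and `is_pkware_reserved` are hypothetical names, not existing zip-rs API.

```rust
/// One `header + data` record from a ZIP "extra" field.
#[derive(Debug, PartialEq)]
struct ExtraRecord<'a> {
    /// Header ID: the type of the data block that follows.
    header_id: u16,
    /// The data block, exactly `Data Size` bytes long.
    data: &'a [u8],
}

/// Split an extra field into its records.
/// Returns `None` if a header is truncated or a Data Size runs past the
/// end of the buffer.
fn parse_extra_field(mut buf: &[u8]) -> Option<Vec<ExtraRecord<'_>>> {
    let mut records = Vec::new();
    while !buf.is_empty() {
        if buf.len() < 4 {
            return None; // not enough bytes for Header ID + Data Size
        }
        // Both header fields are stored low byte first (little-endian).
        let header_id = u16::from_le_bytes([buf[0], buf[1]]);
        let size = u16::from_le_bytes([buf[2], buf[3]]) as usize;
        let rest = &buf[4..];
        if rest.len() < size {
            return None; // declared size exceeds the remaining bytes
        }
        let (data, tail) = rest.split_at(size);
        records.push(ExtraRecord { header_id, data });
        buf = tail;
    }
    Some(records)
}

/// Header IDs 0 through 31 are reserved for use by PKWARE.
fn is_pkware_reserved(header_id: u16) -> bool {
    header_id <= 31
}
```

Looking for the 0x0008 record then comes down to something like `records.iter().find(|r| r.header_id == 0x0008)`.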

4.5.2 The current Header ID mappings defined by PKWARE are:

  0x0001        Zip64 extended information extra field
  0x0007        AV Info
  0x0008        Reserved for extended language encoding data (PFS)
                (see APPENDIX D)
  0x0009        OS/2
  0x000a        NTFS 
  0x000c        OpenVMS
  0x000d        UNIX
  0x000e        Reserved for file stream and fork descriptors
  0x000f        Patch Descriptor
  0x0014        PKCS#7 Store for X.509 Certificates
  0x0015        X.509 Certificate ID and Signature for 
                individual file
  0x0016        X.509 Certificate ID for Central Directory
  0x0017        Strong Encryption Header
  0x0018        Record Management Controls
  0x0019        PKCS#7 Encryption Recipient Certificate List
  0x0065        IBM S/390 (Z390), AS/400 (I400) attributes 
                - uncompressed
  0x0066        Reserved for IBM S/390 (Z390), AS/400 (I400) 
                attributes - compressed
  0x4690        POSZIP 4690 (reserved)