3MFConsortium / spec_core

3MF's Core specification
BSD 2-Clause "Simplified" License

Specification for names of OPC-parts leaves some open questions #42

Open martinweismann opened 3 years ago

martinweismann commented 3 years ago

https://github.com/3MFConsortium/spec_core/blob/master/3MF%20Core%20Specification.md#22-part-naming-recommendations does not say which characters may be used in OPC part names.

What does OPC actually say about that?

Other things (like part numbers) are specified well: https://github.com/3MFConsortium/spec_core/blob/master/3MF%20Core%20Specification.md#3431-item-element refers to the standard XML simple type xs:string (https://www.w3.org/TR/xmlschema11-2/#string).

jordig100 commented 3 years ago

OPC Section 9.1.1.1 Part Name Syntax has the following:

9.1.1.1 Part Name Syntax

A Part name shall be an IRI and shall be encoded as either a Part IRI or a Part URI. A Part IRI is a physical representation that permits direct use of Unicode characters. A Part URI is a physical representation that uses a percent-encoding for non-ASCII Unicode characters. [Note: Not all versions of the ZIP specification support a Part name represented as a Part IRI. To preserve interoperability, implementers are encouraged to use the currently more prevalent Part URI representation. end note]

A Part IRI allows Unicode characters, while a Part URI only allows ASCII, with percent-escaped sequences for non-ASCII Unicode characters. The spec recommends using Part URI for better interoperability, but it does not say that Part IRI is incorrect. My interpretation is that producers MUST use the Part URI representation, while consumers MAY still support Part IRI.
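To make the difference between the two representations concrete, here is a minimal sketch that converts a part name written with direct Unicode characters into its percent-encoded form. The helper name `to_part_uri` and the sample part name are hypothetical, and the sketch assumes the input does not already contain percent-escapes (they would be escaped a second time). It is not a full implementation of the OPC part name rules.

```python
from urllib.parse import quote, unquote


def to_part_uri(part_iri: str) -> str:
    """Percent-encode non-ASCII characters as UTF-8 octets (hypothetical helper)."""
    # safe="/" keeps the segment separators; ASCII letters, digits and "_.-~"
    # pass through unchanged, everything else becomes %XX.
    return quote(part_iri, safe="/")


part_iri = "/3D/Texture/čš.png"      # illustrative name with direct Unicode characters
part_uri = to_part_uri(part_iri)

print(part_uri)                       # /3D/Texture/%C4%8D%C5%A1.png
print(unquote(part_uri) == part_iri)  # True: decoding restores the original name
```

A consumer that wants to accept both forms could compare names after decoding, which is one way to read the "producers use Part URI, consumers may accept Part IRI" interpretation.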

bubnikv commented 3 years ago

OPC Section 9.1.1.1 Part Name Syntax has the following:

Would you please refer to your source?

I ask because RFC 3987, which defines IRIs, specifies percent-encoding in a similar way to URIs: https://www.ietf.org/rfc/rfc3987.txt, see section "2.2. ABNF for IRI References and IRIs".

I don't understand how a consumer could tell whether the names are URI or IRI encoded if IRI did not use the percent-escape rule.

bubnikv commented 3 years ago

OPC Section 9.1.1.1 Part Name Syntax has the following:

Would you please refer to your source?

I found it.

The "OPC Section 9.1.1.1 Part Name Syntax" says

A Part IRI is a physical representation that permits direct use of Unicode characters.

But I think this is not correct. Indeed, just the next section

9.1.1.1.1 Part IRI syntax

which is copied from RFC 3987, defines the IRI encoding, which does NOT allow direct use of Unicode characters. I think OPC Section 9.1.1.1 needs amendment.

bubnikv commented 3 years ago

OPC specification

https://www.ecma-international.org/publications-and-standards/standards/ecma-376/

Ecma Office Open XML Part 2 - Open Packaging Conventions.pdf

bubnikv commented 2 years ago

According to the OPC spec, OPC part names can be URI or IRI encoded. The open points are:

1) Shall we recommend URI over IRI for backwards compatibility? We believe we should recommend URI, because at least the Microsoft OPC implementation in Windows 10 does not seem to understand IRI; see our Report - Open Packaging Conventions below.
2) Does IRI allow using plain UTF-8? That is a longer one.

According to https://datatracker.ietf.org/doc/html/rfc3987#section-2.1

IRIs are defined similarly to URIs in [RFC3986], but the class of unreserved characters is extended by adding the characters of the UCS (Universal Character Set, [ISO10646]) beyond U+007F, subject to the limitations given in the syntax rules below and in section 6.1.

But that is the only mention of "unreserved" characters in RFC 3987. I understand this paragraph as "characters beyond U+007F, subject to the limitations ..., could be stored in an IRI verbatim without encoding". However, https://datatracker.ietf.org/doc/html/rfc3987#section-2.2 does not reference the "unreserved" characters at all. IMHO RFC 3987 is ambiguous and not quite complete.

This stackoverflow post seems to explain a lot. https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid/36667242#36667242

Then RFC 3987 extends that set of unreserved characters with the following Unicode character ranges: %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD / %xD0000-DFFFD / %xE1000-EFFFD

It is my understanding that IRI allows Unicode characters from %xA0 upwards to be stored directly without escaping (with some exceptions), while the rest of the characters still need to be escaped.

Thus the "OPC Section 9.1.1.1 Part Name Syntax" says

A Part IRI is a physical representation that permits direct use of Unicode characters.

which is quite imprecise. It should say

A Part IRI is a physical representation that permits direct use of Unicode characters above %xA0 with some exceptions.
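The "above %xA0 with some exceptions" rule can be sketched as a range check over the code point ranges quoted from the Stack Overflow post. The names `UCSCHAR_RANGES` and `is_ucschar` are hypothetical; the check only answers whether a character belongs to the set that RFC 3987 adds on top of the URI characters, so a `False` result for an ASCII letter does not mean the letter needs escaping.

```python
# Code point ranges added by RFC 3987, as quoted above.
UCSCHAR_RANGES = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    # Planes 1 through 13: %x10000-1FFFD ... %xD0000-DFFFD
    + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(0x1, 0xE)]
    + [(0xE1000, 0xEFFFD)]
)


def is_ucschar(ch: str) -> bool:
    """True if a single character falls in one of the ranges listed above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)


print(is_ucschar("č"))       # True: U+010D may appear unescaped in an IRI
print(is_ucschar("\ufdd0"))  # False: U+FDD0 is one of the excluded code points
print(is_ucschar("\x7f"))    # False: below %xA0
```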

I wonder whether any 3MF consumer / producer ever stored a character that should have been escaped according to the URI or IRI specification but was not. PrusaSlicer luckily only generates part names with printable 7-bit characters, and this may be the case for other producers as well. If it is not, and those part names were NOT URI / IRI encoded, enforcing URI / IRI encoding may break backwards compatibility with existing 3MFs.

I believe the OPC part names specification is clear now.

For names or identifiers other than OPC part names, I believe we do not have to worry, as we declare our XMLs as UTF-8 encoded. As long as these other names do not address a ZIP directory entry and do not point to a file or URL, names and IDs may use the full Unicode character set without any limitation. If they are used as identifiers, there is a risk that two IDs that should be equal are not, because one is in canonical form and the other is not. For example, the Czech character 'ú' can be encoded either as a sequence of two code points, 'u' followed by a combining acute accent (U+0301), or as the single precomposed 'ú' (U+00FA); both will be displayed the same (or nearly the same). A second issue may be that some clients are not able to display an ID because fonts are missing (for example, Chinese fonts may not be installed on the user's machine).
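The canonical-form problem can be shown in a few lines. This is only an illustration: comparing identifiers after NFC normalization is one possible mitigation, not something the 3MF or OPC specifications are known to require, and `same_identifier` is a hypothetical helper.

```python
import unicodedata

precomposed = "\u00fa"   # 'ú' as a single code point
decomposed = "u\u0301"   # 'u' followed by COMBINING ACUTE ACCENT

print(precomposed == decomposed)     # False: the strings differ code point by code point
print(precomposed.encode("utf-8"))   # b'\xc3\xba'
print(decomposed.encode("utf-8"))    # b'u\xcc\x81'


def same_identifier(a: str, b: str) -> bool:
    """Compare two IDs after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_identifier(precomposed, decomposed))  # True
```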

The report from my colleague Lukas follows:

Report - Open Packaging Conventions

According to the OPC (Open Packaging Conventions) specification ECMA-376-2 (https://www.ecma-international.org/publications-and-standards/standards/ecma-376/), all Part Names should be stored encoded as URI or IRI. The specification also recommends URI over IRI because not all ZIP implementations support storing file names encoded as IRI: UTF-8 file names were only added to the PKWARE PKZIP specification (https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT) with version 6.3.0, published 09/29/2006. This is most likely why the first release of OPC, from December 2006, mentioned storing the Part Name only encoded as URI, not as IRI. The option to store a Part Name encoded as IRI was introduced with the second release of the OPC specification in December 2008.
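The ZIP side of this can be observed with Python's standard `zipfile` module, which sets general purpose bit 11 (0x800, the UTF-8 file name flag from the PKWARE specification) when an entry name is not pure ASCII. The sketch below is an illustration only; the helper `filename_flags` and the entry names are hypothetical, and other ZIP writers may behave differently.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # general purpose bit 11: file name is UTF-8 encoded


def filename_flags(name: str) -> int:
    """Write one empty entry to an in-memory ZIP and return its flag bits."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(name, b"")
    with zipfile.ZipFile(buf) as zf:
        return zf.infolist()[0].flag_bits


iri_style = filename_flags("3D/Texture/čš.png")
uri_style = filename_flags("3D/Texture/%C4%8D%C5%A1.png")

print(bool(iri_style & UTF8_NAME_FLAG))  # True: needs the UTF-8 extension
print(bool(uri_style & UTF8_NAME_FLAG))  # False: plain ASCII entry name
```

A percent-encoded name stays within ASCII and so does not depend on the reader supporting the UTF-8 extension, which matches the interoperability argument above.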

Microsoft OPC implementation

Because the OPC specification was initiated by Microsoft, we went ahead and tested using Microsoft's own tools. While MS Word / Excel only generate simple OPC part names, Microsoft 3D Builder allows saving files with custom names (for example, an image used as a texture) in 3MF (OPC package) files. According to our tests, custom names containing non-ASCII characters are always stored encoded as URI. The URI encoded Part Name is used both in the ZIP file header and within the XML (3dmodel.model, etc.). We tried to create a custom 3MF containing a Part Name encoded as IRI (containing UTF-8 characters such as Czech "čš"), but Microsoft 3D Builder was unable to load this 3MF file. This problem may be due to unsupported UTF-8 characters in the ZIP file names rather than to a problem with loading a Part Name encoded as IRI.

According to https://docs.microsoft.com/en-us/windows/win32/api/_opc/, OPC packages can also be created through Win32 API calls. To insert a new Part Name, you need to call the method IOpcPartSet::CreatePart, which takes as a parameter an IOpcPartUri interface created by the method IOpcFactory::CreatePartUri. This function ensures that each input passed is encoded as URI before it is inserted into the OPC package. We haven't found a way to create a Part Name encoded as IRI instead of URI through the Win32 API.

Summary

Based on these findings, at least Microsoft 3D Builder cannot load a 3MF (OPC package) containing a Part Name encoded as IRI (containing UTF-8 characters). Microsoft 3D Builder forces Part Names to be encoded as URI, and the same holds for the Win32 API, which only offers storing a Part Name encoded as URI.

bubnikv commented 2 years ago

Here is an example of a URI encoded texture file name produced by Microsoft 3D Builder, containing characters outside 7-bit ASCII:

3DBuilder-URI.zip

Content of 3D\_rels\3dmodel.model.rels

<?xml version="1.0" encoding="UTF-8"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Target="/3D/Texture/pkus%2B%C4%9B%2B%C5%A1%C5%99%C4%8D%C5%A1%C5%99%C3%BD%C4%8D%C5%99%C3%BD%C5%99%C5%BE%C3%AD%C3%A1%C3%BD%C3%A9%3D%C2%B4%C2%B4%3D%3D%C3%BA%29%C2%A8%C5%AF%C2%A7%2C.-%3B%C2%B0%C2%B01235%25%CB%87%28%27%21_.png" 
Id="rel45876484" Type="http://schemas.microsoft.com/3dmanufacturing/2013/01/3dtexture" />
</Relationships>

The decoded relationship target as shown by https://www.urldecoder.io/: [image]
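The same decoding can be reproduced offline. The sketch below percent-decodes the relationship target from the .rels file above with Python's standard library and checks that re-encoding gives back the identical string. The re-encoding step uses Python's own escaping rules, which happen to match this sample; that is an observation about this one file name, not a statement about how 3D Builder is implemented.

```python
from urllib.parse import quote, unquote

# Relationship target copied from 3D\_rels\3dmodel.model.rels above.
target = (
    "/3D/Texture/pkus%2B%C4%9B%2B%C5%A1%C5%99%C4%8D%C5%A1%C5%99%C3%BD%C4%8D%C5%99"
    "%C3%BD%C5%99%C5%BE%C3%AD%C3%A1%C3%BD%C3%A9%3D%C2%B4%C2%B4%3D%3D%C3%BA%29%C2%A8"
    "%C5%AF%C2%A7%2C.-%3B%C2%B0%C2%B01235%25%CB%87%28%27%21_.png"
)

decoded = unquote(target)             # percent-escapes are read as UTF-8 octets
print(decoded)

# Re-encoding everything except "/" and the unreserved ASCII characters
# reproduces the stored target exactly.
print(quote(decoded, safe="/") == target)
```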

The file name as stored inside the 3MF ZIP package: [image]

The file name is clearly URI encoded by Microsoft 3D Builder. Microsoft most likely uses the same OPC implementation for its various OPC-derived formats, so they probably use URI encoding across the board.