Code review - Githubissues

ezzatron commented 9 years ago

There are a few areas of this library that would benefit from expert review. In particular, there are several areas of the RFC that I'm not 100% sure I have covered:

Case mapping

If case mapping is desired (instead of case preservation), it is RECOMMENDED to use Unicode Default Case Folding as defined in the Unicode Standard [Unicode] (at the time of this writing, the algorithm is specified in Chapter 3 of [Unicode7.0]).

In reality, I'm using String.prototype.toLowerCase, which is defined here for ECMAScript 6. For practical purposes, I think this should be fine, but confirmation would be great.

Normalization

The normalization rule of a profile specifies which Unicode normalization form (D, KD, C, or KC) is to be applied (see Unicode Standard Annex #15 [UAX15] for background information).

Similarly, I am not directly implementing Unicode normalization. Where possible, I am relying upon String.prototype.normalize, which is defined here for ECMAScript 6. In non-ES6 environments I am falling back to unorm, the de-facto standard for Unicode normalization in JavaScript. I'm somewhat confident in both of these implementations, as they reference the same document as the PRECIS RFC.

Dependencies and their Unicode versions

Something I am concerned about is the fact that these dependencies may, at certain times, be lagging behind in the version of the Unicode Character Database (UCD) that they use for their own data. I don't know that there's much I can do about this, but a better understanding of the consequences would be useful.

Character encodings

Another point of concern is JavaScript's internal character encoding, UCS-2. Both the Usernames and Passwords, and the Nickname draft specification for PRECIS profiles include the following statement:

In addition, the string MUST be encoded as UTF-8 [RFC3629].

I feel this statement may actually be detrimental to the drafts. It would be prohibitively complicated to do UCS-2 to UTF-8 conversion (and vice versa) in a browser, for example. In practice, the encoding conversion, if required at all, can quite happily occur before or after PRECIS.

The other thing that makes me question the wisdom of specifying the encoding is this statement taken directly from the PRECIS RFC proper:

Although strings that are consumed in PRECIS-based application protocols are often encoded using UTF-8 [RFC3629], the exact encoding is a matter for the application protocol that uses PRECIS, not for the PRECIS framework.

This statement seems to be at direct odds with the previous one.

Unicode 8

This one is fairly simple, my current data generation scripts are not yet working with Unicode 8. This is partly due to a dependency on codepoints, which has not yet been updated to work with Unicode 8. I do not yet know if PrecisMaker works with Unicode 8, but that is also a requirement.

Unicode 8 support now working.

ezzatron commented 9 years ago

@stpeter; I mentioned to you on Twitter recently that I was working on a JavaScript implementation of the PRECIS framework. This library is my working implementation, but there are a few areas I'm uncertain about, and I would really appreciate your advice on. I'm mentioning you in a ticket that details my main areas of concern.

sonnyp commented 8 years ago

maybe @mathiasbynens has some interest in this or can answer some of the concerns

stpeter commented 8 years ago

I will find time soon to answer these questions!

stpeter commented 8 years ago

I have yet to review the ES6 links you provided, but I'll reply briefly to a few of your questions...

First, toLower is not the same as case folding. We had some discussion about this recently on the PRECIS WG list, for instance:

https://mailarchive.ietf.org/arch/msg/precis/Jua858TgnE6GFGb5XUnPqPK3CmQ

https://mailarchive.ietf.org/arch/msg/precis/-sF4quVE-xnm-DMSbP4X1VXGlc4

I will post more about this after I've reviewed the ES6 methods.

Second, I will look at the ES6 normalization methods as well as unorm and let you know what I think. Naturally if there are problems in those APIs then it's not your job to fix them. :-)

Third, there's not much to be done about Unicode versions. You use whatever is at your disposal. In general, newer versions of Unicode mostly just add new code points. Sometimes they modify existing code points, but the code points that they modify tend to be ones that were added more recently, whereas older code points tend to be more stable. I wouldn't worry about this too much.

Fourth, UCS-2 is positively ancient! As I understand it, UCS-2 supports only 2^16 = 65,536 characters. Unicode supports over 1 million characters. UTF-16 came from UCS-2 so I'm hoping that maybe just maybe "UCS-2" is an outdated label. I'm not sure what you mean by "the encoding conversion, if required at all, can quite happily occur before or after PRECIS." What entity or code would do the conversion here? That stuff about the encoding PRECIS-based application protocols applies to things like XMPP and IMAP or whatever - they specify what their wire protocols are. At the IETF, UTF-8 is the de-facto standard encoding, and other encodings are discouraged for on-the-wire communication. Typically it introduces unnecessary complexity to provide options here (e.g., we had a long debate about allowing UTF-16 or UTF-8 or both for XMPP back in 2003-2004 and decided that only allowing UTF-8 was much simpler for most implementations). How do JavaScript libraries currently handle the encoding of data in UTF-8 for HTTP/HTML or IMAP or XMPP or whatever?

Hope that helps,

Peter

mathiasbynens commented 8 years ago

UTF-16 came from UCS-2 so I'm hoping that maybe just maybe "UCS-2" is an outdated label.

There is a difference between UTF-16 and UCS-2. UCS-2 lacks the concept of surrogate pairs, and therefore interprets e.g. 0xD834 0xDF06 (the UTF-16 encoding for U+1D306 tetragram for centre 𝌆) as two separate characters after decoding.

What JavaScript is using is something in between UTF-16 and UCS-2:

You could argue that it resembles UTF-16, except unmatched surrogate halves are allowed, surrogates in the wrong order are allowed, and surrogate halves are exposed as separate characters. I think you’ll agree that it’s easier to think of this behavior as “UCS-2 with surrogates”.

ezzatron commented 8 years ago

@stpeter, @mathiasbynens Thank you for the replies, both are very useful!

Re case folding vs toLower; I read through the discussions linked, and I'm definitely not spec-conformant there. I was unaware they were different operations. Am I correct in my understanding though, @stpeter, that there may actually be a big problem with the PRECIS specs using case folding in certain parts of the spec? I've opened a specific issue (#6) for this.

Re encodings, I've also opened #7. To clarify what I meant by "the encoding conversion, if required at all, can quite happily occur before or after PRECIS."; I was referring to these (seemingly conflicting) statements from RFC 7564 section 13.1 (emphasis mine):

13. Interoperability Considerations

13.1. Encoding

Although strings that are consumed in PRECIS-based application protocols are often encoded using UTF-8 [RFC3629], _the exact encoding is a matter for the application protocol that uses PRECIS, not for the PRECIS framework._

And statements such as this one from RFC 7613 section 3.2.1 (emphasis mine):

3. Usernames

(snip)

3.2. UsernameCaseMapped Profile

(snip)

3.2.1. Preparation

An entity that prepares a string according to this profile MUST first map fullwidth and halfwidth characters to their decomposition mappings (see Unicode Standard Annex #11 [UAX11]). This is necessary because the PRECIS "HasCompat" category specified in Section 9.17 of [RFC7564] would otherwise forbid fullwidth and halfwidth characters. After applying this width-mapping rule, the entity then MUST ensure that the string consists only of Unicode code points that conform to the PRECIS IdentifierClass defined in Section 4.2 of [RFC7564]. _In addition, the entity then MUST encode the string as UTF-8 [RFC3629]._

It seems a bit strange that encoding is specified here in the preparation step, when encoding has no relevance to the subsequent steps of enforcement (which require that preparation takes place first).

Width mapping, additional mapping, case mapping, normalization, and directionality rules can all be performed on raw unicode codepoints, which can subsequently be encoded using whatever scheme is required (UTF-8 obviously being preferable). This is in fact how I've implemented this library, taking advantage of punycode.js' utility functions to convert a native JavaScript string into an array of codepoints for processing.

I think what's really important here is 1) whether JavaScript's native strings are capable of representing the entirety of Unicode's codepoints (I'm fairly certain they are after reading @mathiasbynens' link) and 2) whether the method I'm using to decode the string into codepoints works correctly for all Unicode characters (which I'm also fairly confident about now). @mathiasbynens does that sound correct?

One option is to simply take encoding out of the equation by having precis-js simply deal with arrays of codepoints. I'm seriously considering this (see #7).

stpeter commented 8 years ago

BTW we have started discussing these issues on the precis@ietf.org list your feedback as an implementer would be greatly appreciated!

ezzatron commented 8 years ago

@stpeter Thanks, I've subscribed, and I'll try to make an effort to catch up with the latest drafts.

eloquent / precis-js

Code review #1

Case mapping

Normalization

Dependencies and their Unicode versions

Character encodings

Unicode 8