apiaryio / api-elements

API Elements is a structure for describing APIs and the complex data structures used within them.
http://apielements.org/

Clarify source map type #25

Open kylef opened 7 years ago

kylef commented 7 years ago

There are discrepancies between the API Blueprint parser and the Swagger adapter because source map positions are treated differently (https://github.com/apiaryio/fury.js/issues/63).

In the Swagger adapter, a source map represents the position of a character in a JavaScript string, whereas the API Blueprint parser (snowcrash/drafter) crafts source maps from byte offsets into the underlying buffer.

We need to decide how source maps should be represented and then align the adapters.
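A minimal sketch of the discrepancy in Python 3 (the document and offsets here are illustrative, not taken from either parser):

doc = '# é API\n'                        # the source as a decoded string
data = doc.encode('utf-8')               # the source as the underlying buffer

char_offset = doc.index('API')           # 4: character position, as in the Swagger adapter
byte_offset = data.index(b'API')         # 5: byte offset, as in snowcrash/drafter
print(char_offset, byte_offset)          # 4 5, because é occupies two bytes in UTF-8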

/cc @w-vi

w-vi commented 7 years ago

I'll come back to this later, but a quick note: relying on JavaScript strings is a little unfortunate, as we would be relying on the string implementation of one runtime, and everything else would then need to treat it the same way. AFAIK JavaScript uses UTF-16.

kylef commented 7 years ago

Yes, and this could lead to different behaviours depending on the language and its string implementation. Some conversions through buffer -> string -> buffer may actually be lossy, particularly in implementations that perform Unicode normalisation or apply grapheme breaking rules.
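As a sketch of how such a round trip can be lossy: Python itself does not normalise implicitly, so unicodedata.normalize below stands in for a runtime that normalises to NFC:

import unicodedata

original = b'e\xcc\x81'                              # "e" + combining acute: 3 bytes of UTF-8
decoded = original.decode('utf-8')
normalised = unicodedata.normalize('NFC', decoded)   # stand-in for a normalising runtime
print(normalised.encode('utf-8'))                    # b'\xc3\xa9': re-encodes to only 2 bytes
# The round trip changed the byte layout, so offsets computed against the
# original buffer no longer match the re-encoded output.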

w-vi commented 7 years ago

I have thought about this a bit, and what I think is best is to use an offset in characters, where a character is not a byte but a logical unit whose size depends on the encoding of the document. In the case of UTF-8, a string might be 3 characters (runes) but 13 bytes. What do you think @kylef?

kylef commented 7 years ago

@w-vi If I understand you correctly, then I agree.

Due to language differences this can be confusing, especially when languages have different grapheme breaking rules. I could see it being very problematic for consumers to locate the original source text from a source map.

I think we should alter our JS parsing libraries (Fury, Protagonist) to accept a buffer instead of a string, to keep the original source document intact along with the way its graphemes are broken down.

For the Helium parsing service, the input document is embedded inside a JSON payload and already serialised. I fear we could lose the original encoding, and the source maps could become incorrect. api.apiblueprint.org deals with unserialised data and thus wouldn't have this problem.

Just to confirm we're on the same page. Here's an example:

Say I have five characters (e, \r\n, é, é and 👨‍👨‍👦‍👦), encoded as UTF-8, and I want to refer to the family (👨‍👨‍👦‍👦). A character here is equal to a Unicode grapheme cluster. You would propose the source map be 4 for location and 1 for length?

I want to point out that not all of those characters are the same, although some may well look identical (é vs é). Note also that \r\n is a single grapheme cluster, not two (http://www.unicode.org/reports/tr29/#GB3).

Here is Swift showing the lengths of those characters in each encoding, and the encoded string as base64.

import Foundation  // needed for data(using:) and base64EncodedString()

let eAcute: Character = "\u{E9}"                         // é (precomposed)
let combinedEAcute: Character = "\u{65}\u{301}"          // e followed by a combining acute accent (U+0301)
// eAcute is é, combinedEAcute is é

let characters: [Character] = ["e", "\r\n", eAcute, combinedEAcute, "👨‍👨‍👦‍👦"]
let string = String(characters)
let utf8Data = string.data(using: .utf8)

print(characters)
// ["e", "\r\n", "é", "é", "👨‍👨‍👦‍👦"]

print(characters.map { String($0).utf8.count })
// [1, 2, 2, 3, 25]

print(characters.count)
// 5
print(string.utf8.count)
// 33

print(string.utf16.count)
// 17

print(utf8Data)
// Optional(33 bytes)

print(utf8Data?.base64EncodedString())
// Optional("ZQ0Kw6llzIHwn5Go4oCN8J+RqOKAjfCfkabigI3wn5Gm")

Then let's take the base64 and decode it in Python 3:

>>> import base64
>>> data = base64.decodebytes(b'ZQ0Kw6llzIHwn5Go4oCN8J+RqOKAjfCfkabigI3wn5Gm')
>>> string = data.decode('utf-8')
>>> data
b'e\r\n\xc3\xa9e\xcc\x81\xf0\x9f\x91\xa8\xe2\x80\x8d\xf0\x9f\x91\xa8\xe2\x80\x8d\xf0\x9f\x91\xa6\xe2\x80\x8d\xf0\x9f\x91\xa6'
>>> len(data)
33
>>> string
'e\r\néé👨\u200d👨\u200d👦\u200d👦'
>>> len(string)
13
>>> string[1:2]
'\r'
>>> string[2:3]
'\n'

Python gives 13 as the length because len() counts Unicode code points rather than grapheme clusters, so it matches neither the character (grapheme) count nor the UTF-8 or UTF-16 lengths. That is also why Python treats \r and \n as separate characters when slicing.
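For comparison, a grapheme-aware count does give 5. Here is a sketch using a recent version of the third-party regex module (pip install regex), whose \X pattern matches extended grapheme clusters; the standard library re module has no equivalent:

import regex  # third-party; stdlib `re` has no \X support

s = 'e\r\n\u00e9e\u0301\U0001F468\u200D\U0001F468\u200D\U0001F466\u200D\U0001F466'
clusters = regex.findall(r'\X', s)
print(len(clusters))  # 5: \r\n and the ZWJ family sequence each count as one cluster
print(clusters[4])    # 👨‍👨‍👦‍👦: character-based location 4, length 1, as proposed above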

Then in Node 6 (perhaps there is another way of doing this, I am not that proficient in Node):

> const { StringDecoder } = require('string_decoder')
> const data = Buffer.from('ZQ0Kw6llzIHwn5Go4oCN8J+RqOKAjfCfkabigI3wn5Gm', 'base64')
undefined
> data
<Buffer 65 0d 0a c3 a9 65 cc 81 f0 9f 91 a8 e2 80 8d f0 9f 91 a8 e2 80 8d f0 9f 91 a6 e2 80 8d f0 9f 91 a6>
> data.length
33
> const decoder = new StringDecoder('utf8');
undefined
> console.log(decoder.write(data));
e
éé👨‍👨‍👦‍👦
undefined
> console.log(decoder.write(data).length);
17

It looks like strings are internally stored as UTF-16, which means that when a string is serialised back to UTF-8 the output may be normalised (both é would become the same size if normalisation occurs). Since we are embedding the payload as a Unicode string in a JSON payload for the parsing service request, we have likely lost the original form, which can result in the source maps being incorrect with respect to what the user sent. The APIs for Fury/Protagonist/Drafter.js accept strings and not buffers, so the deserialisation has already occurred, and re-serialisation into UTF-8 for the parser may also lose information.

w-vi commented 7 years ago

Yes, we are on the same page and mean the same thing. The question is whether we should impose UTF-8 as the only acceptable encoding and, in the case of Helium, require the input to be base64 encoded, so that we get the raw bytes instead of a string and don't lose the original document. JavaScript uses UTF-16 for strings; that is not Node specific but JavaScript in general. And the Python result looks funny; I'll probably look into it in more detail.
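A sketch of what the base64 option could look like for the Helium request body; the field name is hypothetical, not Helium's actual API:

import base64, json

source = '# é API\n'.encode('utf-8')     # the raw bytes of the original document
payload = json.dumps({'input': base64.b64encode(source).decode('ascii')})

# The service decodes back to the exact original bytes, so byte-based
# source maps computed against `source` stay valid end to end.
received = base64.b64decode(json.loads(payload)['input'])
assert received == source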

pksunkara commented 7 years ago

Since we are embedding the payload as a unicode string into a JSON payload for the parsing service request

Wait, we do that? I thought the strings we were passing were UTF-8 or something.

Other than this, can we all agree that no change is needed in the Refract spec other than stating that the location and length of the source map element refer to bytes and not characters?

kylef commented 7 years ago

The specification already states that source maps contain character offsets, not bytes:

"This location SHOULD include the characters used to build the parent element"

"A source map is a series of character-blocks"

"Zero-based index of a character in the source document"

"Count of characters starting from the character index"

So no change is needed in the specification. However, neither the API Blueprint nor the Swagger parser is following these rules.

pksunkara commented 7 years ago

I thought we wanted to change it to bytes according to the above discussion.

kylef commented 6 years ago

After giving this a bit of thought, I think we should use byte-based source maps referring to the original source document. Using characters will be problematic for the following reasons:

Steps to Proceed

kylef commented 6 years ago

An additional note: I think we should provide conveniences in Fury/Minim-API-Description to convert a source map to a specific line number, as this is a common pattern that various editors and tooling re-implement.
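A sketch of such a convenience, assuming byte-based source maps; the function name and signature are illustrative, not Fury's or Minim's actual API:

def offset_to_position(source: bytes, offset: int) -> tuple:
    """Convert a byte offset into a 1-based (line, column) pair."""
    line = source.count(b'\n', 0, offset) + 1
    last_newline = source.rfind(b'\n', 0, offset)   # -1 when offset is on line 1
    column = offset - last_newline                  # column counted in bytes
    return (line, column)

print(offset_to_position(b'first\nsecond\n', 8))    # (2, 3): the "c" in "second"

Note the column here is counted in bytes; reporting a character-based column would additionally require decoding the line in the document's encoding.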

tjanc commented 6 years ago

Ideally we'd like to index by character, but if we do that, we need to understand all supported encodings. @kylef, what encodings do we officially support? What about the option of imposing UTF-8, as mentioned by @w-vi?