chakra-core / ChakraCore

ChakraCore is an open source Javascript engine with a C API.
MIT License
9.13k stars 1.2k forks source link

Intl: new Date().toLocaleString('de') puts unicode (BiDi) markers around punctuation #599

Closed nojvek closed 7 years ago

nojvek commented 8 years ago
d = new Date().toLocaleString('de') 
"‎17‎.‎03‎.‎2016‎ ‎14‎:‎21‎:‎50"

// I only want to match the time part
// In chrome (v8)
d.match(/(\d+\:\d+\:\d+)/)[1]
"14:27:28" // returns the correct answer

// in edge (chakra)
d.match(/(\d+\:\d+\:\d+)/)
null

// edge matches \: correctly
d.match(/\:/)
[":",":"]
 //edge matches \d+ correctly as well
d.match(/d+/)
["17", "17"]
//matches \d+\: incorrectly
d.match(/\d+\:/)
null // :(
nojvek commented 8 years ago

If you don't mind, I'd like to make it a personal exercise to fix this bug with a PR. So far I've got chakra to build. Its amazing how easy it was. I see the baseline tests and I'll spend some time this weekend to get this fixed and make sure I don't break anything else.

goyakin commented 8 years ago

This isn't a RegExp issue. It looks like we put LEFT-TO-RIGHT MARK (U+200E) around punctuation marks, but Chrome and Firefox don't.

On Edge:

var d = new Date("2016-03-17T21:43:10.645Z").toLocaleString('de');
d === "‎17‎.‎03‎.‎2016‎ ‎14‎:‎43‎:‎10";
d[1] === "1";
d[2] === "7";
d.charCodeAt(3) === 8206;
d[4] === ".";
d.charCodeAt(5) === 8206;
d[6] === "0";

On Chrome:

var d = new Date("2016-03-17T21:43:10.645Z").toLocaleString('de');
d === "‎17‎.‎3‎.‎2016‎ ‎14‎:‎43‎:‎10";
d[0] === "1";
d[1] === "7";
d[2] === ".";
d[3] === "3";

I don't know why we have the "LEFT-TO-RIGHT MARK (U+200E)" around punctuation marks though.

nojvek commented 8 years ago

Yeah that's what I just noticed.

d = new Date().toLocaleString();
WScript.Echo(d);
$ ch dateLocale.js
?3?/?17?/?2016? ?2?:?56?:?45? ?PM

I'll change the title of the bug.

nojvek commented 8 years ago

Could this be an issue in the windows GetDateFormatW ?

How do I start ch.exe on a file and break it so I can attach a debugger?

I don't see a debugging page.

// For Windows Vista and above GetDateFormatEx is preferred
WINBASEAPI
int
WINAPI
GetDateFormatW(
    _In_ LCID Locale,
    _In_ DWORD dwFlags,
    _In_opt_ CONST SYSTEMTIME * lpDate,
    _In_opt_ LPCWSTR lpFormat,
    _Out_writes_opt_(cchDate) LPWSTR lpDateStr,
    _In_ int cchDate
    );
digitalinfinity commented 8 years ago

If you're using Visual Studio, you can just set the command line parameters for ch by right clicking on the ch project, and setting the appropriate field under the Debugging Configuration Property. Then, set your breakpoints in VS, hit F5 and have fun. If you want to use another debugger, you can install the Windows Debugging Tools by following the instructions at https://msdn.microsoft.com/en-us/library/windows/hardware/ff551063(v=vs.85).aspx. You can then run windbg.exe ch.exe and launch it under that. Currently, I don't think there is a way to force breaking into the debugger from a script file in ChakraCore- @Yongqu or @curtisman would know for sure. If there isn't a way right now, feel free to file an issue to have this exposed in debug builds.

nojvek commented 8 years ago

From what I gather, it seems, some of the logic is in Intl.js. This is probably byte code generated and I don't seem to figure out where the implementation of formatDateTime is.

I got debugging to work. I added a 10 sec sleep so I could attach VS.

WScript.Echo(d.toLocaleString());

The breakpoint doesn't seem to hit GetDateLocaleString as I previously hypothesized.

    DateImplementation::GetDateLocaleString(Js::YMD *pymd, TZD *ptzd, DateTimeFlag noDateTime,ScriptContext* scriptContext)

I'm just beginning to realize how complicated chakra can be.

Yongqu commented 8 years ago

you might want to break at JavascriptDate::EntryToLocaleDateString and continue from there.

goyakin commented 8 years ago

According to the ECMAScript spec, the string returned by Date.prototype.toLocaleString needs to be in a human-readable form, but the details are left to the implementation. We insert the BiDi control characters (e.g., LEFT-TO-RIGHT MARK (U+200E)) in order to preserve the order of characters when the string is shown in a UI element.

There is an ongoing discussion on how to align implementations and standardize the use of BiDi control characters in the spec. We've decided to postpone the work on this issue until an agreement is reached as part of the standardization discussion.

@nojvek, if I understand correctly, you're trying to get the localized time string. Would Date.prototype.toLocaleTimeString help you achieve that?

dilijev commented 8 years ago

@bterlson Any word on the TC39 Consensus aspect of this issue?

bterlson commented 8 years ago

@dilijev No word, no progress, and it seems unlikely to make progress from my conversations with people.

dilijev commented 7 years ago

https://developer.microsoft.com/en-us/microsoft-edge/platform/issues/11907540/ is a duplicate of this bug.

BobFrankston commented 7 years ago

I've created a test example to address possibly related issues:

  1. toLocaleString puts a comma after the date in the Nodejs.org version but the Chakra version does not.
  2. if I use dtOptions with timeZone nodejs accepts it and ignore want is doesn't like but Chakra gets an error. For my purpose {hour12:false} is sufficient so I didn't try to fully debug it.
const dtOptions = {
    timeZone: "America/New_York",
    timeZoneName: "short",
    hour12: false,
};

function dx() {
    try {
        console.log(`Version ${JSON.stringify(process.versions)}`);
        const dt = new Date();
        //const dsa = dt.toLocaleString('en-US', dtOptions);
        const dsa = dt.toLocaleString('en-US', {hour12:false});
        console.log(dsa.split('').map(d => d.charCodeAt(0).toString(16)).join(','));
        const ds = dsa.replace(/\u200e/g,"");       // work-around
        console.log(ds.split('').map(d => d.charCodeAt(0).toString(16)).join(','));
        console.log(`Date ${ds}`);
        const dm = ds.match(/^(\d+)\/(\d+)\/(\d+),? ([\d:]+)/);
        const dd = {y: dm[3], m: dm[1], d: dm[2], hms: dm[4]}
        return dd;
    }
    catch(e) {
        console.log(`Error ${e}`);
        debugger;
    }
}

console.log(dx());
dilijev commented 7 years ago

toLocaleString puts a comma after the date in the Nodejs.org version but the Chakra version does not.

I see that there's a comma between the date and the time. Also allowed per spec, but the difference is noted.

> eshost -is test.js
## Source
const dtOptionsChakraBug = {
    timeZone: "America/New_York",
    timeZoneName: "short",
    hour12: false,
};
const dtOptionsSimple = {
    hour12: false,
};

function dx(dtOptions) {
    try {
        // print(`Version ${JSON.stringify(process.versions)}`);
        const dt = new Date();
        const dsa = dt.toLocaleString('en-US', dtOptions);
        print(dsa.split('').map(d => d.charCodeAt(0).toString(16)).join(','));
        const ds = dsa.replace(/\u200e/g,"");       // work-around
        print(ds.split('').map(d => d.charCodeAt(0).toString(16)).join(','));
        print(`Date ${ds}`);
        const dm = ds.match(/^(\d+)\/(\d+)\/(\d+),? ([\d:]+)/);
        const dd = {y: dm[3], m: dm[1], d: dm[2], hms: dm[4]}
        return dd;
    }
    catch(e) {
        print(`Error ${e}`);
        return null;
    }
}

print(JSON.stringify(dx(dtOptionsSimple)));
// print('\ndtOptionsChakraBug');
// print(JSON.stringify(dx(dtOptionsChakraBug)));

#### d8, sm
36,2f,32,2f,32,30,31,37,2c,20,31,37,3a,32,38,3a,32,37
36,2f,32,2f,32,30,31,37,2c,20,31,37,3a,32,38,3a,32,37
Date 6/2/2017, 17:28:27
{"y":"2017","m":"6","d":"2","hms":"17:28:27"}

#### ch-1.2.3, ch-1.3.2, ch-1.5.0, ch-1.4.4, ch-dev-rel, ch-master, ch-dev-dbg
200e,36,200e,2f,200e,32,200e,2f,200e,32,30,31,37,200e,20,200e,31,37,200e,3a,200e,32,38,200e,3a,200e,32,37
36,2f,32,2f,32,30,31,37,20,31,37,3a,32,38,3a,32,37
Date 6/2/2017 17:28:27
{"y":"2017","m":"6","d":"2","hms":"17:28:27"}

#### node
36,2f,32,2f,32,30,31,37,2c,20,31,37,3a,32,38,3a,32,37
36,2f,32,2f,32,30,31,37,2c,20,31,37,3a,32,38,3a,32,37
Date 6/2/2017, 17:28:27
{"y":"2017","m":"6","d":"2","hms":"17:28:27"}

#### node-ch
200e,36,200e,2f,200e,32,200e,2f,200e,32,30,31,37,200e,20,200e,31,37,200e,3a,200e,32,38,200e,3a,200e,32,37
36,2f,32,2f,32,30,31,37,20,31,37,3a,32,38,3a,32,37
Date 6/2/2017 17:28:27
{"y":"2017","m":"6","d":"2","hms":"17:28:27"}

if I use dtOptions with timeZone nodejs accepts it and ignore want is doesn't like but Chakra gets an error. For my purpose {hour12:false} is sufficient so I didn't try to fully debug it.

Already tracked in #3096

dilijev commented 7 years ago

Tracking as part of #3644

zbraniecki commented 6 years ago

I'm not sure if this should be fixed FWTW. Intl operations are by design supposed to be opaque and no expectations should be made as to the output.

Bidi marks are perfectly valid parts of the result string and any operations that are attempting to retrieve data from the formatted string (like the regexp in comment 0) are working against the logic of Intl operations.

My recommendation would be to close this issue as invalid.

nojvek commented 6 years ago

Sure, but it's horribly inconsistent between browsers. Regex on dates is something that developers do all the time. Weird hidden characters have definitely driven me away from Edge.

On Wed, Dec 20, 2017 at 1:46 PM, Zibi Braniecki notifications@github.com wrote:

I'm not sure if this should be fixed FWTW. Intl operations are by design supposed to be opaque and no expectations should be made as to the output.

Bidi marks are perfectly valid parts of the result string and any operations that are attempting to retrieve data from the formatted string (like the regexp in comment 0) are working against the logic of Intl operations.

My recommendation would be to close this issue as invalid.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Microsoft/ChakraCore/issues/599#issuecomment-353191462, or mute the thread https://github.com/notifications/unsubscribe-auth/AA-JVP1TLEimqutVrwnC58T5i1-2qGECks5tCYAegaJpZM4HzYLg .

zbraniecki commented 6 years ago

They're only weird if you are unfamiliar with the issues of bidirectionality. Bidi chars are needed and we will be adding them. "Developers" also tend to do other bad practices like sync io. I don't believe that ignorance in the subject someone is working on is a good reason to sacrifice the quality of the API.

BobFrankston commented 6 years ago

At this point I'm with keeping the bidi because parsing the output has enough other problems and focusing on TC39 date improvements to get pat today's kludges.