Closed by smacker 5 years ago
This report is most probably for JS driver v2.7.0
Steps to reproduce:
docker run --rm -it -p 9432:9432 bblfsh/javascript-driver:v2.7.0
wget https://gist.githubusercontent.com/smacker/833bfbbf187727a1dbf0adc72777136a/raw/e364cfe257502729bc4f7bcfd9e33d31a3819051/bundle.js
node --check bundle.js
bblfsh-cli bundle.js
couldn't parse bundle.js: transform failed: rune out of bounds: 2200447 [0, 2200314)
docker stats --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
77da50affd41 optimistic_mendeleev 0.02% 1.378GiB / 1.952GiB 70.60% 2.25MB / 13.8kB 11.2MB / 0B 15
The same happens on the recently released v2.7.1, and the JS file has valid syntax.
Funnily enough, on a smaller Linux box with only 1 GB of RAM, parsing this file hangs it forever.
Working on reducing the test case.
Minimal reproducer:
"𝓏"
Was able to reproduce in SDK.
JS uses UTF-16 code points instead of UTF-8. Need to rewrite the positional index to fix this.
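A rough sketch of what such a rewrite could look like, assuming the parser reports offsets in UTF-16 code units and the consumer wants UTF-8 byte offsets. The helper name utf16ToUtf8Offset is hypothetical and is not the SDK's actual implementation:

```javascript
// Hypothetical helper, not part of the bblfsh SDK: converts an offset counted
// in UTF-16 code units (as a JS parser reports it) into a UTF-8 byte offset.
function utf16ToUtf8Offset(source, utf16Offset) {
  let bytes = 0;
  let i = 0;
  while (i < utf16Offset && i < source.length) {
    const cp = source.codePointAt(i);
    if (cp <= 0x7f) bytes += 1;        // ASCII
    else if (cp <= 0x7ff) bytes += 2;
    else if (cp <= 0xffff) bytes += 3; // rest of the BMP (lone surrogates land here)
    else bytes += 4;                   // outside the BMP, encoded as a surrogate pair
    i += cp > 0xffff ? 2 : 1;          // a surrogate pair occupies two code units
  }
  return bytes;
}

const src = '"𝓏"'; // the minimal reproducer: 4 code units, 6 UTF-8 bytes
const endOffset = utf16ToUtf8Offset(src, src.length);
```

An offset that falls between the two halves of a surrogate pair is rounded up to the end of that character here; a real fix would have to decide how to treat that case.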
JS uses UTF-16 code points instead of UTF-8
I'm not an expert, but from https://mathiasbynens.be/notes/javascript-encoding it seems that it uses UCS-2, which is different from UTF-16: it lacks the notion of surrogate pairs.
The JS VM indexes code units, not code points, so for characters outside the BMP we have
'𝌆'.length == 2
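For the character from the minimal reproducer, the three ways of counting give three different answers. A small illustration, assuming a runtime where TextEncoder is available as a global (recent Node.js or a browser):

```javascript
const s = "𝓏"; // U+1D4CF, outside the BMP

const codeUnits = s.length;                           // 2: UTF-16 code units
const codePoints = Array.from(s).length;              // 1: Unicode code points
const utf8Bytes = new TextEncoder().encode(s).length; // 4: UTF-8 bytes

console.log({ codeUnits, codePoints, utf8Bytes });
```

Any position after such a character therefore differs depending on which unit the offset is counted in.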
@dennwc I would appreciate it if you could help me understand whether that affects e.g. https://github.com/bblfsh/sdk/pull/393, but from what I can tell, JS strings differ from UTF-16:
unmatched surrogate halves are allowed, surrogates in the wrong order are allowed, and surrogate halves are exposed as separate characters
I'm not sure if it matters though. The way it treats those characters may be different, but as long as those characters are UTF-16-something it should work.
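For reference, the quoted differences can be observed directly. This is only an illustration of JS string behavior, assuming a runtime with a global TextEncoder; it says nothing about how the driver or SDK handles such input:

```javascript
const pair = "\uD834\uDF06";    // well-formed surrogate pair, U+1D306
const lone = "\uD834";          // high surrogate without its low half
const swapped = "\uDF06\uD834"; // halves in the wrong order

// Both malformed strings are accepted as ordinary string values.
const loneLength = lone.length;                  // 1
const swappedPoints = Array.from(swapped).length; // 2: each half is iterated separately
const pairPoints = Array.from(pair).length;       // 1: the pair is one code point
const loneValue = lone.codePointAt(0);            // 0xD834: the half is exposed as-is

// Encoding to UTF-8 cannot represent a lone surrogate, so TextEncoder
// substitutes U+FFFD (bytes EF BF BD).
const loneBytes = Array.from(new TextEncoder().encode(lone));
```

So a well-formed file behaves like UTF-16 for offset purposes, while malformed strings only matter if they can actually reach the parser output.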
To be more specific:
File: https://gist.github.com/smacker/833bfbbf187727a1dbf0adc72777136a (hopefully uploading to gist didn't break it)
bblfsh 2.11.8-drivers