bblfsh / javascript-driver

GNU General Public License v3.0

transform failed: rune out of bounds: 2200447 [0, 2200314) #71

Closed smacker closed 5 years ago

smacker commented 5 years ago

File: https://gist.github.com/smacker/833bfbbf187727a1dbf0adc72777136a (hopefully uploading to gist didn't break it)

bblfsh 2.11.8-drivers

bzz commented 5 years ago

This report is most probably for JS driver v2.7.0.

Steps to reproduce:

docker run --rm -it -p 9432:9432 bblfsh/javascript-driver:v2.7.0
wget https://gist.githubusercontent.com/smacker/833bfbbf187727a1dbf0adc72777136a/raw/e364cfe257502729bc4f7bcfd9e33d31a3819051/bundle.js

node --check bundle.js

bblfsh-cli bundle.js
 couldn't parse bundle.js: transform failed: rune out of bounds: 2200447 [0, 2200314)

docker stats --no-stream
CONTAINER ID        NAME                   CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
77da50affd41        optimistic_mendeleev   0.02%               1.378GiB / 1.952GiB   70.60%              2.25MB / 13.8kB     11.2MB / 0B         15

The same happens on the recently released v2.7.1, and the JS file has valid syntax.

Funnily enough, on a smaller Linux box with only 1 GB of RAM, parsing this file hangs forever.

dennwc commented 5 years ago

Working on reducing the test case.

dennwc commented 5 years ago

Minimal reproducer:

"𝓏"

dennwc commented 5 years ago

Was able to reproduce in SDK.

dennwc commented 5 years ago

JS uses UTF-16 code points instead of UTF-8. Need to rewrite the positional index to fix this.
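To picture what rewriting the positional index could involve, here is a minimal sketch. It is not the code from the SDK, and `buildOffsetTable` is a hypothetical name used only for illustration. It maps UTF-16 code-unit offsets (what a JS parser would report) to code-point (rune) offsets and UTF-8 byte offsets.

```javascript
// Hypothetical helper, not part of bblfsh: builds lookup tables from
// UTF-16 code-unit offsets to code-point (rune) and UTF-8 byte offsets.
function buildOffsetTable(source) {
  const runeAt = new Array(source.length + 1);
  const byteAt = new Array(source.length + 1);
  let unit = 0; // current offset in UTF-16 code units
  let rune = 0; // current offset in code points
  let byte = 0; // current offset in UTF-8 bytes

  for (const ch of source) {   // for...of iterates by code point
    const cp = ch.codePointAt(0);
    const width = ch.length;   // 1 code unit, or 2 for a surrogate pair
    for (let i = 0; i < width; i++) {
      // An offset pointing inside a pair maps to the start of that pair.
      runeAt[unit + i] = rune;
      byteAt[unit + i] = byte;
    }
    unit += width;
    rune += 1;
    byte += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
  }
  // The end-of-input offset is valid too (end positions are exclusive).
  runeAt[unit] = rune;
  byteAt[unit] = byte;
  return { runeAt, byteAt };
}

// The minimal reproducer: a string literal holding one non-BMP character.
const src = '"𝓏"';
const { runeAt, byteAt } = buildOffsetTable(src);
console.log(src.length);         // 4 code units
console.log(runeAt[src.length]); // 3 code points
console.log(byteAt[src.length]); // 6 UTF-8 bytes
```

Under these assumptions, an end offset of 4 reported in code units would fall outside a source measured as 3 runes, which presumably is the kind of mismatch behind the "rune out of bounds" error.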

dennwc commented 5 years ago

https://github.com/bblfsh/sdk/pull/392

bzz commented 5 years ago

> JS uses UTF-16 code points instead of UTF-8

I'm not an expert, but from https://mathiasbynens.be/notes/javascript-encoding it seems that it uses UCS-2, which differs from UTF-16 in that it lacks the notion of surrogate pairs.

The JS VM indexes code units, not code points, so for characters outside the BMP we have

'𝌆'.length == 2
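A small sketch of how the same one-character string measures differently depending on the unit. It uses only standard JS built-ins; the variable names are just for illustration.

```javascript
// Measure one non-BMP character in three different units.
const s = "𝌆"; // a single character outside the BMP

const codeUnits = s.length;                           // UTF-16 code units
const codePoints = Array.from(s).length;              // Unicode code points
const utf8Bytes = new TextEncoder().encode(s).length; // UTF-8 bytes

console.log({ codeUnits, codePoints, utf8Bytes });
// { codeUnits: 2, codePoints: 1, utf8Bytes: 4 }
```

Any position index computed in one of these units will drift from the others by one or more for each such character that precedes it.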

@dennwc I would appreciate it if you could help me understand whether that affects e.g. https://github.com/bblfsh/sdk/pull/393, but from what I can tell, JS strings differ from UTF-16:

> unmatched surrogate halves are allowed, surrogates in the wrong order are allowed, and surrogate halves are exposed as separate characters
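The three properties in that quote can be checked with a short sketch using standard JS built-ins only; the variable names are made up for illustration.

```javascript
// Surrogate halves of U+1D306 ("𝌆"): high 0xD834, low 0xDF06.
const pair = "\uD834\uDF06";    // well-formed surrogate pair
const lone = "\uD834";          // unmatched high surrogate
const swapped = "\uDF06\uD834"; // halves in the wrong order

// A well-formed pair decodes to one code point, but is indexed as two units.
console.log(pair.codePointAt(0).toString(16)); // 1d306
console.log(pair.charCodeAt(0).toString(16));  // d834
console.log(pair.charCodeAt(1).toString(16));  // df06

// Unmatched and misordered halves are still valid JS strings.
console.log(lone.length);                // 1
console.log(Array.from(swapped).length); // 2, neither half pairs up

// Encoding to UTF-8 cannot represent a lone surrogate, so TextEncoder
// substitutes U+FFFD (bytes EF BF BD).
const loneBytes = Array.from(new TextEncoder().encode(lone));
console.log(loneBytes); // [ 239, 191, 189 ]
```

So a JS string is not guaranteed to round-trip through UTF-8, although source files that are valid UTF-8 on disk cannot contain raw lone surrogates in the first place.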

dennwc commented 5 years ago

I'm not sure it matters, though. The way it treats those characters may differ, but as long as those characters are some form of UTF-16, it should work.

To be more specific: