Open mourner opened 5 years ago
This seems to be a bug in flow-parser
which produces incorrect range
values in AST for non-latin characters. E.g. parsing // п
and // n
produces [0, 5]
and [0, 4]
ranges respectively, although both should be the same.
same root cause as #5793, I believe. the lexer reports its position in unicode codepoints, of which these are 1, but they are multibyte.
@mroch I believe it's a recent regression, because https://astexplorer.net/ (which uses Flow v0.94.0) shows identical AST for both cases ([0, 4]
).
this is 0.94, when hovering over the FunctionDeclaration in the AST. so it's been there a while, but flow-remove-types regressed because it moved to flow-parser.
this appears to only be an issue with the range
. the loc
matches babylon.
it's a bit easier to see if you do it all on one line: https://flow.org/try/#0PQKk-CAsBmD2MNxA (see the AST tab). since it's all on line 1, the range should match the columns in loc
.
@mroch ah, yeah, you're right. The link I commented earlier triggers the bug if you add a new line after the comment.
hrm i think we intentionally made range
work like this. it's the byte offset in the file, which is what you'd want if you wanted to seek to read that part of the file from disk. but it's not what you'd pass to substring
in JS, which is what tools like flow-remove-types do.
cc @nmote I think we should change range
to match esprima and babylon, and potentially add (an option for) another field like byte_range
or something. will need to figure out which tools need to be updated to use that.
Thanks for looking into this @mroch! Agreed about bringing Flow AST more in line with other parsers.
In the meantime, to fix the breaking bug with flow-remove-types
sooner, should we just rewrite the logic so that it only depends on line
/column
, not range
? I could take a stab at a PR.
alternatively, if it could work with UTF8 bytes instead of JS strings, that would make the existing ranges correct.
for example, perhaps changing flow-remove-types to use Buffers? I think the offsets of a Buffer are in bytes, not codepoints. modifying a buffer in place instead of lots of string concatenation could be a perf win too. https://github.com/facebook/flow/blob/a05cbb202197d57e5e909e11dcbb859f11c48641/packages/flow-remove-types/index.js#L107
@mroch yep, will do!
We are running into the same problem, in our case with a ™ in a comment.
// @flow
import React from 'react';
/*
This should Just Work™
*/
export const SomeComponent = (props: Object) => (
<div {...props}>
This breaks
</div>
);
becomes
import React from 'react';
/*
This should Just Work™
*/
export const SomeComponent = (props:=> (
<div {...props}>
This breaks
</div>
);
Looks like it's been a few months now, and it shows up in all v2 releases of flow-remove-types
-- any updates on a fix?
Flow version: 0.100.0
Expected behavior
Given the following input:
flow-remove-types
should produce:Actual behavior
It produces invalid JS:
The non-latin character is the trigger. Without it, it works correctly. This seems like quite a serious bug considering how common it is to use non-latin characters in comments (e.g. comment in other languages, draw arrows etc.).
cc @mroch @samwgoldman