simark opened this issue 6 years ago
Exhaustive docs for the format can be found here: https://docs.rs/grep-printer/0.1.0/grep_printer/struct.JSON.html
If you have any feedback on the format, it would be great to get it! Particularly since this hasn't landed in a release yet, so I can still make breaking changes to the format if there is anything seriously wrong with it.
Thanks! @marcdumais-work, then maybe we can work on this sooner rather than later? (@BurntSushi is ripgrep's author.)
@BurntSushi, I'm trying to use the JSON interface, and so far it's going well. The only thing is that we need the match position in the line in terms of characters (where one multi-byte UTF-8 character counts as one), but rg only gives the offset in bytes. I imagine most editors using rg will be interested in that value as well. Is that information readily available in rg, such that it could be output at no extra cost? Otherwise, we'll continue converting byte offsets to character offsets ourselves.
@simark Ah interesting! Generally, getting the offset in terms of characters is "the wrong thing," so I'm not sure other folks would want it. Moreover, the output isn't actually guaranteed to be UTF-8, and in that case, it's pretty difficult to determine what exactly is a character. Certainly, ripgrep never ever deals in character offsets. It's always byte offsets.
The reason why "character index" is tricky to use is because "character" is difficult to define. A common but typically incorrect way of approaching this is to store your string as a series of 32 bit integers, each representing a codepoint, and calling each codepoint a character. But multiple codepoints can join together to form a grapheme, which a human might see as a single character.
If you know what you're doing here with respect to Unicode, then feel free to ignore me! But if this sounds interesting to you, I'd be happy to elaborate further. :-) In short, if you do insist on character indices, you will definitely need to do that conversion yourself.
You are right, text encoding is complex and what I am trying to do here is most likely not accurate in all scenarios. I am only considering UTF-8 (where the arbitrary data object has the `text` field), to keep things simpler for now.

When parsing the UTF-8 string in the `text` field of the arbitrary data object, node's JSON parser reads the UTF-8 encoded string into a JavaScript string object. I am looking for which character in that string corresponds to the byte offset output by rg. Not sure how to do that...
@simark Yeah unfortunately I've not used Javascript for a while, so I'm not familiar with best practices there, but I'd probably start with a UTF-8 library? https://www.npmjs.com/package/utf8
The problem is that once you do `JSON.parse(message_from_ripgrep)`, the strings are already converted from UTF-8 to a JavaScript string. I don't think we have access to the original bytes that ripgrep's byte offsets refer to. And I suppose that if I encode the JavaScript string back to UTF-8, I may not get the same as the original?
You should get back the original, yes. If you're only dealing with `text`, then it is guaranteed to be valid UTF-8, which means the transformation to UTF-16 (which is what I assume JavaScript is using) and back to UTF-8 should be non-lossy. Because of that property, it seems like you could technically avoid the extra UTF-8 encode step and just count bytes by traversing the characters in your string, assuming you can get codepoints:
```
byte_offset := 0
char_offset := 0
for codepoint in javascriptString
    // invariant: byte_offset is where codepoint begins in the original string
    if codepoint <= 0x7F
        byte_offset += 1
    else if codepoint <= 0x07FF
        byte_offset += 2
    else if codepoint <= 0xFFFF
        byte_offset += 3
    else if codepoint <= 0x10FFFF
        byte_offset += 4
    else
        throw Exception("unreachable")
    char_offset += 1
```
I think that code should be enough to then establish a mapping of some sort or otherwise do your conversion. IIRC, there will be one tricky part of the above: you'll need to make sure Javascript hands you complete codepoints and not surrogates.
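A direct JavaScript translation of that pseudocode might look like the sketch below (`byteToCharOffset` is an illustrative name, not part of ripgrep or its output). Iterating a string with `for...of` yields whole codepoints rather than lone surrogates, which addresses the tricky part mentioned above:

```javascript
// Translate a ripgrep byte offset into a codepoint offset, by summing
// the UTF-8 byte width of each codepoint until the target is reached.
function byteToCharOffset(str, byteOffset) {
  let bytes = 0;
  let chars = 0;
  for (const cp of str) { // for...of iterates by codepoint
    if (bytes >= byteOffset) break;
    const code = cp.codePointAt(0);
    if (code <= 0x7f) bytes += 1;
    else if (code <= 0x7ff) bytes += 2;
    else if (code <= 0xffff) bytes += 3;
    else bytes += 4;
    chars += 1;
  }
  return chars;
}

// In "Var är jag?", "jag" starts at byte offset 8 ("ä" is 2 bytes in
// UTF-8) but at codepoint offset 7.
console.log(byteToCharOffset("Var är jag?", 8)); // 7
```

One caveat: `chars` here counts codepoints. If you need an index usable with `String.prototype` methods (UTF-16 code units), accumulate `cp.length` instead of `1`, since astral codepoints occupy two code units.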
cc @roblourens --- I assume you'll need to do something like this for VS Code, and I think you're using Javascript as the glue code. Do you have any advice here?
Ah ok, that makes sense. I don't know why, but I thought there would be some normalization going on. I'll try what you suggested; it should be enough 99.9% of the time.
@simark Yeah, I guess you should double check what your implementation of Javascript does. I wouldn't expect any normalization here, but it is possible!
I am hitting a weird case. I get different results when invoking rg from the command line than from node (from a test case run by mocha).
This is the command line my test runs:
```
/home/emaisin/src/ripgrep/target/release/rg --json --max-count=100 --max-columns=250 --ignore-case --fixed-strings jag /tmp/d-11885-12601-1qlg5j2.1tww
```
This is what the test receives from rg:
```json
{"type":"match","data":{"path":{"text":"/tmp/d-11885-12601-1qlg5j2.1tww/utf8-file"},"lines":{"text":"Var är jag? Varför är jag här?\n"},"line_number":1,"absolute_offset":0,"submatches":[{"match":{"text":" jag"},"start":7,"end":11},{"match":{"text":" jag"},"start":25,"end":29}]}}
```
This is what I get when running the command manually from the shell:
```json
{"type":"match","data":{"path":{"text":"/tmp/d-11885-12601-1qlg5j2.1tww/utf8-file"},"lines":{"text":"Var är jag? Varför är jag här?\n"},"line_number":1,"absolute_offset":0,"submatches":[{"match":{"text":"jag"},"start":8,"end":11},{"match":{"text":"jag"},"start":26,"end":29}]}}
```
Notice the difference in the `start` fields near the end of the lines. I looked at the environment variables in both cases, and I didn't find any significant difference. The file being searched is:
```
$ cat /tmp/d-11885-12601-1qlg5j2.1tww/utf8-file
Var är jag? Varför är jag här?
$ xxd /tmp/d-11885-12601-1qlg5j2.1tww/utf8-file
00000000: 5661 7220 c3a4 7220 6a61 673f 2020 5661  Var ..r jag?  Va
00000010: 7266 c3b6 7220 c3a4 7220 6a61 6720 68c3  rf..r ..r jag h.
00000020: a472 3f0a
```
Any idea what could cause this?
@simark In the results from your test harness, note that the match starts with a space: `{"text":" jag"}`. My bet is that your test suite is somehow searching for ` jag` (with a leading space), so I'd double check that.
@simark Also, `--max-columns` has no effect with the `--json` flag. (Most output-specific options don't impact JSON output.)
Ah damn, well spotted, thanks! Sorry about that noise.
Here's the patch:
https://github.com/simark/theia/commits/rg-json
When a ripgrep with the JSON output is released and made available through vscode-ripgrep, we can revisit it.
Thanks for your help @BurntSushi!
@simark Awesome! Thanks so much for trying it out and giving feedback!
ripgrep now seems to have an option to output results as JSON:
https://github.com/BurntSushi/ripgrep/pull/1017
We should look into it to replace the escape-sequence parsing code that we have now.