eclipse-theia / theia

Eclipse Theia is a cloud & desktop IDE framework implemented in TypeScript.
http://theia-ide.org
Eclipse Public License 2.0

ripgrep: Use the new json output #2622

Open simark opened 6 years ago

simark commented 6 years ago

ripgrep now seems to have an option to output results as json:

https://github.com/BurntSushi/ripgrep/pull/1017

We should look into it, to replace the escape sequence parsing code that we have now.

BurntSushi commented 6 years ago

Exhaustive docs for the format can be found here: https://docs.rs/grep-printer/0.1.0/grep_printer/struct.JSON.html

If you have any feedback on the format, it would be great to get it! Particularly since this hasn't landed in a release yet, so I can still make breaking changes to the format if there is anything seriously wrong with it.
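
For reference, each line of `rg --json` output is a standalone JSON object. A minimal consumer in Node might look like this (a sketch only; the sample match line is taken from later in this thread):

```javascript
// One "match" message from `rg --json` (sample taken from this thread);
// every line of output is a standalone JSON object.
const sample = `{"type":"match","data":{"path":{"text":"/tmp/d-11885-12601-1qlg5j2.1tww/utf8-file"},"lines":{"text":"Var är jag?  Varför är jag här?\\n"},"line_number":1,"absolute_offset":0,"submatches":[{"match":{"text":"jag"},"start":8,"end":11},{"match":{"text":"jag"},"start":26,"end":29}]}}`;

const msg = JSON.parse(sample);
if (msg.type === "match") {
  for (const sub of msg.data.submatches) {
    // start/end are byte offsets into the matched line, not character indices
    console.log(`${msg.data.path.text}:${msg.data.line_number} bytes ${sub.start}-${sub.end}: ${sub.match.text}`);
  }
}
```

The byte-offset nature of `start`/`end` is exactly what the rest of this thread is about.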

simark commented 6 years ago

Thanks! @marcdumais-work, maybe we can work on this sooner rather than later? (@BurntSushi is ripgrep's author.)

simark commented 6 years ago

@BurntSushi, I'm trying to use the JSON interface, and so far it's going well. The only thing is that we need the match position in the line in terms of characters (where one multi-byte UTF-8 character counts as one position), but rg only gives the offset in bytes. I imagine most editors using rg will be interested in that value as well. Is that information readily available in rg, such that it could be output at no extra cost? Otherwise, we'll continue converting byte offsets to character offsets ourselves.

BurntSushi commented 6 years ago

@simark Ah interesting! Generally, getting the offset in terms of characters is "the wrong thing," so I'm not sure other folks would want it. Moreover, the output isn't actually guaranteed to be UTF-8, and in that case, it's pretty difficult to determine what exactly is a character. Certainly, ripgrep never, ever deals in character offsets. It's always byte offsets.

The reason why "character index" is tricky to use is that "character" is difficult to define. A common but typically incorrect approach is to store your string as a series of 32-bit integers, each representing a codepoint, and to call each codepoint a character. But multiple codepoints can join together to form a grapheme, which a human might see as a single character.

If you know what you're doing here with respect to Unicode, then feel free to ignore me! But if this sounds interesting to you, I'd be happy to elaborate further. :-) In short, if you do insist on character indices, you will definitely need to do that conversion yourself.
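
To make the codepoint-versus-grapheme distinction concrete, here is a small illustration in Node (not from the thread; just a sketch):

```javascript
// "é" can be one precomposed codepoint, or two codepoints ("e" plus a
// combining accent) that render as a single visible character.
const precomposed = "\u00E9"; // é as U+00E9 LATIN SMALL LETTER E WITH ACUTE
const combined = "e\u0301";   // e + U+0301 COMBINING ACUTE ACCENT

console.log([...precomposed].length); // 1 codepoint
console.log([...combined].length);    // 2 codepoints, one "character" on screen

// They only compare equal after Unicode normalization:
console.log(precomposed === combined);                  // false
console.log(precomposed === combined.normalize("NFC")); // true
```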

simark commented 6 years ago

@simark Ah interesting! Generally, getting the offset in terms of characters is "the wrong thing," so I'm not sure other folks would want it. Moreover, the output isn't actually guaranteed to be UTF-8, and in that case, it's pretty difficult to determine what exactly is a character. Certainly, ripgrep never, ever deals in character offsets. It's always byte offsets.

...

You are right, text encoding is complex, and what I am trying to do here is most likely not accurate in all scenarios. To keep things simple, I am only considering UTF-8 for now (where the arbitrary data object has the text field).

When parsing the UTF-8 string in the text field of the arbitrary data object, node's JSON parser reads the UTF-8 encoded string into a JavaScript string object. I am looking for which character in that string corresponds to the byte offset output by rg. I'm not sure how to do that...
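
One possible approach in Node (a sketch, not part of the original discussion; `byteToCharOffset` is a hypothetical helper): re-encode the line with `Buffer` and decode just the byte prefix. This assumes the line is valid UTF-8 and the offset falls on a codepoint boundary, which holds for ripgrep's match boundaries:

```javascript
// Sketch: map a ripgrep byte offset to a JavaScript string index
// (UTF-16 code units) by re-encoding the line and decoding the prefix.
function byteToCharOffset(lineText, byteOffset) {
  const bytes = Buffer.from(lineText, "utf8");
  return bytes.subarray(0, byteOffset).toString("utf8").length;
}

// "ä" is two bytes in UTF-8, so byte offset 8 maps to string index 7:
console.log(byteToCharOffset("Var är jag?", 8)); // 7
```

Note that the result counts UTF-16 code units (an astral character counts as two), which is exactly what JavaScript string indexing and `String.prototype.slice` expect.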

BurntSushi commented 6 years ago

@simark Yeah unfortunately I've not used Javascript for a while, so I'm not familiar with best practices there, but I'd probably start with a UTF-8 library? https://www.npmjs.com/package/utf8

simark commented 6 years ago

The problem is that once you do JSON.parse(message_from_ripgrep), the strings have already been converted from UTF-8 to JavaScript strings. I don't think we have access to the original bytes that ripgrep's byte offsets refer to. And I suppose that if I encode the JavaScript string back to UTF-8, I may not get the same bytes as the original?
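
For what it's worth, this is easy to check empirically in Node (a quick sketch): decoding valid UTF-8 to a string and re-encoding it is lossless, and no normalization is applied along the way.

```javascript
// Round-trip check: UTF-8 bytes -> JS string -> UTF-8 bytes.
const original = Buffer.from([0x56, 0x61, 0x72, 0x20, 0xc3, 0xa4, 0x72]); // "Var är"
const asString = original.toString("utf8");
const reEncoded = Buffer.from(asString, "utf8");

console.log(asString);                   // Var är
console.log(reEncoded.equals(original)); // true
```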

BurntSushi commented 6 years ago

You should get back the original, yes. If you're only dealing with text, then it is guaranteed to be valid UTF-8, which means the transformation to UTF-16 (which is what I assume Javascript is using) and back to UTF-8 should be non-lossy. Because of that property, it seems like you could technically avoid the extra UTF-8 encode step and just count bytes by traversing the characters in your string, assuming you can get codepoints:

byte_offset := 0
char_offset := 0
for codepoint in javascriptString
    // invariant: byte_offset is where codepoint begins in the original string
    if codepoint <= 0x7F
        byte_offset += 1
    else if codepoint <= 0x07FF
        byte_offset += 2
    else if codepoint <= 0xFFFF
        byte_offset += 3
    else if codepoint <= 0x10FFFF
        byte_offset += 4
    else
        throw Exception("unreachable")
    char_offset += 1

I think that code should be enough to then establish a mapping of some sort or otherwise do your conversion. IIRC, there will be one tricky part of the above: you'll need to make sure Javascript hands you complete codepoints and not surrogates.
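
The pseudocode above translates almost directly to JavaScript: `for...of` iterates a string by codepoint rather than by UTF-16 code unit, which takes care of the surrogate caveat. A sketch (function names are hypothetical):

```javascript
// UTF-8 encoded length of a single codepoint.
function utf8Length(codepoint) {
  if (codepoint <= 0x7f) return 1;
  if (codepoint <= 0x7ff) return 2;
  if (codepoint <= 0xffff) return 3;
  return 4; // valid codepoints end at 0x10FFFF
}

// Walk codepoints, accumulating bytes until we reach the target offset.
function byteOffsetToStringIndex(text, byteOffset) {
  let bytes = 0;
  let index = 0; // in UTF-16 code units, i.e. a plain JS string index
  for (const cp of text) { // yields whole codepoints, surrogate pairs intact
    if (bytes >= byteOffset) break;
    bytes += utf8Length(cp.codePointAt(0));
    index += cp.length; // 2 for astral codepoints, 1 otherwise
  }
  return index;
}

console.log(byteOffsetToStringIndex("Var är jag?", 8)); // 7
```

Counting `cp.length` instead of `1` per codepoint means the result can be used directly to index the JavaScript string.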

cc @roblourens --- I assume you'll need to do something like this for VS Code, and I think you're using Javascript as the glue code. Do you have any advice here?

simark commented 6 years ago

You should get back the original, yes. If you're only dealing with text, then it is guaranteed to be valid UTF-8, which means the transformation to UTF-16 (which is what I assume Javascript is using) and back to UTF-8 should be non-lossy. Because of that property, it seems like you could technically avoid the extra UTF-8 encode step and just count bytes by traversing the characters in your string, assuming you can get codepoints:

...

Ah, OK, that makes sense. I don't know why, but I thought there would be some normalization going on. I'll try what you suggest; it should be enough 99.9% of the time.

BurntSushi commented 6 years ago

@simark Yeah, I guess you should double check what your implementation of Javascript does. I wouldn't expect any normalization here, but it is possible!

simark commented 6 years ago

I am hitting a weird case. I get different results when invoking rg from the command line than from node (from a test case run by mocha).

This is the command line my test runs:

/home/emaisin/src/ripgrep/target/release/rg --json --max-count=100 --max-columns=250 --ignore-case --fixed-strings  jag /tmp/d-11885-12601-1qlg5j2.1tww

This is what the test receives from rg:

{"type":"match","data":{"path":{"text":"/tmp/d-11885-12601-1qlg5j2.1tww/utf8-file"},"lines":{"text":"Var är jag?  Varför är jag här?\n"},"line_number":1,"absolute_offset":0,"submatches":[{"match":{"text":" jag"},"start":7,"end":11},{"match":{"text":" jag"},"start":25,"end":29}]}}

This is what I get when running the command manually from the shell:

{"type":"match","data":{"path":{"text":"/tmp/d-11885-12601-1qlg5j2.1tww/utf8-file"},"lines":{"text":"Var är jag?  Varför är jag här?\n"},"line_number":1,"absolute_offset":0,"submatches":[{"match":{"text":"jag"},"start":8,"end":11},{"match":{"text":"jag"},"start":26,"end":29}]}}

Notice the difference in the start fields near the end of the lines. I looked at the environment variables in both cases, I didn't find any significant difference. The file being searched is:

$ cat /tmp/d-11885-12601-1qlg5j2.1tww/utf8-file
Var är jag?  Varför är jag här?
$ xxd /tmp/d-11885-12601-1qlg5j2.1tww/utf8-file
00000000: 5661 7220 c3a4 7220 6a61 673f 2020 5661  Var ..r jag?  Va
00000010: 7266 c3b6 7220 c3a4 7220 6a61 6720 68c3  rf..r ..r jag h.
00000020: a472 3f0a   

Any idea what could cause this?

BurntSushi commented 6 years ago

@simark In the results from your test harness, note that the match starts with a space: {"text":" jag"}. My bet is that your test suite is somehow searching for " jag" (with a leading space), so I'd double check that.

BurntSushi commented 6 years ago

@simark Also, --max-columns has no effect with the --json flag. (Most output-specific options don't impact the JSON output.)

simark commented 6 years ago

@simark In the results from your test harness, note that the match starts with a space: {"text":" jag"}. My bet is that your test suite is somehow searching for " jag" (with a leading space), so I'd double check that.

Ah damn, well spotted, thanks! Sorry about that noise.

simark commented 6 years ago

Here's the patch:

https://github.com/simark/theia/commits/rg-json

When a ripgrep release with the JSON output is made available through vscode-ripgrep, we can revisit this.

Thanks for your help @BurntSushi!

BurntSushi commented 6 years ago

@simark Awesome! Thanks so much for trying it out and giving feedback!