Syntax highlighting sometimes incorrect with multiline constructs

dandavison commented 4 years ago

If delta receives a hunk that looks something like the example below, then it will syntax highlight the code as a string. (This is a python example, but the problem exists in any language with multiline string literals I think):

    ...and this is the end of my docstring.
    """
    x = some_function("some_string")  # This line should be syntax-highlighted as code.

That seems like an inevitable ambiguity. However, are there perhaps some heuristics that could be introduced to allow delta to guess which side of the """ is code and which side string literal?

dandavison commented 4 years ago

From @kofls in #316

I think a simple solution would be to get the syntax style per file and use that for each diff section

Do you mean that delta should try to locate the entire file on disk, and use that to determine exactly what is going on in a diff that shows just a fragment of the file? That's a good point, delta could try to do that. It won't always work of course, because the diff might just be piped to delta and the files mentioned in the diff might not even exist, but delta does already use libgit2 to see if it is in a git repo. As you say, it could be an option.

kofls commented 4 years ago

Yup, that was what I meant. I hadn't considered that the diff source might not be files.

YodaEmbedding commented 3 years ago

Quick n' dirty: I think that syntax highlighting both portions (even if incorrect) is less disastrous than not highlighting the code at all. Perhaps a quick and simple workaround would be to ignore """ when highlighting, so that everything gets highlighted?

Robust and accurate: delta knows the file location, line numbers, and the line number of function definition. So read the file, skip to the function definition, then highlight from there until the end of the function. Now extract the relevant lines from the diff and display those. There are probably some edge cases to handle but that seems to be a good way to do this properly.

Alphare commented 3 years ago

Robust and accurate: delta knows the file location, line numbers, and the line number of function definition

FYI this would only work with Git which is content-addressable, and not (as simply) with Mercurial. I'm not even sure that every operation in Git can be recovered by looking up a file given the information available in a diff, but I don't know enough about Git.

Alphare commented 3 years ago

There exists the git diff -U and hg diff -U flags that give more context. delta could use this flag and then strip down the added context. Note that the theoretical "right" approach in Mercurial would be to use some kind of extdiff mechanism, but that's probably more work. :)

TheKevJames commented 2 years ago

Potentially a middle-ground between @YodaEmbedding's options:

Try to highlight both sides, capture how successful we were (does bat expose details on what the guessed language was? This could even be a simple "bat guessed the wrong language for this file, probably a comment!"), only show the highlights if error_count < n.

dandavison commented 2 years ago

Hi @TheKevJames, delta does not use bat for syntax highlighting ~or guessing languages~. I'm curious -- could you tell me where you got the impression that it does? You're not alone, it's a common misconception!

TheKevJames commented 2 years ago

@dandavison oh, interesting! I took a quick glance at the README before writing my above message and came across the following lines:

Language syntax highlighting with the same syntax-highlighting themes as bat All the syntax-highlighting color themes that are available with bat are available with delta:

They definitely don't state that "bat is used for syntax highlighting" directly, but my read from them implied it. IMO "why mention bat if it's not relevant here?"

Guesswork start here: perhaps the library you use for syntax highlighting is just the same one as bat's, or something like that? Maybe mentioning the library could help? eg.

"All the syntax-highlighting color themes that are available with $library -- the same library used by bat -- are available with delta"

dandavison commented 2 years ago

Actually, re-reading what you wrote, you're just talking about guessing languages, not syntax highlighting. OK, so the way this works is:

Delta uses some code from bat to map a file name to a language.
Delta uses syntect to syntax highlight that language.

dandavison commented 2 years ago

Yes, exactly, you're right -- delta and bat use the same syntax highlighting library. Thanks, I'll change the README as you suggest -- this confuses lots of people. I thought it might help to mention that they're the same themes to reassure people that they will be able to create a visually consistent experience using both programs.

TheKevJames commented 2 years ago

Definitely agree on mentioning the similarity to bat -- the two having the same themes was certainly relevant to me when I started looking into using delta! But yeah, mentioning/linking to syntect will certainly help folks looking for a more specific understanding of what lib does what thing.

So in that case, it looks to me like the functionality I described would best be implemented via changes to syntect code; eg. they would need to offer someway to check if a given SyntaxSet "seems to be correct".

My initial thought is that "seems to be correct" might be approximate by checking on the number of modifications: my naive assumption is that in cases where we're highlighting the code we would get many tokens highlighted with their syntax but in cases where we're highlighting the comment we would get few highlights (or perhaps "all the same"? Not sure if there's a highlight for "this token doesn't make sense here").

So in that case, an option which doesn't require syntect changes might be to modify this chunk of delta.

Pardon my incredible lack of rust fluency here, but in attempted psuedo-code / bad code, we might be able to do something like:

let mut line_styles = Vec::new();
for (line, _) in lines.iter() {
    let sections = highlighter
        .highlight_line(line, &config.syntax_set)
        .unwrap();
    # MIN_STYLED_TOKENS might be "at least n tokens should be highlighted"
    # alternatively, it could be "at least n% of the line is hightlighted"
    let is_styled = sections
        .reduce(|acc, (style, _)| {
            if (style != NULL && style != INVALID_TOKEN) { acc += 1 }
        }) <= MIN_STYLED_TOKENS;
    line_styles.push((sections, is_styled));
}

# ie. "split our vector into lines before and lines after a block comment"
let index = line_styles.iter().position(|(line, _)| line == "\"\"\"").unwrap();
let styled_tokens_before = line_styles[..index]
    .filter(|(_, is_styled)| { is_styled })
    .count();
let styled_tokens_after = line_styles[index..]
    .filter(|(_, is_styled)| { is_styled })
    .count();

# This could also be thresholded: eg. "if they're close enough, style both"
if (styled_tokens_before > styled_tokens_after) {
    line_sections.push(line_styles[..index]);
    for (line, _) in lines[index..].iter() {
        line_sections.push(vec![(config.null_syntect_style, line.as_str())]);
    }
} else {
    for (line, _) in lines[..index].iter() {
        line_sections.push(vec![(config.null_syntect_style, line.as_str())]);
    }
    line_sections.push(line_styles[index..]);
}

ofek commented 2 weeks ago

This is still an issue, example we encountered today:

dandavison / delta

Syntax highlighting sometimes incorrect with multiline constructs #117