SilasMarvin / lsp-ai

LSP-AI is an open-source language server that serves as a backend for AI-powered functionality, designed to assist and empower software engineers, not replace them.

Post Process Extractors #41

Open lemontheme opened 1 month ago

lemontheme commented 1 month ago

[Edit because I pressed some random shortcut that submitted this before it was done]

Hi there. New to this tool and certainly not an LSP expert. I've been going through the prompt template examples, and it struck me as odd that nowhere is the LLM actually told what language it's dealing with. Sure, it can probably infer it from the context, particularly if there are few-shot examples. But why make the LLM figure it out when we already know it?

I've been reading the LSP spec to see if that information is at any point provided to the LSP server by the client. It turns out Document Synchronization messages contain an object with a languageId key, with values such as python. The logs of my editor (Helix) confirm that this key is sent by the client on message requests of this type.
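For reference, the textDocument/didOpen notification carries that field. A trimmed example of what the client sends (the uri and text here are just illustrative):

{
    "jsonrpc": "2.0",
    "method": "textDocument/didOpen",
    "params": {
        "textDocument": {
            "uri": "file:///path/to/example.py",
            "languageId": "python",
            "version": 1,
            "text": "def greet(name):\n    print(f\"Hello, \")\n"
        }
    }
}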

I would use this key either by including it in the system prompt or by formatting code as Markdown blocks and adding it after the first set of triple backticks. An example of the second:

{
      "role": "user",
      "content": "```{LANGUAGE}\ndef greet(name):\n    print(f\"Hello, {<CURSOR>}\")\n```"
}

What do you think? Is this something that could easily be added?

SilasMarvin commented 1 month ago

This is a great idea and easy to add! Using it would look something like:

{
      "role": "user",
      "content": "```{LANGUAGE}\n{CODE}```"
}

Where the prompt when expanded would look like:

{
      "role": "user",
      "content": "```python\ndef greet(name):\n    print(f\"Hello, {<CURSOR>}\")\n```"
}

One thing we need to think through is the LLM response. If we send it markdown, will it respond with markdown? We currently don't post-process LLM responses, but we may need to if we begin sending it markdown. In other words, we may want to provide some kind of post-process regex response matcher that extracts text, so users can specify that the code to insert from the LLM response should be the text inside the markdown code block.
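As a rough sketch of what that might look like in the config (the key names here are hypothetical, not an existing option), a regex that captures the body of the first fenced code block:

{
    "post_process": {
        "extractor": "(?s)```\\w*\\n(.*?)```"
    }
}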

This extractor would actually be really useful for chain of thought prompting: https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/chain-of-thought#example-writing-donor-emails-structured-guided-cot

Notice that the actual answer from Claude is between <answer> and </answer>. To let users benefit from chain of thought prompting, we would need to allow them to define custom extractors (or have a preset list of extractors like Markdown, etc.).
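A custom extractor for that pattern would be the same idea with a different expression (again, hypothetical config):

{
    "post_process": {
        "extractor": "(?s)<answer>(.*?)</answer>"
    }
}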

If you want to take a crack at adding it I would be happy to help make some suggestions, otherwise we can add it to our roadmap.

lemontheme commented 1 month ago

Happy to hear you agree!

[...] we may want to provide some kind of post-process regex response matcher

Funnily enough, I actually typed up another issue right after arguing for exactly this, but I started second-guessing its true necessity halfway through and left it at that.

If we send it markdown will it respond with markdown?

Not necessarily. In fact, I've empirically verified as much by modifying the user messages in the example chat mode prompt in the wiki so that they all contain Markdown code blocks, while leaving the assistant messages untouched. The result: no difference in the generated response from before, i.e. no triple backticks to parse out. So it seems a few in-context example pairs where the answer is 'bare' sufficiently condition the final response to also be bare.
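To illustrate the kind of pair I mean (made up here, not the exact wiki prompt), the user turn gets a code fence while the assistant turn stays bare:

[
    {
        "role": "user",
        "content": "```python\ndef sum_list(items):\n    <CURSOR>\n```"
    },
    {
        "role": "assistant",
        "content": "return sum(items)"
    }
]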

On the other hand, few-shot learning leads to slower responses due to the increased prompt length. I just did an experiment with Claude 3.5 Sonnet (in the browser dashboard):

[Screenshot of the Claude 3.5 Sonnet experiment, 2024-07-22 21:46]

So by starting the chat completion with a final prefilled assistant message, you get a markdown code block. The generated code indentation is correct when the output is formatted as a code block. Interestingly, when I repeat the example without the prefilled assistant message, I get a bare code response with unindented code. Not what you'd want here.
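In chat-completion terms, the prefill trick just means ending the message list with a partial assistant turn that the model continues. Something like this (illustrative only, and only for backends that accept a trailing assistant message):

[
    {
        "role": "user",
        "content": "Complete the code at <CURSOR>:\n```python\ndef greet(name):\n    print(f\"Hello, {<CURSOR>}\")\n```"
    },
    {
        "role": "assistant",
        "content": "```python\n"
    }
]

Because the model picks up from inside the opened fence, the completion comes back as a properly indented code block.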

This extractor would actually be really useful for chain of thought prompting:

Yes! I believe that's where the real value of an additional post-processing mechanism lies. It occurred to me that the system message in the examples asks for a step-by-step response, but none of the few-shot examples demonstrate it, nor is there any way to separate the thought part of the response from the answer afterward.

Changing this would allow for an enormous amount of creativity.

preset list of extractors like Markdown

... Markdown code blocks, the Claude XML tags, ... My thoughts quickly wander to JSON, but that would call for something other than regex. So maybe let's restrict the scope to regex and provide presets, or just well-documented examples, for those two.

If you want to take a crack at adding it

With all the Rust-based projects I'm making feature requests to lately, it's starting to seem more and more likely that I'm going to have to start learning Rust sooner or later. But right now I have zero knowledge of it, so I'm afraid I won't be much help here. I'm definitely down to brainstorm on the overall design and help write documentation, though.

SilasMarvin commented 1 month ago

Not necessarily. In fact, I've empirically verified as much by modifying the user messages in the example chat mode prompt in the wiki so that they all contain Markdown code blocks, while leaving the assistant messages untouched. The result: no difference in the generated response than before, i.e. no triple backticks to parse out. So it seems a few in-context example pairs where the answer is 'bare' sufficiently conditions the final response to also be bare.

Got it, thanks for testing that!

So by starting the chat completion with a final prefilled assistant message, you get a markdown code block.

Pre-filling the assistant response is really cool. It's something I would like to support for Anthropic's API; I need to look into what support the other backends have for it.

... Markdown code blocks, the Claude XML tags, ... My thoughts quickly wonder to JSON, but that would call for something different than regex. So maybe let's restrict the scope to regex and provide presets, or just well-documented examples, of those two.

It might be worth introducing the idea of Extractors: post-processors that run over the LLM response and extract a specific part. We can start with two different types: JSON and RegEx.
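As a very rough sketch of how those two variants might look in config (all key and type names here are hypothetical):

{
    "post_process": {
        "type": "regex",
        "pattern": "(?s)<answer>(.*?)</answer>"
    }
}

and, for a model asked to reply with a JSON object:

{
    "post_process": {
        "type": "json",
        "field": "answer"
    }
}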

We have been discussing the idea of presets in the Discord. There are a few more features on the roadmap first (this one now included), and then I want to dial in on presets.

Also, I just shot you an email; I'd love to connect and talk more about these ideas. I have a few other ideas I'm looking for feedback from users on as well.