Softdev1 commented 6 months ago

Description

Improve reliability of the outputs to avoid excessive false positives.

brunneis commented 6 months ago

All the LLMs I've tried seem to produce hallucinations for reasons that remain unclear. I've tried various strategies to mitigate this, such as modifying prompts, simplifying contexts, and employing multi-step processes like providing step-by-step code explanations rather than presenting the code outright. Despite these efforts, there appears to be an inherent bias toward false positives when a specific issue is proposed for evaluation, affecting the ability to accurately interpret requests that deviate from conventional implementations.

WIP...

kesar commented 6 months ago

Are we using GPT-4 or GPT-3? If it's GPT-4, we could try other LLMs like the new Claude 3 or even llamacode. If it's GPT-3, we could try GPT-4 (even if it's way slower).

brunneis commented 6 months ago

GPT 3.5 by default; GPT 4 is very slow, and the issue also happens with it. Trying to get access to Claude 3.

Royal-lobster commented 6 months ago

Found that we can use Claude 3 Opus via https://sdk.vercel.ai, though not as an API. It's probably possible to inspect the network tab and use its API for testing.

brunneis commented 6 months ago

I'm starting to see the light... GPT 3.5 is a lost battle for some cases. Got GPT 4 to work by massively minimizing the context and avoiding excessive details about the issue to analyze. It's very bizarre behavior.

Sample case: asking about unchecked external calls. The model completely ignores the IF that checks the return value of send() and still reports the call as unchecked.

function withdraw() public returns (bool) {
    uint amount = pendingReturns[msg.sender];
    if (amount > 0) {
        pendingReturns[msg.sender] = 0;

>>>>>>> if (!payable(msg.sender).send(amount)) {
            pendingReturns[msg.sender] = amount;
            return false;
        }
    }
    return true;
}
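A minimal sketch of what the "minimized context" prompt described above might look like, assuming a bare yes/no question is an acceptable reduced form. The helper names (`build_minimal_prompt`, `parse_verdict`) and the prompt wording are hypothetical and are not taken from the project's code.

```python
# Hypothetical helpers: the wording below is an assumption, not the project's prompt.

def build_minimal_prompt(code: str, issue: str) -> str:
    """Ask one narrow question, without any extra description of the issue."""
    return (
        f"Does this Solidity code contain the following issue: {issue}?\n"
        "Answer YES or NO on the first line, then give one sentence of justification.\n\n"
        f"{code.strip()}\n"
    )


def parse_verdict(reply: str) -> bool:
    """Return True only when the first non-empty line starts with YES."""
    for line in reply.splitlines():
        if line.strip():
            return line.strip().upper().startswith("YES")
    return False  # an empty reply is treated as "issue not reported"


# The sample from this thread, without the highlight marker.
WITHDRAW = """
function withdraw() public returns (bool) {
    uint amount = pendingReturns[msg.sender];
    if (amount > 0) {
        pendingReturns[msg.sender] = 0;
        if (!payable(msg.sender).send(amount)) {
            pendingReturns[msg.sender] = amount;
            return false;
        }
    }
    return true;
}
"""

prompt = build_minimal_prompt(WITHDRAW, "unchecked external call")
```

Since the return value of send() is checked here, a YES verdict on this prompt would count as a false positive.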

brunneis commented 6 months ago

❌ claude-3-opus (WTF)
✅ claude-3-sonnet
❌ claude-3-haiku
✅ gpt-4
❌ gpt-3.5

Will try with some coding-specific LLMs...

kesar commented 6 months ago

aweeesome

Royal-lobster commented 6 months ago

Wow claude-3-opus failing but sonnet succeeding is wild

brunneis commented 6 months ago

When increasing temperature it works well, though.
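To make comparisons like the one above repeatable, one option is to run the same prompt several times per model and temperature and record how often the issue gets reported. This is only a sketch: `ModelFn` is a hypothetical adapter signature, and wiring it to a real provider client is left out on purpose.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical adapter signature: (prompt, temperature) -> raw model reply.
ModelFn = Callable[[str, float], str]


def flag_rate(model: ModelFn, prompt: str, temperature: float, runs: int = 5) -> float:
    """Fraction of runs in which the reply starts with YES (issue reported)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    flagged = 0
    for _ in range(runs):
        reply = model(prompt, temperature).strip().upper()
        if reply.startswith("YES"):
            flagged += 1
    return flagged / runs


def compare(
    models: Dict[str, ModelFn],
    prompt: str,
    temperatures: Sequence[float] = (0.0, 0.7),
    runs: int = 5,
) -> Dict[Tuple[str, float], float]:
    """Map (model name, temperature) to the rate at which the issue is reported."""
    results: Dict[Tuple[str, float], float] = {}
    for name, model in models.items():
        for temperature in temperatures:
            results[(name, temperature)] = flag_rate(model, prompt, temperature, runs)
    return results
```

For a sample known to be safe, such as the withdraw() function earlier in this thread, any rate above zero is a false-positive rate, so the effect of temperature can be read directly from the table.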

brunneis commented 6 months ago

Works better now with Sonnet, Opus and GPT-4.
Promising results with Mistral 7B Instruct v0.2 and Haiku by removing hallucination triggers (asking for severity seems to bias some models to detect the issue).
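If asking for severity biases detection, one way to keep severity without the trigger is to split the request: ask only whether the issue is present, and ask for severity in a second call only when it is. The sketch below assumes that two-step flow, which is not confirmed as the approach used here; `Ask` and `detect_then_rate` are hypothetical names.

```python
from typing import Callable, Optional, Tuple

# Hypothetical adapter: prompt -> raw model reply.
Ask = Callable[[str], str]


def detect_then_rate(ask: Ask, code: str, issue: str) -> Tuple[bool, Optional[str]]:
    """Detect first; only ask about severity after a positive detection."""
    # Step 1: the detection prompt does not mention severity at all.
    detection = ask(
        f"Does this code contain the issue '{issue}'? Answer YES or NO.\n\n{code}"
    )
    if not detection.strip().upper().startswith("YES"):
        return (False, None)

    # Step 2: severity is requested in a separate call.
    severity = ask(
        f"The code below contains the issue '{issue}'. "
        "Rate its severity as low, medium, or high.\n\n" + code
    )
    return (True, severity.strip().lower())
```

The cost is one extra call per positive detection; negative cases need only a single call.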

kesar commented 6 months ago

Is the demo URL working with that, so we can test it out a bit?

brunneis commented 6 months ago

Yes, deployed with Claude 3 Sonnet right now.

EveripediaNetwork / issues

Improve LLM Issue-Detection Hallucinations #2341