Qarj / fix-busted-json

Fix broken json using Python
MIT License

Doesn't work if LLM forgets the last } of JSON #5

Open ejkitchen opened 1 month ago

ejkitchen commented 1 month ago

This happens a lot with LLAMA3 70B:

from fix_busted_json import repair_json

invalid_json = "{ name: 'John' "

fixed_json = repair_json(invalid_json)  

Throws an exception: Exception has occurred: IndexError string index out of range

Qarj commented 4 weeks ago

Really? I've never seen that happen with llama3 70b. Are you using a quantised version? You say it often leaves off the closing "}"?

Also, the key should be "name" in double quotes, and the string should be double-quoted too: "John".

Are you instructing it to output valid JSON? It doesn't look like it is even trying to output valid JSON; there are so many errors in that, and it is a tiny object.

I suggest giving the model an example response with correctly formed JSON (be sure to use double quotes and so on, check it in a JSON validator), then ask it to give a valid JSON response.

Another thing to check out is that you can sometimes force it to output valid JSON depending on what options are available with the service you are using.

Qarj commented 4 weeks ago

As well as giving a full example in the prompt, you can end the prompt with something like this:

Response as well formed and properly escaped JSON object:

{
  "thought": "<mandatory>",
  "action": "<mandatory>",
  "actionInput": "<mandatory>"
}

Take a deep breath and work on this problem step-by-step.

Make sure your examples are cleanly presented and 100% valid; if they are sloppy, the model tries to do the same.
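One way to keep the example in the prompt valid is to generate it with `json.dumps` instead of typing it by hand. The following is only a sketch: `EXAMPLE_RESPONSE`, its field values, and `build_prompt` are hypothetical names, and the wording of the prompt is an assumption based on the template above.

```python
import json

# Hypothetical example payload; the keys mirror the template above.
EXAMPLE_RESPONSE = {
    "thought": "I should look up the weather first.",
    "action": "search",
    "actionInput": "weather in Paris",
}


def build_prompt(task: str) -> str:
    """Append a machine-generated, guaranteed-valid JSON example to a prompt."""
    example = json.dumps(EXAMPLE_RESPONSE, indent=2)
    # Round-trip once so a bad edit to EXAMPLE_RESPONSE fails loudly here.
    assert json.loads(example) == EXAMPLE_RESPONSE
    return (
        f"{task}\n\n"
        "Example of a valid response:\n\n"
        f"{example}\n\n"
        "Response as well formed and properly escaped JSON object:\n"
    )


prompt = build_prompt("Plan the next step.")
```

Because the example is serialised by the JSON library, it always uses double quotes and correct escaping.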

ejkitchen commented 4 weeks ago

Hi Qarj,

Thank you for your response! I should have been a bit clearer. I didn't want to overwhelm you with the prompt and data, but I have created a sample that shows some of the issues that you may or may not want to fix. The biggest one is that LLAMA3 or Mixtral every once in a while uses curly quotes in JSON instead of regular quotes, and more often than not it's the closing quote that is curly whilst the opening one is a standard one. I also notice that very specific inputs seem to trigger it (the full prompt, with request and data, contains no quotes whatsoever).

I am running my prompt against Groq's implementation of LLAMA3 70B and the smaller Mixtral, against Mixtral 8x22B direct from Mistral, and against a local setup of 8 x LLAMA3 on a DGX H100 where each GPU runs an instance of a quantized version of LLAMA3. My prompt is about 1k tokens and the data is about 200 tokens, so I am well within the context window length.

The prompts I am using include two-shot examples along with all of the other tricks people use to make the model behave most of the time. When I say it happens a lot, I mean I am sending 100k records one by one to all GPUs, and I would say that every 100 or so attempts I get the curly quotes or the last "}" is missing. So right now what I do is check for both of those cases. If you could fix the curly quotes, that would be great. As for the missing }, that may be riskier because there may be unintended consequences to doing that, especially if the model was not done for some reason and did an early stop. People may not want that.
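The two checks described above could look roughly like this. It is a minimal sketch, not the code actually used in this pipeline: the function names are hypothetical, and the brace check is deliberately crude (it only looks at the first and last non-blank characters and does not count nested braces).

```python
# The two curly double quotes mentioned above: U+201C and U+201D.
CURLY_DOUBLE_QUOTES = "\u201c\u201d"


def has_curly_quotes(text: str) -> bool:
    """Return True if the text contains a curly double quote anywhere."""
    return any(ch in text for ch in CURLY_DOUBLE_QUOTES)


def is_missing_final_brace(text: str) -> bool:
    """Return True if the text opens a JSON object but does not end with '}'."""
    stripped = text.strip()
    return stripped.startswith("{") and not stripped.endswith("}")
```

Records flagged by either check can then be retried or routed to a separate repair step instead of being parsed directly.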

Here is a quick tester I wrote to illustrate some of the issues. You will notice that fix_busted_json sometimes tolerates a missing quote here and there but it depends on where it is found (beginning or end).

import json
import logging
from typing import List
from fix_busted_json import repair_json

def setup_logging() -> None:
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(message)s",
        level=logging.INFO
    )

def try_json_repair(json_to_repair: str) -> bool:
    """
    Attempt to repair a JSON string and validate it.

    Args:
        json_to_repair (str): The JSON string to repair.

    Returns:
        bool: True if the JSON was successfully repaired and loaded, False otherwise.
    """
    try:
        repaired_json = repair_json(json_to_repair)
    except Exception as e:
        logging.error(f"fix_busted_json threw an exception during the repair process: {e}")
        return False

    logging.info("fix_busted_json did not throw an exception, now trying json.loads")

    try:
        json_data = json.loads(repaired_json)
        logging.info("JSON data from fix_busted_json loads. Test passed.")
        logging.info(f"Original JSON data: {json_to_repair}")
        logging.info(json.dumps(json_data, indent=2, ensure_ascii=False))
        return True
    except json.JSONDecodeError as e:
        logging.error(f"JSON decode error after repair attempt: {e}")
    except Exception as e:
        logging.error(f"Unexpected error in repaired_json after calling fix_busted_json: {e}")

    return False

def run_tests(test_cases: List[str]) -> None:
    """
    Run a series of JSON repair tests.

    Args:
        test_cases (List[str]): A list of JSON strings to test.
    """
    failures = 0

    for i, json_to_repair in enumerate(test_cases, start=1):
        if not try_json_repair(json_to_repair):
            failures += 1
            logging.info(f"Test case #{i} failed")
        else:
            logging.info(f"Test case #{i} passed")

        logging.info("=" * 40)

    total_tests = len(test_cases)
    logging.info(f"Total failures: {failures}")
    logging.info(f"Total passed: {total_tests - failures}")

def main() -> None:
    setup_logging()
    #if "”" in content or "“" in content:

    test_cases = [
        """
        {
            "name": "Alice",
            ”age“: 26,
        }
        """,
        """
        {
            "name": "Alice",
            "age”: 25,
        }
        """,
        """
        {
            "name": ”Alice,
            age: 26,
        }
        """,
        """
        {
            "name": "Alice",
            "age": 27        
        """,
        """
        {
            "name": Alice,
            "age": 28
        }
        """,
        """
        {
            'name': 'Alice",
            "age": 29
        }
        """,
        """
        {
            name: "Alice",
            "age: 30        
        }
        """,
        """
        {
            name: "Alice",
            'age: 31        
        }
        """,
        """
        {
            name: "Alice",
            age': 32       
        }
        """,
        """
        {
            "name": Alice,
            age: 33,
        }
        """,        
        """
        {
            'name': 'Alice,
            age: 34,
        }
        """,
        """
        {
            "name': "Alice",
            'age': 35       
        }
        """,
        """
        {
            "name": "Alice",
            "age": 36
        """,
        # ============================
        # The following tests all pass
        # ============================
        """
        {
            "name": "Alice",
            “age”: 49,
        }
        """,
        """
        {
            name: "Alice",
            age: 50,
        }
        """,
        """
        {
            "name": "Alice",
            'age': 51       
        }
        """,
        # This one passes the test however it is incorrect because it's a curly quote issue that somehow gets included in the actual JSON key
        """
        {
            "name": "Alice",
            age”: 52,
        }
        """,
        # This one passes the test however it is incorrect because it's a curly quote issue that somehow gets included in the actual JSON key
        """
        {
            "name": "Alice",
            ”age: 53,
        }
        """,
        """
        { name: 'John' 'age': 80, 'city': 'New' + ' York', }
        """
    ]

    run_tests(test_cases)

if __name__ == "__main__":
    main()

ejkitchen commented 4 weeks ago

I don't know if this helps, but I had GPT4o take a quick look.

Analysis of the Code and Where it Fails for Curly Quotes

The primary functions that handle quotes in the JsonParser class are get_quote, check_quote, eat_quoted_key, eat_string, and eat_char_or_escaped_char.

get_quote

The get_quote method identifies the type of quote at the current position:

def get_quote(self):
    if self.inspected[self.position] == "'":
        return "'"
    if self.inspected[self.position] == '"':
        return '"'
    if self.inspected[self.position] == '`':
        return '`'
    if self.inspected[self.position] == '“':
        return '”'
    if self.inspected[self.position] == '\\' and self.inspected[self.position + 1] == '"':
        return '\\"'
    if (
        self.inspected[self.position] == '\\' and
        self.inspected[self.position + 1] == '\\' and
        self.inspected[self.position + 2] == '"'
    ):
        return '\\\\"'
    return False

This method identifies single quotes ('), double quotes ("), backticks (`), and escaped double quotes, including double and triple escapes. It also handles opening curly quotes (“), but it does not handle closing curly quotes (”).

check_quote

The check_quote method checks if the current position matches the end quote:

def check_quote(self, quote):
    if len(quote) == 1:
        return self.inspected[self.position] == quote
    if len(quote) == 2:
        return (
            self.inspected[self.position] == quote[0] and
            self.inspected[self.position + 1] == quote[1]
        )
    if len(quote) == 3:
        return (
            self.inspected[self.position] == quote[0] and
            self.inspected[self.position + 1] == quote[1] and
            self.inspected[self.position + 2] == quote[2]
        )
    return False

The check_quote method handles multi-character quotes but doesn't explicitly manage curly quotes.
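As a side note, the three length-specific branches can be collapsed with `str.startswith`, which takes a start offset. The sketch below is a standalone function, not a method of the library's JsonParser, so its signature is an assumption made for illustration. Unlike the original, it also matches quotes longer than three characters.

```python
def check_quote(inspected: str, position: int, quote: str) -> bool:
    """Return True if `quote` occurs in `inspected` starting at `position`."""
    # str.startswith accepts a start offset and returns False when the offset
    # is at or past the end of the string, so this cannot raise IndexError.
    return bool(quote) and inspected.startswith(quote, position)
```

Curly quotes need no special handling here, because they are ordinary single-character strings in Python 3.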

Improvements to Handle Curly Quotes

To properly handle curly quotes, the get_quote and check_quote methods need to be updated. Additionally, the eat_char_or_escaped_char method should consider curly quotes.

Update get_quote to Handle Curly Quotes

Add support for closing curly quotes in get_quote:

def get_quote(self):
    if self.inspected[self.position] == "'":
        return "'"
    if self.inspected[self.position] == '"':
        return '"'
    if self.inspected[self.position] == '`':
        return '`'
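    if self.inspected[self.position] == '“':
        # Unchanged from the current code: an opening curly quote is closed by '”'.
        return '”'
    if self.inspected[self.position] == '”':
        # New: a key or string that opens with '”' is also closed by '”'.
        return '”'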
    if self.inspected[self.position] == '\\' and self.inspected[self.position + 1] == '"':
        return '\\"'
    if (
        self.inspected[self.position] == '\\' and
        self.inspected[self.position + 1] == '\\' and
        self.inspected[self.position + 2] == '"'
    ):
        return '\\\\"'
    return False

Update check_quote to Handle Curly Quotes

Ensure check_quote can verify curly quotes:

def check_quote(self, quote):
    if len(quote) == 1:
        return self.inspected[self.position] == quote
    if len(quote) == 2:
        return (
            self.inspected[self.position] == quote[0] and
            self.inspected[self.position + 1] == quote[1]
        )
    if len(quote) == 3:
        return (
            self.inspected[self.position] == quote[0] and
            self.inspected[self.position + 1] == quote[1] and
            self.inspected[self.position + 2] == quote[2]
        )
    return False

Update eat_char_or_escaped_char to Handle Curly Quotes

Modify eat_char_or_escaped_char to handle curly quotes:

def eat_char_or_escaped_char(self, quote):
    # Check bounds before the debug print, which indexes into self.inspected.
    if self.position >= len(self.inspected):
        raise JsonFixError('Unexpected end of quoted key or string')
    if self.debug:
        print('eat_char_or_escaped_char', self.position, self.inspected[self.position])
    if self.debug:
        print(
            'eatCharOrEscapedChar',
            self.position,
            self.inspected[self.position],
            ' ' + str(ord(self.inspected[self.position])),
        )
    if not self.check_quote(quote) and self.inspected[self.position] == '\\':
        if self.is_triple_escaped_double_quote():
            self.log('eatCharOrEscapedChar triple escaped double quote')
            self.position += 1
            self.position += 1
        if self.is_double_escaped_double_quote():
            self.log('eatCharOrEscapedChar double escaped double quote')
            self.position += 1
        if (quote == "'" or quote == '`' or quote == '“' or quote == '”') and self.inspected[self.position + 1] == quote:
            pass
        else:
            self.quoted += self.inspected[self.position]
        self.position += 1
    if (quote == "'" or quote == '`' or quote == '“' or quote == '”') and self.inspected[self.position] == '"':
        self.quoted += '\\'
    if self.inspected[self.position] == '\n':
        self.quoted += '\\n'
        self.log('eatCharOrEscapedChar unescaped newline')
    else:
        self.quoted += self.inspected[self.position]
    self.position += 1

Conclusion

By updating get_quote, check_quote, and eat_char_or_escaped_char to handle curly quotes, the library will be able to parse and repair JSON strings that use curly quotes. This will address the current issues with parsing curly quotes and ensure the JSON strings are correctly repaired and processed.

Qarj commented 4 weeks ago

Hi @ejkitchen

It is interesting to see those cases. These examples illustrate some of the dangers of repairing JSON. Most of the fixes I've put in place so far are pretty unambiguous, though a few ambiguous ones have crept in, like repairing when there is consistent use of single or curly quotes. The issue, though, is that a key name or value is allowed to start or end with a curly quote or whatever, so I decided that if it looked like a certain quote was intended then assume that is the quote, but if it isn't paired then assume it is part of the name or value.

So the question is: should the default behaviour be that quotes that appear at the start or end of a key name are assumed to be quotes no matter what? Or maybe I should introduce an extreme mode that attempts more dangerous repairs.

There are many scenarios and things can get complicated, just to give one example:

{  "'": 45 } 

Is the key name ' or is it the empty string?
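For reference, that input is already valid JSON, so a strict parser has no ambiguity: it reads the key as a single apostrophe. A short check with the standard library:

```python
import json

ambiguous = '{  "\'": 45 }'
parsed = json.loads(ambiguous)
print(list(parsed.keys()))  # prints ["'"]
```

This suggests one possible guard (an assumption, not something the library is stated to do): try a strict parse first and only attempt repairs when it fails, so already valid documents are never reinterpreted.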

Needs a bit of thought.

The gpt-4o suggestions look good, you could try it out and see how it goes.

Regarding the missing closing brace, maybe that can be handled as well, though it is perhaps a bit dangerous. I'm not sure, but if it only attempts a fix in extreme mode it might be OK.
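A rough sketch of what such an opt-in repair might look like follows. Nothing here is part of fix_busted_json: `close_open_brackets`, `repair`, and the `extreme` flag are hypothetical names. The sketch assumes the text is otherwise valid JSON that was cut off after a complete value, and it refuses to guess when the text ends inside a string.

```python
import json


def close_open_brackets(text: str) -> str:
    """Append any closing brackets that a truncated JSON text still needs."""
    closers = {"{": "}", "[": "]"}
    stack = []  # closers still owed, innermost last
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in closers:
            stack.append(closers[ch])
        elif ch in "}]" and stack and stack[-1] == ch:
            stack.pop()
    if in_string:
        # Truncated inside a string: too ambiguous for this sketch to guess.
        raise ValueError("input ends inside a string")
    return text.rstrip() + "".join(reversed(stack))


def repair(text: str, extreme: bool = False) -> str:
    """Only attempt the risky fix when the caller explicitly opts in."""
    return close_open_brackets(text) if extreme else text


fixed = repair('{"name": "Alice", "age": 27', extreme=True)
print(json.loads(fixed))
```

Keeping the behaviour behind a flag means callers who worry about early stops, as raised earlier in the thread, see no change by default.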

Qarj commented 4 weeks ago

Hi again @ejkitchen

I'm just thinking for your particular use case you might be able to do some quick and dirty preprocessing that will deal with some of the issues you face:

then in post processing, look at each key name, if it begins with or ends with some kind of quote character, remove it
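The post-processing step could be sketched as below. The function name and the exact set of quote characters are assumptions made for illustration. The sketch runs on data that has already been parsed, and walks nested objects and lists.

```python
# Characters treated as stray quotes: ", ', `, and the two curly double quotes.
QUOTE_CHARS = "\"'`\u201c\u201d"


def strip_quotes_from_keys(value):
    """Recursively strip quote characters from both ends of every dict key."""
    if isinstance(value, dict):
        return {
            key.strip(QUOTE_CHARS): strip_quotes_from_keys(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [strip_quotes_from_keys(item) for item in value]
    return value


cleaned = strip_quotes_from_keys({"name": "Alice", "age\u201d": 52})
print(cleaned)
```

Note that this is lossy by design: a key made up only of quote characters, such as the `{ "'": 45 }` example above, becomes the empty string, and two keys that differ only in stray quotes collapse into one.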

Though I don't think { "name": Alice } is supported at all, is it?

just some thoughts...

Qarj commented 4 weeks ago

I've done some work on this on the JavaScript version of this package here: https://github.com/Qarj/log-parsed-json

Check the readme. Some cases still need to be covered like where the starting quote is present and the ending quote is missing entirely...

Qarj commented 4 weeks ago

Here is another idea: for the failed parses, prepare another prompt focused on just repairing the JSON ("Hey Llama3, here is some broken JSON...") and give it detailed instructions on what to look out for. Perhaps have multiple variations of this prompt and send the broken JSON to each in turn. Maybe try other models that specialise in code.

If all else fails, finally send it to gpt-4o, enabling the option that forces it to output valid JSON. Unless it is utter gibberish, gpt-4o will repair it. You should need to do this so rarely that it should be financially feasible, I'm guessing...
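The escalation described in the last two comments can be sketched as a chain of repair stages, tried from cheapest to most expensive. `parse_with_fallbacks` and the `Repairer` type are hypothetical names; in practice the stages might be `repair_json` followed by functions that send a repair prompt to a model, but those model calls are left out here because their APIs are not given in this thread.

```python
import json
from typing import Callable, Iterable

# A repair stage takes the broken text and returns a candidate fix.
Repairer = Callable[[str], str]


def parse_with_fallbacks(text: str, repairers: Iterable[Repairer]) -> object:
    """Try a strict parse first, then each repair stage in the order given."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    for repairer in repairers:
        try:
            return json.loads(repairer(text))
        except Exception:
            # A stage may raise or return text that still fails to parse;
            # either way, move on to the next, more expensive stage.
            continue
    raise ValueError("all repair stages failed")


def append_brace(text: str) -> str:
    """Toy stage for the demo: assume only the final '}' is missing."""
    return text.rstrip() + "}"


result = parse_with_fallbacks('{"name": "Alice", "age": 36', [append_brace])
print(result)
```

Raising at the end, not returning `None`, keeps a legitimate JSON `null` distinguishable from a failed repair.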