codecobblers / dirtyjson

Python JSON parser for reading JSON objects out of JS files

Ability to know how many characters were parsed #8

Open david-andrew opened 1 year ago

david-andrew commented 1 year ago

I'm wondering if it would be possible to add an attribute to the result for how many characters were parsed. My use case has me parsing a string input that contains random text interspersed with multiple JSON objects. So I want to do something like this:

from dirtyjson import loads

text = 'example text containing {"foo":0, "bar":1} multiple json objects {"bazz":2, "boo":3} possibly separated by random text [1,2,4,7] and other junk'

while len(text) > 0:
    # skip text ahead to the next object/array
    if not text.startswith('{') and not text.startswith('['):
        index = next((i for i, c in enumerate(text) if c in ('{', '[')), -1)
        if index == -1:
            break  # no more objects to eat
        text = text[index:]

    # parse the current object
    chunk = loads(text)

    # do something with the json object
    print(chunk)

    # strip the object out of the string
    characters_eaten = ...  # somehow get the number of characters used for the parse
    text = text[characters_eaten:]

But right now it's really not feasible to do this because there's no way to measure how many characters were eaten while parsing the current object. Technically it would be possible to use the row/column annotation of the last element in the object/list and then find the closing delimiter, but that's super cumbersome. Having the number of characters eaten exposed directly would be very useful.
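As an aside, for strict JSON the standard library already exposes exactly this count: `json.JSONDecoder.raw_decode` returns the parsed value together with the index one past the last character consumed. It won't accept the dirty JSON this library handles, but it shows what the requested API could look like. A minimal sketch (the helper name is illustrative):

```python
import json

def iter_json_chunks(text):
    """Yield (obj, end_index) for each JSON object/array embedded in text.

    raw_decode returns the parsed value plus the index just past the last
    character consumed -- the exact count this issue asks for. Strict JSON
    only, unlike dirtyjson.
    """
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # skip ahead to the next candidate object/array start
        starts = [i for i in (text.find('{', pos), text.find('[', pos)) if i != -1]
        if not starts:
            return
        pos = min(starts)
        try:
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            pos += 1  # not valid JSON here; keep scanning
            continue
        yield obj, end
        pos = end

text = 'junk {"foo": 0} more junk [1, 2, 3] end'
print([obj for obj, _ in iter_json_chunks(text)])
```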

scottkmaxwell commented 10 months ago

I'm not going to have time to work on this, but I'd happily entertain a PR if you want to take a crack at it. I'm not sure what the API would look like. It might be cleanest to introduce another function that processes the data and simply returns the length consumed rather than the payload. It would mean parsing twice, but that might not be a deal killer.
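The "parse twice" idea above could be sketched without touching the parser internals: trial-parse each prefix that ends at a matching closing delimiter and return the length of the first one that parses. `json.loads` stands in for `dirtyjson.loads` here so the sketch is self-contained; the helper name `consumed_length` is hypothetical, not part of this library.

```python
import json

def consumed_length(text):
    """Return how many characters at the start of text form a complete
    JSON object/array, or raise ValueError if none does.

    Trial-parses each prefix ending at a candidate closing delimiter.
    json.loads is a stand-in for dirtyjson.loads.
    """
    if not text or text[0] not in '{[':
        raise ValueError('text does not start with an object or array')
    closer = '}' if text[0] == '{' else ']'
    for i, c in enumerate(text):
        if c == closer:
            try:
                json.loads(text[:i + 1])
                return i + 1  # number of characters consumed
            except ValueError:
                continue  # closer belonged to a nested value; keep going
    raise ValueError('no complete JSON value at start of text')
```

This is quadratic in the worst case, but for the short embedded objects in the use case above the double parse is probably fine.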

david-andrew commented 10 months ago

Unfortunately, the approach I ended up going with probably won't work well for this library, since my case had extra context I was able to use: I was parsing markdown output, so I could look for opening "```json" and closing "```" fences. But yeah, it basically came down to scanning for possible starts and ends of the JSON object, and then trying to parse just that substring.

Suffice it to say I personally don't need this feature anymore. But if anyone else wants to try to extract some value from my work, here's what I ended up writing:

import dirtyjson
from typing import Generator

def json_block_iter(message: str) -> Generator[str | dict, None, None]:
    """
    Iterator to extract text and json objects from the LLM message.
    """
    message = message.lstrip()
    while len(message) > 0:
        try:
            i = message.index('```json')
        except ValueError:
            message = message.lstrip()
            if message:
                yield message
            return

        if i != 0:
            yield message[:i]
            message = message[i:]

        message = message[7:].lstrip()  # strip the opening ```json fence
        if not message.startswith('{') and not message.startswith('['):
            raise ValueError(f"Expected json block to start with {{ or [ but found {message}")

        #find candidate end indices
        delimiter = '}' if message.startswith('{') else ']'
        end_indices = [i for i, c in enumerate(message) if c == delimiter]

        #find the first end index that is valid json
        for end_index in end_indices:
            try:
                parsed_block = dirtyjson.loads(message[:end_index+1])
                break
            except ValueError:
                continue
        else:
            raise ValueError(f"Failed to parse json block: {message}")

        # yield the block if single block, or sequentially yield each item in the list of blocks
        if isinstance(parsed_block, list):
            for item in parsed_block:
                assert 'code' in item and 'start' in item and 'end' in item, f"INTERNAL ERROR: Expected json block to have keys 'code', 'start', and 'end', but found {parsed_block}"
                yield dict(item)
        elif isinstance(parsed_block, dict):
            assert 'code' in parsed_block and 'start' in parsed_block and 'end' in parsed_block, f"INTERNAL ERROR: Expected json block to have keys 'code', 'start', and 'end', but found {parsed_block}"
            yield dict(parsed_block)
        else:
            raise ValueError(f"INTERNAL ERROR: Expected json block to be a dict or list, but found {parsed_block}")

        #update message to be the remaining text
        message = message[end_index+1:].lstrip()
        assert message.startswith('```'), f"INTERNAL ERROR: Expected json block to end with ``` but found {message}"
        message = message[3:].lstrip()

This yields in sequence each of the non-json parts and each of the json parts. I think adapting it for this library might be tricky, since in general there isn't a good indicator for whether a section is a valid json object or not. Conceivably there could be text containing {, }, [, or ] that isn't json, and distinguishing those cases would be more complicated.
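One way to narrow down the candidates despite stray braces might be a string-aware depth scan: track nesting depth but skip over quoted strings, so a } or ] inside a string literal doesn't end the scan early. The index it returns is only a candidate; you'd still validate the substring with a real parse. A sketch under those assumptions (the helper name is made up):

```python
def match_delimiter(text, start=0):
    """Return the index of the delimiter matching the opener at text[start],
    tracking nesting depth and skipping over string literals so braces
    inside quoted text don't confuse the count. Returns -1 if unbalanced.
    """
    pairs = {'{': '}', '[': ']'}
    depth = 0
    in_string = False
    i = start
    while i < len(text):
        c = text[i]
        if in_string:
            if c == '\\':
                i += 2  # skip the escaped character
                continue
            if c == '"':
                in_string = False
        elif c == '"':
            in_string = True
        elif c in pairs:
            depth += 1
        elif c in ('}', ']'):
            depth -= 1
            if depth == 0:
                return i  # candidate end of the outermost value
        i += 1
    return -1

# a } inside a string no longer ends the scan early
end = match_delimiter('{"a": "}", "b": [1]}')
```

This still can't tell real JSON from brace-laden junk on its own, which is why a trial parse of text[start:end+1] would have to follow.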