Error in document output structure

hitsense commented 1 year ago

Hi @eyurtsev, I think there is something wrong with the structure of the document extraction output. The values for some keys seem to be mixed up with those of other keys. Check the attached screenshot.

Here, the value for employment period is mixed with company location. Similarly, the value for employment period should actually go to skills at job. I think the value assignment should go one step down.

eyurtsev commented 1 year ago

Hi! This is likely the LLM being unable to handle the complexity of the task.

Things to try:

1) Add examples
2) Improve the descriptions in the schema
3) Specify an input formatter of triple_quotes if working with multi-paragraph inputs
4) Try a better model if you're not doing so already (gpt-4, text-davinci-003)
5) Break the schema into a few smaller schemas, run extractions and combine results
6) If possible to flatten the object, you can use CSV encoding which will improve results
7) Add verification / correction steps (ask an LLM to correct or verify the results of the extraction)
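
The "break the schema into smaller schemas" and "add verification" suggestions can be sketched roughly as below. This is a minimal, library-independent sketch and not kor's API: `extract_in_parts`, `missing_fields`, and the stand-in `fake_extract` are hypothetical names made up for illustration, and a real extractor would call an LLM instead of looking up a fixed dict.

```python
from typing import Callable, Dict, List

# Hypothetical interface: an extractor takes (text, field_names) and returns a dict.
Extractor = Callable[[str, List[str]], Dict[str, object]]


def extract_in_parts(
    text: str, field_groups: List[List[str]], extract: Extractor
) -> Dict[str, object]:
    """Run one extraction per small group of fields and merge the results."""
    merged: Dict[str, object] = {}
    for fields in field_groups:
        partial = extract(text, fields)
        # Keep only the fields requested in this call, so a stray key from
        # one extraction cannot overwrite a value from another group.
        for name in fields:
            if name in partial:
                merged[name] = partial[name]
    return merged


def missing_fields(
    result: Dict[str, object], field_groups: List[List[str]]
) -> List[str]:
    """Cheap verification step: list requested fields that are absent or empty."""
    return [name for fields in field_groups for name in fields if not result.get(name)]


# Stand-in extractor for demonstration only (assumption, not a real model call).
def fake_extract(text: str, fields: List[str]) -> Dict[str, object]:
    known = {"company": "Acme", "location": "Berlin", "period": "2019-2021"}
    return {f: known[f] for f in fields if f in known}


groups = [["company", "location"], ["period", "skills"]]
result = extract_in_parts("resume text goes here", groups, fake_extract)
to_recheck = missing_fields(result, groups)
```

Fields reported by `missing_fields` are candidates for a second, more targeted extraction or for human review.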

Feel free to share your schema if you want me to take a look at it.

If you're extracting information from a single structured source (e.g., LinkedIn), an LLM is not a good approach -- traditional web scraping would be a lot cheaper and much more reliable.

If perfect quality is needed, then even with all the hacks above, you'll need to plan on having a human in the loop as even the best LLMs will make mistakes with complex extraction tasks.

hitsense commented 1 year ago

Update:

After changing

- the Field type (in input examples) from str to List[str]
- the encoder class from csv to json

it was able to assign the values to the correct keys.
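
The thread does not say why switching from csv to json helped. One plausible explanation (an assumption, not a confirmed diagnosis) is that positional CSV output shifts every later value by one column when a field contains an unquoted comma or several values, whereas JSON names each key explicitly. The sketch below illustrates that failure mode with the Python standard library only; the field names and sample strings are made up.

```python
import csv
import io
import json

header = ["company", "location", "period", "skills"]

# Hypothetical model output in CSV form: the location contains an unquoted comma.
csv_text = "Acme,Berlin, Germany,2019-2021,Python"
row = next(csv.reader(io.StringIO(csv_text)))

# Values are matched to keys purely by position, so the extra comma
# pushes every later value one column to the right.
csv_record = dict(zip(header, row))

# The same information as JSON: each value is attached to its key by name,
# and a multi-valued field can be a list.
json_text = (
    '{"company": "Acme", "location": "Berlin, Germany", '
    '"period": "2019-2021", "skills": ["Python"]}'
)
json_record = json.loads(json_text)

print(csv_record)
print(json_record)
```

In the CSV record, `period` receives a fragment of the location and `skills` receives the period, while the JSON record keeps every value under its own key.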

eyurtsev / kor

Error in document output structure #128