ScrapeGraphAI / Scrapegraph-ai

Python scraper based on AI
https://scrapegraphai.com
MIT License

Getting LangChain Output Parser Exception (Invalid JSON output) #324

Closed: nashugame closed this issue 5 months ago

nashugame commented 5 months ago

Describe the bug: I am using the SmartScraper graph to scrape data from a website. It gives me an "Invalid JSON output" error.

To Reproduce: This is my graph_config; for the rest of the code I am following the tutorial. I am using the latest release of ScrapeGraphAI.
Website source: https://www.sortlist.com/
Prompt: Give me a summary of top 10 advertising agencies

graph_config = {
    "llm": {
        "model": "groq/llama3-8b-8192",
        "api_key": groq_key,
        "temperature": 0
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": base_url,  # set Ollama URL
    },
    "headless": False
}

Screenshots

Note that the output is a JSON object with a single property `links` which is an array of URLs.
Traceback (most recent call last):
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/output_parsers/json.py", line 66, in parse_result
    return parse_json_markdown(text)
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/utils/json.py", line 147, in parse_json_markdown
    return _parse_json(json_str, parser=parser)
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/utils/json.py", line 160, in _parse_json
    return parser(json_str)
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/utils/json.py", line 120, in parse_partial_json
    return json.loads(s, strict=strict)
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/json/__init__.py", line 359, in loads
    return cls(**kw).decode(s)
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 14 column 5 (char 519)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/chainlit/utils.py", line 40, in wrapper
    return await user_function(**params_values)
  File "/Users/satyamkumar/development/pocs/python/webscraper-scrapegraph/test.py", line 64, in main
    result = json.loads(user_scrapper_graph.run())
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 118, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/scrapegraphai/graphs/base_graph.py", line 171, in execute
    return self._execute_standard(initial_state)
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/scrapegraphai/graphs/base_graph.py", line 110, in _execute_standard
    result = current_node.execute(state)
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/scrapegraphai/nodes/generate_answer_node.py", line 124, in execute
    answer = map_chain.invoke({"question": user_prompt})
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/runnables/base.py", line 3142, in invoke
    output = {key: future.result() for key, future in zip(steps, futures)}
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/runnables/base.py", line 3142, in <dictcomp>
    output = {key: future.result() for key, future in zip(steps, futures)}
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/runnables/base.py", line 2499, in invoke
    input = step.invoke(
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/output_parsers/base.py", line 169, in invoke
    return self._call_with_config(
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/runnables/base.py", line 1626, in _call_with_config
    context.run(
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/runnables/config.py", line 347, in call_func_with_variable_args
    return func(input, **kwargs)  # type: ignore[call-arg]
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/output_parsers/base.py", line 170, in <lambda>
    lambda inner_input: self.parse_result(
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/output_parsers/json.py", line 69, in parse_result
    raise OutputParserException(msg, llm_output=text) from e
langchain_core.exceptions.OutputParserException: Invalid json output: Here is the JSON output:


VinciGit00 commented 5 months ago

Please try this configuration:

graph_config = {
    "llm": {
        "model": "groq/llama3-8b-8192",
        "api_key": groq_key,
        "temperature": 0,
        "format": "json"
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": base_url,  # set Ollama URL
    },
    "headless": False
}

nashugame commented 5 months ago

Hi @VinciGit00, I am still getting the same error with your suggested configuration. I am attaching the logs for your reference.

2024-06-02 17:52:11 - Loaded .env file
2024-06-02 17:52:14 - Your app is available at http://localhost:8000
2024-06-02 17:52:16 - Translated markdown file for en-US not found. Defaulting to chainlit.md.
2024-06-02 17:55:22 - 1 change detected
2024-06-02 17:55:22 - File modified: main.py. Reloading app...
2024-06-02 17:55:24 - Translated markdown file for en-US not found. Defaulting to chainlit.md.
Give me a summary of top 10 advertising agencies
https://www.sortlist.com/
2024-06-02 17:56:12 - Starting scraping...
2024-06-02 17:56:18 - Content scraped
2024-06-02 17:56:27 - Loading faiss.
2024-06-02 17:56:27 - Successfully loaded faiss.
2024-06-02 17:56:37 - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2024-06-02 17:56:38 - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2024-06-02 17:56:39 - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2024-06-02 17:56:39 - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2024-06-02 17:56:39 - Invalid json output: Here is the JSON output:

{
  "data": [
    {
      "url": "https://www.sortlist.com/recording",
      "category": "recording"
    },
    {
      "url": "https://www.sortlist.com/audio-mastering",
      "category": "audio-mastering"
    },
    {
      "url": "https://www.sortlist.com/design",
      "category": "design"
    },
    ...
  ]
}

Note that I've only included the first few items in the list. If you'd like me to continue processing the rest of the list, please let me know!
Traceback (most recent call last):
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/output_parsers/json.py", line 66, in parse_result
    return parse_json_markdown(text)
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/utils/json.py", line 147, in parse_json_markdown
    return _parse_json(json_str, parser=parser)
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/utils/json.py", line 160, in _parse_json
    return parser(json_str)
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/utils/json.py", line 120, in parse_partial_json
    return json.loads(s, strict=strict)
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/json/__init__.py", line 359, in loads
    return cls(**kw).decode(s)
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 15 column 5 (char 306)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/chainlit/utils.py", line 40, in wrapper
    return await user_function(**params_values)
  File "/Users/satyamkumar/development/pocs/python/webscraper-scrapegraph/test.py", line 64, in main
    result = user_scrapper_graph.run()
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 118, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/scrapegraphai/graphs/base_graph.py", line 171, in execute
    return self._execute_standard(initial_state)
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/scrapegraphai/graphs/base_graph.py", line 110, in _execute_standard
    result = current_node.execute(state)
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/scrapegraphai/nodes/generate_answer_node.py", line 124, in execute
    answer = map_chain.invoke({"question": user_prompt})
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/runnables/base.py", line 3142, in invoke
    output = {key: future.result() for key, future in zip(steps, futures)}
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/runnables/base.py", line 3142, in <dictcomp>
    output = {key: future.result() for key, future in zip(steps, futures)}
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/runnables/base.py", line 2499, in invoke
    input = step.invoke(
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/output_parsers/base.py", line 169, in invoke
    return self._call_with_config(
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/runnables/base.py", line 1626, in _call_with_config
    context.run(
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/runnables/config.py", line 347, in call_func_with_variable_args
    return func(input, **kwargs)  # type: ignore[call-arg]
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/output_parsers/base.py", line 170, in <lambda>
    lambda inner_input: self.parse_result(
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/langchain_core/output_parsers/json.py", line 69, in parse_result
    raise OutputParserException(msg, llm_output=text) from e
langchain_core.exceptions.OutputParserException: Invalid json output: Here is the JSON output:

{
  "data": [
    {
      "url": "https://www.sortlist.com/recording",
      "category": "recording"
    },
    {
      "url": "https://www.sortlist.com/audio-mastering",
      "category": "audio-mastering"
    },
    {
      "url": "https://www.sortlist.com/design",
      "category": "design"
    },
    ...
  ]
}

Note that I've only included the first few items in the list. If you'd like me to continue processing the rest of the list, please let me know!

f-aguzzi commented 5 months ago

This happens all the time. The LLM is producing invalid JSON because it adds phrases around the JSON and/or ellipses inside it. It's a recurring issue when working with LLMs, especially with smaller models like the llama3-8b you're using. There's not much that can be done.

Let's take a look at the output from your log.

Here is the JSON output:

{
  "data": [
    {
      "url": "https://www.sortlist.com/recording",
      "category": "recording"
    },
    {
      "url": "https://www.sortlist.com/audio-mastering",
      "category": "audio-mastering"
    },
    {
      "url": "https://www.sortlist.com/design",
      "category": "design"
    },
    ...
  ]
}

It literally wrote "Here is the JSON output:" before the JSON, and added an ellipsis after the last element. You can see something even worse at the end of the output, too, where it wrote "Note that I've only included the first few items in the list. If you'd like me to continue processing the rest of the list, please let me know!" This model was clearly trained to be a chatbot and it can't resist the temptation to talk too much, even though the system prompt provided by ScrapeGraph is very clear about outputting only the JSON.

Sometimes you can work around the problem by giving a less imperative, more descriptive prompt, but it's not guaranteed. In your case, "Summary of top 10 advertising agencies" instead of "Give me a summary of top 10 advertising agencies" might do the trick. If this doesn't work either, you might have to use a different LLM.
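
To illustrate the failure mode described above, here is a minimal standard-library sketch, not part of ScrapeGraphAI or LangChain, that tries to pull a JSON object out of a chatty reply. The function name `extract_first_json_object` and the sample strings are made up for illustration. It recovers the object when the model only adds text before or after it, but it cannot repair a reply where the model put "..." inside the JSON itself.

```python
import json
from typing import Any, Optional


def extract_first_json_object(reply: str) -> Optional[Any]:
    """Parse the JSON value that starts at the first '{' in `reply`.

    Returns None when there is no '{' or when the text from that brace
    onwards is not valid JSON (for example, it contains a literal "...").
    Assumption: the surrounding chatter contains no '{' before the JSON.
    """
    start = reply.find("{")
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value and ignores any text after it.
        value, _end = json.JSONDecoder().raw_decode(reply, start)
    except json.JSONDecodeError:
        return None
    return value


# Hypothetical replies modelled on the log above.
chatty = (
    "Here is the JSON output:\n\n"
    '{"data": [{"category": "design"}]}\n\n'
    "Note that I have only included a few items."
)
truncated = 'Here is the JSON output:\n\n{"data": [{"category": "design"}, ...]}'

print(extract_first_json_object(chatty))
print(extract_first_json_object(truncated))
```

The first call yields the parsed object and the second yields None, which is the case where retrying with a different prompt or model is the only option.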

VinciGit00 commented 5 months ago

Hi, please try with the new beta

PeriniM commented 5 months ago

Hey @nashugame, I created a new issue, #332, from this discussion, about using Pydantic schema validation. The result will still depend on the size of the model, but feel free to contribute!
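
Issue #332 is not shown in this thread, so the following is only a sketch of what schema validation means here, not the approach taken there. It uses the standard library instead of Pydantic so that it runs without extra packages. The names `CategoryLink` and `validate_links` are hypothetical, and the expected shape is taken from the `data` / `url` / `category` output in the log above.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class CategoryLink:
    """One entry of the "data" list seen in the log output."""
    url: str
    category: str


def validate_links(payload: Any) -> List[CategoryLink]:
    """Check that payload looks like {"data": [{"url": str, "category": str}, ...]}.

    Raises ValueError with a short reason when the shape does not match.
    """
    if not isinstance(payload, dict) or not isinstance(payload.get("data"), list):
        raise ValueError('expected an object with a "data" list')
    links = []
    for index, item in enumerate(payload["data"]):
        if not isinstance(item, dict):
            raise ValueError(f"item {index} is not an object")
        url, category = item.get("url"), item.get("category")
        if not isinstance(url, str) or not isinstance(category, str):
            raise ValueError(f"item {index} needs string 'url' and 'category' fields")
        links.append(CategoryLink(url=url, category=category))
    return links


parsed = {"data": [{"url": "https://www.sortlist.com/design", "category": "design"}]}
print(validate_links(parsed))
```

Validation of this kind only catches a wrong shape after parsing succeeds. It does not make a small model stop adding text around the JSON.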

nashugame commented 5 months ago

Hi @VinciGit00, I am getting this with the new beta:

2024-06-03 16:22:15 - "Groq" object has no field "format"
Traceback (most recent call last):
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/chainlit/utils.py", line 40, in wrapper
    return await user_function(**params_values)
  File "/Users/satyamkumar/development/pocs/python/webscraper-scrapegraph/main.py", line 47, in on_chat_start
    smart_scraper_graph = SmartScraperGraph(
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 52, in __init__
    super().__init__(prompt, config, source, schema)
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/scrapegraphai/graphs/abstract_graph.py", line 81, in __init__
    self.graph = self._create_graph()
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 85, in _create_graph
    generate_answer_node = GenerateAnswerNode(
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/scrapegraphai/nodes/generate_answer_node.py", line 48, in __init__
    self.llm_model.format="json"
  File "/opt/miniconda3/envs/source-x-ai/lib/python3.10/site-packages/pydantic/v1/main.py", line 357, in __setattr__
    raise ValueError(f'"{self.__class__.__name__}" object has no field "{name}"')
ValueError: "Groq" object has no field "format"

VinciGit00 commented 5 months ago

Hi, the main problem is the model you are using. Please use another one, maybe with Ollama.
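
For context on the `"Groq" object has no field "format"` traceback above: the error is raised when `format` is assigned on a model object that rejects undeclared attributes. The sketch below reproduces that behaviour with a hypothetical stand-in class, `StrictModel`, and shows one way a caller could guard the assignment. It is not the fix applied in ScrapeGraphAI, and `try_set_json_format` is an illustrative name only.

```python
class StrictModel:
    """Hypothetical stand-in that rejects undeclared attributes."""

    _fields = ("model", "temperature")

    def __init__(self, model: str, temperature: float = 0.0) -> None:
        object.__setattr__(self, "model", model)
        object.__setattr__(self, "temperature", temperature)

    def __setattr__(self, name: str, value: object) -> None:
        # Mirror the error message seen in the traceback.
        if name not in self._fields:
            raise ValueError(f'"{type(self).__name__}" object has no field "{name}"')
        object.__setattr__(self, name, value)


def try_set_json_format(llm: object) -> bool:
    """Set llm.format = "json" if the object accepts it; report whether it did."""
    try:
        llm.format = "json"
    except (ValueError, AttributeError):
        return False
    return True


strict = StrictModel("groq/llama3-8b-8192")
print(try_set_json_format(strict))
```

With a guard like this, a model class that does not declare a `format` field would simply be left unchanged instead of aborting graph construction.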