ScrapeGraphAI / Scrapegraph-ai

Python scraper based on AI
https://scrapegraphai.com
MIT License
15.87k stars, 1.29k forks

Scrapegraphai not able to parse json output #782

Closed madguy02 closed 2 weeks ago

madguy02 commented 2 weeks ago

Describe the bug: Whenever we try to scrape a website using scrapegraphai, parsing the JSON output fails.

To Reproduce: I am running the example code given at https://github.com/ScrapeGraphAI/Scrapegraph-ai, but I still hit this issue:

--- Executing Fetch Node ---
--- (Fetching HTML from: https://scrapegraphai.com/) ---
--- Executing ParseNode Node ---
--- Executing GenerateAnswer Node ---
Traceback (most recent call last):
  File "/home/manishkakoti/Desktop/projects/learning-AI/venv/lib64/python3.11/site-packages/langchain_core/output_parsers/json.py", line 83, in parse_result
    return parse_json_markdown(text)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/manishkakoti/Desktop/projects/learning-AI/venv/lib64/python3.11/site-packages/langchain_core/utils/json.py", line 144, in parse_json_markdown
    return _parse_json(json_str, parser=parser)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/manishkakoti/Desktop/projects/learning-AI/venv/lib64/python3.11/site-packages/langchain_core/utils/json.py", line 160, in _parse_json
    return parser(json_str)
           ^^^^^^^^^^^^^^^^
  File "/home/manishkakoti/Desktop/projects/learning-AI/venv/lib64/python3.11/site-packages/langchain_core/utils/json.py", line 118, in parse_partial_json
    return json.loads(s, strict=strict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/json/__init__.py", line 359, in loads
    return cls(**kw).decode(s)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/manishkakoti/Desktop/projects/learning-AI/scrapegraph.py", line 36, in <module>
    scrapergraph()
  File "/home/manishkakoti/Desktop/projects/learning-AI/scrapegraph.py", line 32, in scrapergraph
    result = smart_scraper_graph.run()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/manishkakoti/Desktop/projects/learning-AI/venv/lib64/python3.11/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 212, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/manishkakoti/Desktop/projects/learning-AI/venv/lib64/python3.11/site-packages/scrapegraphai/graphs/base_graph.py", line 327, in execute
    return self._execute_standard(initial_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/manishkakoti/Desktop/projects/learning-AI/venv/lib64/python3.11/site-packages/scrapegraphai/graphs/base_graph.py", line 274, in _execute_standard
    raise e
  File "/home/manishkakoti/Desktop/projects/learning-AI/venv/lib64/python3.11/site-packages/scrapegraphai/graphs/base_graph.py", line 247, in _execute_standard
    result, node_exec_time, cb_data = self._execute_node(
                                      ^^^^^^^^^^^^^^^^^^^
  File "/home/manishkakoti/Desktop/projects/learning-AI/venv/lib64/python3.11/site-packages/scrapegraphai/graphs/base_graph.py", line 172, in _execute_node
    result = current_node.execute(state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/manishkakoti/Desktop/projects/learning-AI/venv/lib64/python3.11/site-packages/scrapegraphai/nodes/generate_answer_node.py", line 129, in execute
    answer = output_parser.parse(raw_response)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/manishkakoti/Desktop/projects/learning-AI/venv/lib64/python3.11/site-packages/langchain_core/output_parsers/json.py", line 97, in parse
    return self.parse_result([Generation(text=text)])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/manishkakoti/Desktop/projects/learning-AI/venv/lib64/python3.11/site-packages/langchain_core/output_parsers/json.py", line 86, in parse_result
    raise OutputParserException(msg, llm_output=text) from e
langchain_core.exceptions.OutputParserException: Invalid json output: content='{\n  "company_name": "ScrapeGraphAI",\n  "description": "ScrapeGraphAI is our popular open-source library designed for efficient web data extraction. It enables developers to scrape and extract structured information from websites using advanced AI techniques. Due to its success and high demand, we are building a SaaS platform around it to make this powerful tool even more accessible to non-developers.",\n  "contact_email": "contact@scrapegraphai.com"\n}' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 96, 'prompt_tokens': 2430, 'total_tokens': 2526, 'completion_tokens_details': {'audio_tokens': None, 'reasoning_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': None, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_0ba0d124f1', 'finish_reason': 'stop', 'logprobs': None} id='run-42298cd8-4041-4fd7-a92c-62f837ee373c-0' usage_metadata={'input_tokens': 2430, 'output_tokens': 96, 'total_tokens': 2526, 'input_token_details': {'cache_read': 0}, 'output_token_details': {'reasoning': 0}}
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/OUTPUT_PARSING_FAILURE

Any idea how this could be fixed? (I am using the latest scrapegraphai.)

Expected behavior: It should return the output without any errors.

Desktop (please complete the following information):

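import json
import os

from scrapegraphai.graphs import SmartScraperGraph

# Not shown in the original report; assumed here to come from the environment.
openai_api_key = os.getenv("OPENAI_API_KEY")


def scrapergraph():
    graph_config = {
        "llm": {
            "api_key": openai_api_key,
            "model": "openai/gpt-4o-mini",
            "temperature": 0,
        },
        "verbose": True,
        "headless": False,
    }
    smart_scraper_graph = SmartScraperGraph(
        prompt="Find some information about what does the company do, the name and a contact email.",
        source="https://scrapegraphai.com/",
        config=graph_config,
    )
    # This is the call that raises OutputParserException in the traceback above.
    result = smart_scraper_graph.run()
    print(json.dumps(result, indent=4))


scrapergraph()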
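
A note on the traceback above: the text handed to the JSON parser starts with content='{...}' additional_kwargs=..., which looks like the string form of a whole chat message rather than the message content alone. That would explain why decoding fails at the very first character even though the embedded JSON is valid. The sketch below reproduces only that effect with the standard json module. It is not the library's code or its fix; the payload is abbreviated from the error message, and try_parse is a hypothetical helper introduced for illustration.

```python
import json

# Payload abbreviated from the error message; the real one has more fields.
payload = '{"company_name": "ScrapeGraphAI", "contact_email": "contact@scrapegraphai.com"}'

# What the parser appears to have received: the message's string form, not its content.
stringified_message = f"content='{payload}' additional_kwargs={{'refusal': None}}"


def try_parse(text):
    """Return (parsed_value, None) on success or (None, error_message) on failure."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, str(exc)


# The wrapper text is not JSON, so decoding fails at the first character.
_, error = try_parse(stringified_message)
print(error)  # Expecting value: line 1 column 1 (char 0)

# The content on its own parses cleanly.
parsed, _ = try_parse(payload)
print(parsed["company_name"])  # ScrapeGraphAI
```

If this reading is right, the model's answer was fine and the failure sits between the model call and the parser, which is consistent with the issue going away after a library update rather than a prompt change.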

madguy02 commented 2 weeks ago

@VinciGit00 can you help me with this? Is this because of the OS I am using (Fedora)? I am not able to install Playwright, as it is not officially supported there.

VinciGit00 commented 2 weeks ago

Which version is this?

madguy02 commented 2 weeks ago

I am using the latest version, scrapegraphai 1.28.0.

aziz-ullah-khan commented 2 weeks ago

@VinciGit00, facing the same issue.

aziz-ullah-khan commented 2 weeks ago

@madguy02 @VinciGit00 , resolved this issue.

madguy02 commented 2 weeks ago

@aziz-ullah-khan how were you able to resolve it?

aziz-ullah-khan commented 2 weeks ago

Yes

itsmrhem commented 2 weeks ago

Facing the same issue. @aziz-ullah-khan what worked?

VinciGit00 commented 2 weeks ago

Please update to the new beta.

madguy02 commented 2 weeks ago

Thanks a lot @VinciGit00, I was able to resolve this issue with the release from https://github.com/ScrapeGraphAI/Scrapegraph-ai/releases. Please use the latest beta (beta 4); it has the bug fixed.

itsmrhem commented 2 weeks ago

Thanks a ton! Tried with 1.28.0-beta.4 and it works.
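
Since the thread turned on which version was installed, a quick way to confirm what the active environment actually has is sketched below. It uses only the standard library. installed_version is a hypothetical helper, and the two distribution names are taken from the paths in the traceback, so adjust them if your environment differs.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Distribution names assumed from the site-packages paths in the traceback.
for name in ("scrapegraphai", "langchain-core"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Run it with the same interpreter that runs the scraper (for example the virtual environment's python). Pre-release versions are usually reported in normalized form, so a release tagged 1.28.0-beta.4 would typically show up as 1.28.0b4.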