ScrapeGraphAI / Scrapegraph-ai

Python scraper based on AI
https://scrapegraphai.com
MIT License

[1.14.0+] pydantic ValidationError with SmartScraperGraph #598

Closed bezineb5 closed 1 month ago

bezineb5 commented 2 months ago

Describe the bug Since version 1.14.0 (tested here on 1.15.0), SmartScraperGraph on OpenAI stopped working with a pydantic-related error:

/venv/lib/python3.12/site-packages/google_crc32c/__init__.py:29: RuntimeWarning: As the c extension couldn't be imported, `google-crc32c` is using a pure python implementation that is significantly slower. If possible, please configure a c build environment and compile the extension
  warnings.warn(_SLOW_CRC32C_WARNING, RuntimeWarning)
--- Executing Fetch Node ---
--- (Fetching HTML from: https://perinim.github.io/projects/) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
Traceback (most recent call last):
  File "/llm_scraper/test_bug.py", line 50, in <module>
    result = smart_scraper_graph.run()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 114, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 263, in execute
    return self._execute_standard(initial_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 185, in _execute_standard
    raise e
  File "/venv/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 169, in _execute_standard
    result = current_node.execute(state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/scrapegraphai/nodes/generate_answer_node.py", line 129, in execute
    answer = chain.invoke({"question": user_prompt})
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 2878, in invoke
    input = context.run(step.invoke, input, config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/langchain_core/output_parsers/base.py", line 192, in invoke
    return self._call_with_config(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 1785, in _call_with_config
    context.run(
  File "/venv/lib/python3.12/site-packages/langchain_core/runnables/config.py", line 397, in call_func_with_variable_args
    return func(input, **kwargs)  # type: ignore[call-arg]
           ^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/langchain_core/output_parsers/base.py", line 193, in <lambda>
    lambda inner_input: self.parse_result([Generation(text=inner_input)]),
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/langchain_core/load/serializable.py", line 113, in __init__
    super().__init__(*args, **kwargs)
  File "/venv/lib/python3.12/site-packages/pydantic/v1/main.py", line 341, in __init__
    raise validation_error
pydantic.v1.error_wrappers.ValidationError: 1 validation error for Generation
text
  str type expected (type=type_error.str)
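For context on the error itself: langchain's Generation model declares its text field as a string, so handing it anything else (for example the dict or pydantic model that a structured-output chain produces) fails validation at construction time. A minimal pure-Python stand-in (not the real langchain class) sketching that mechanism:

```python
class Generation:
    """Illustrative stand-in for langchain_core's Generation model.

    The real class is a pydantic v1 model with a `text: str` field;
    here we mimic the type check by hand.
    """
    def __init__(self, text):
        if not isinstance(text, str):
            # pydantic reports this as: str type expected (type=type_error.str)
            raise TypeError("str type expected (type=type_error.str)")
        self.text = text

ok = Generation(text='{"title": "Projects"}')  # a plain JSON string is fine

try:
    # a structured-output chain yields a dict/model instead of a string
    Generation(text={"title": "Projects"})
except TypeError as exc:
    print(exc)
```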

To Reproduce Run the example: https://github.com/ScrapeGraphAI/Scrapegraph-ai/blob/main/examples/openai/smart_scraper_schema_openai.py

I also tried to update to the latest langchain, but that didn't help. It could be langchain-related, as I can see similar issues (but not exactly the same) on their issues list. Or it's just a mix of pydantic versions.

Expected behavior It should succeed.


rjbks commented 1 month ago

@rjbks

Output from above with gemini 1.5 pro (openai 4o mini still causes the original error, where a string is expected but a pydantic model is received):

Thank you! Can you also share full output of gpt-4o-mini?

@LorenzoPaleari

Certainly:

--- Executing SearchInternet Node ---
Search Query: Centro Universitário Franciscano UNIFRA medical school information 2014-2018
--- Executing GraphIterator Node with batchsize 1 ---
Running graph instance for https://caper.ca/sites/default/files/pdf/CAPER_MedicalSchools_September_2022.xlsx
--- Executing Fetch Node ---
--- (Fetching HTML from: https://caper.ca/sites/default/files/pdf/CAPER_MedicalSchools_September_2022.xlsx) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
Running graph instance for https://ufn.academia.edu/ErickKaderCallegaro/CurriculumVitae
--- Executing Fetch Node ---
--- (Fetching HTML from: https://ufn.academia.edu/ErickKaderCallegaro/CurriculumVitae) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
Traceback (most recent call last):
  File "/Users/rb/PycharmProjects/med_device_crawler/scripts/search_med_schools.py", line 59, in <module>
    res = sg.run()
          ^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/search_graph.py", line 120, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 253, in execute
    return self._execute_standard(initial_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 175, in _execute_standard
    raise e
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 159, in _execute_standard
    result = current_node.execute(state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/nodes/graph_iterator_node.py", line 64, in execute
    state = self._async_execute(state, batchsize)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/nodes/graph_iterator_node.py", line 115, in _async_execute
    futures.append(graph.run())
                   ^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 114, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 253, in execute
    return self._execute_standard(initial_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 175, in _execute_standard
    raise e
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 159, in _execute_standard
    result = current_node.execute(state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/nodes/generate_answer_node.py", line 136, in execute
    answer = chain.invoke({"question": user_prompt})
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 2878, in invoke
    input = context.run(step.invoke, input, config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/base.py", line 192, in invoke
    return self._call_with_config(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 1785, in _call_with_config
    context.run(
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/runnables/config.py", line 398, in call_func_with_variable_args
    return func(input, **kwargs)  # type: ignore[call-arg]
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/base.py", line 193, in <lambda>
    lambda inner_input: self.parse_result([Generation(text=inner_input)]),
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/load/serializable.py", line 113, in __init__
    super().__init__(*args, **kwargs)
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/pydantic/v1/main.py", line 341, in __init__
    raise validation_error
pydantic.v1.error_wrappers.ValidationError: 1 validation error for Generation
text
  str type expected (type=type_error.str)

LorenzoPaleari commented 1 month ago

@rjbks @VinciGit00 Perfect! Thank you for the log!

OpenAI Error: It breaks on the second iteration, confirming my theory that the problem is in the LLM instantiation.

--- Executing SearchInternet Node ---
Search Query: Centro Universitário Franciscano UNIFRA medical school information 2014-2018

### First iteration started ###

--- Executing GraphIterator Node with batchsize 1 ---
Running graph instance for https://caper.ca/sites/default/files/pdf/CAPER_MedicalSchools_September_2022.xlsx
--- Executing Fetch Node ---
--- (Fetching HTML from: https://caper.ca/sites/default/files/pdf/CAPER_MedicalSchools_September_2022.xlsx) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---

### First iteration ended ###

### Second iteration started ###

Running graph instance for https://ufn.academia.edu/ErickKaderCallegaro/CurriculumVitae
--- Executing Fetch Node ---
--- (Fetching HTML from: https://ufn.academia.edu/ErickKaderCallegaro/CurriculumVitae) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
Traceback (most recent call last):
  File "/Users/rb/PycharmProjects/med_device_crawler/scripts/search_med_schools.py", line 59, in <module>
    res = sg.run()
          ^^^^^^^^
 [Not relevant]

In GenerateAnswerNode we have a piece of code that chooses the correct parser for the LLMs (it needs to be updated, since most models now support these methods, but that's not the issue here):

if self.node_config.get("schema", None) is not None:
    print("Schema is not None")
    if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI)):
        self.llm_model = self.llm_model.with_structured_output(
            schema=self.node_config["schema"],
            method="function_calling")  # json schema works only on specific models

        # default parser to empty function
        def output_parser(x):
            return x
        if is_basemodel_subclass(self.node_config["schema"]):
            print("Schema is a pydantic model")
            output_parser = dict
        format_instructions = "NA"
    else:
        print("llm is not Openai")
        output_parser = JsonOutputParser(pydantic_object=self.node_config["schema"])
        format_instructions = output_parser.get_format_instructions()
else:
    output_parser = JsonOutputParser()
    format_instructions = output_parser.get_format_instructions()

(I added the print statements.)

What happens here is: on the first iteration of SmartScraperGraph the LLM is correctly initialised and I get the expected output:

Schema is not None
Schema is a pydantic model

But the second time SmartScraperGraph is called, and subsequently also GenerateAnswerNode, I observe:

Schema is not None
llm is not Openai

That means there is a problem with the initialisation of the LLM, which gets lost along the way (since the SmartScraperGraph class is initialised just once). The wrong parser is selected, and this ends up in errors and weird behaviour.
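The mechanism behind this can be sketched with hypothetical stand-in classes (not the real scrapegraphai/langchain types): with_structured_output() returns a Runnable of a different type, and because the node stores it back into self.llm_model, the isinstance check only passes the first time:

```python
class StructuredWrapper:
    """What with_structured_output() hands back: a different type of object."""
    def __init__(self, llm, schema):
        self.llm, self.schema = llm, schema

class ChatOpenAI:
    """Stand-in for the real chat model class (illustrative only)."""
    def with_structured_output(self, schema):
        return StructuredWrapper(self, schema)

class GenerateAnswerNode:
    def __init__(self, llm_model):
        self.llm_model = llm_model

    def pick_parser(self):
        if isinstance(self.llm_model, ChatOpenAI):
            # BUG: the wrapper replaces the chat model on the node itself,
            # so this branch is never taken again
            self.llm_model = self.llm_model.with_structured_output(schema=dict)
            return "structured parser"
        return "JsonOutputParser"

node = GenerateAnswerNode(ChatOpenAI())
print(node.pick_parser())  # structured parser
print(node.pick_parser())  # JsonOutputParser  (wrong parser on every later run)
```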

@VinciGit00

I can add all the other models that should be able to use with_structured_output() (although I cannot test them all, since I do not have keys for every provider; I'll just rely on the documentation). But the initialisation issue is deeply rooted in the code and it would be difficult for me to fix.

@rjbks I'm guessing that with some specific pydantic versions the "wrong parser" is able to parse all the OpenAI output, while the correct one should work for Pydantic, langchain Pydantic and dict.

For the Google error, it is possible that adding ChatGoogleGenerativeAI here:

if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI)):

and fixing the initialisation will be enough to fix all the errors.

rjbks commented 1 month ago

@LorenzoPaleari so this does not cover the openai error where it expects a string but gets a pydantic model then right?

LorenzoPaleari commented 1 month ago

@rjbks

Gemini

I do not have keys for Gemini-Pro.

I have one fix in mind that can maybe work, if you are willing to try that for me.

In file scrapegraphai/nodes/generate_answer_node.py on line 93:

            if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI)):

modify it to:

            if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI, ChatGoogleGenerativeAI)):

The import is: from langchain_google_genai import ChatGoogleGenerativeAI

It should break at the second iteration now.

FIX: Why should it be better?

JsonOutputParser, which is currently used for most of the models apart from ChatOpenAI and ChatMistralAI, is actually not recommended by langchain. with_structured_output() is instead the way to go for correctly parsing the result.

I think that by changing to the better parser, we should see an improvement as soon as we also fix the initialisation error.

OpenAi

@LorenzoPaleari so this does not cover the openai error where it expects a string but gets a pydantic model then right?

It actually covers that error too. Since the LLM on the second iteration is not correctly instantiated, the isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI)) check fails and we end up using the wrong parser for ChatOpenAI, the same wrong parser that was also causing the error at the beginning.
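A fix consistent with this diagnosis is to stop overwriting self.llm_model and bind the structured runnable to a local variable instead. A self-contained sketch with hypothetical stand-in classes (not the real library types):

```python
class ChatOpenAI:
    """Hypothetical stand-in for the real chat model class."""
    def with_structured_output(self, schema):
        # the real langchain call returns a Runnable wrapper of a different type
        return ("structured", schema)

class GenerateAnswerNode:
    def __init__(self, llm_model):
        self.llm_model = llm_model

    def pick_parser(self):
        llm = self.llm_model  # local binding: the node's model stays untouched
        if isinstance(llm, ChatOpenAI):
            # the structured runnable would be used to build the chain here
            llm = llm.with_structured_output(schema=dict)
            return "structured parser"
        return "JsonOutputParser"

node = GenerateAnswerNode(ChatOpenAI())
# the same (correct) branch is now taken on every iteration
print(node.pick_parser())  # structured parser
print(node.pick_parser())  # structured parser
```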

rjbks commented 1 month ago

@LorenzoPaleari OpenAI still throws the same errors for both mini and 4o. For Google genai pro and flash, function calling is now an issue:

Model google_genai/gemini-1.5-pro not found, 
                  using default token size (8192)
Model google_genai/gemini-1.5-pro not found, 
                  using default token size (8192)
--- Executing SearchInternet Node ---
Search Query: "Centro Universitário Franciscano" medical school UNIFRA 2014..2018
--- Executing GraphIterator Node with batchsize 16 ---
processing graph instances:   0%|                                                                                                                                                                            | 0/3 [00:00<?, ?it/s]--- Executing Fetch Node ---
--- (Fetching HTML from: https://www.scielo.br/j/reben/a/GZ9xDNvrHZNL3XjScnXHWmD/?format=pdf&lang=en) ---
--- Executing Fetch Node ---
--- (Fetching HTML from: https://pubmed.ncbi.nlm.nih.gov/25590204/) ---
--- Executing Fetch Node ---
--- (Fetching HTML from: https://unifra.academia.edu/SolangeFagan) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
processing graph instances:   0%|                                                                                                                                                                            | 0/3 [00:00<?, ?it/s]
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
Traceback (most recent call last):
  File "/Users/rb/PycharmProjects/med_device_crawler/scripts/search_med_schools.py", line 59, in <module>
    res = sg.run()
          ^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/search_graph.py", line 120, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 253, in execute
    return self._execute_standard(initial_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 175, in _execute_standard
    raise e
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 159, in _execute_standard
    result = current_node.execute(state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/nodes/graph_iterator_node.py", line 72, in execute
    state = asyncio.run(self._async_execute(state, batchsize))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/nodes/graph_iterator_node.py", line 128, in _async_execute
    answers = await tqdm.gather(
              ^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/tqdm/asyncio.py", line 79, in gather
    res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
           ^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/asyncio/tasks.py", line 631, in _wait_for_one
    return f.result()  # May raise f.exception().
           ^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/tqdm/asyncio.py", line 76, in wrap_awaitable
    return i, await f
              ^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/nodes/graph_iterator_node.py", line 117, in _async_run
    return await asyncio.to_thread(graph.run)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 114, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 253, in execute
    return self._execute_standard(initial_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 175, in _execute_standard
    raise e
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 159, in _execute_standard
    result = current_node.execute(state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/nodes/generate_answer_node.py", line 95, in execute
    self.llm_model = self.llm_model.with_structured_output(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_google_genai/chat_models.py", line 1186, in with_structured_output
    raise ValueError(f"Received unsupported arguments {kwargs}")
ValueError: Received unsupported arguments {'method': 'function_calling'}
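The traceback above shows that the Gemini integration's with_structured_output rejects the method="function_calling" kwarg that the OpenAI/Mistral path passes. One way to guard the kwarg per provider, sketched with hypothetical stand-in classes (not the real langchain API surface; only the rejection of unknown kwargs mirrors the traceback):

```python
class ChatOpenAI:
    """Stand-in: accepts the `method` argument."""
    def with_structured_output(self, schema, method=None):
        return ("openai-structured", method)

class ChatGoogleGenerativeAI:
    """Stand-in: rejects unknown kwargs, as in the traceback above."""
    def with_structured_output(self, schema, **kwargs):
        if kwargs:
            raise ValueError(f"Received unsupported arguments {kwargs}")
        return ("gemini-structured", None)

def structured_llm(llm, schema):
    # only providers known to accept it get method="function_calling"
    kwargs = {"method": "function_calling"} if isinstance(llm, ChatOpenAI) else {}
    return llm.with_structured_output(schema, **kwargs)

print(structured_llm(ChatOpenAI(), dict))              # ('openai-structured', 'function_calling')
print(structured_llm(ChatGoogleGenerativeAI(), dict))  # ('gemini-structured', None)
```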

LorenzoPaleari commented 1 month ago

@rjbks @VinciGit00 I was able to fix the error for both ChatOpenAI and GoogleGenAI (tested on gemini-1.5-pro).

I will open a pull request for this error. In the end it was related to the initialisation problem.

rjbks commented 1 month ago

@rjbks @VinciGit00 I was able to fix the error for both ChatOpenAI and GoogleGenAI (tested on gemini-1.5-pro).

I will open a pull request for this error. In the end it was related to the initialisation problem.

@LorenzoPaleari Is this https://github.com/ScrapeGraphAI/Scrapegraph-ai/pull/662 the PR?

LorenzoPaleari commented 1 month ago

@rjbks https://github.com/ScrapeGraphAI/Scrapegraph-ai/pull/664


VinciGit00 commented 1 month ago

ok please guys update to the new beta

rjbks commented 1 month ago

This issue still persists in the new release 1.19.0 on my Windows machine (it works on my Mac), and also on 1.19.0b9, but switching to 1.19.0b11 fixed it. Same setup, code, versions: the text key is expected to hold a string value.

Did a full env re-install to be sure it wasn't a version issue. Only reverting to 1.19.0b11 fixed it.

@LorenzoPaleari

VinciGit00 commented 1 month ago

It should work on the beta

LorenzoPaleari commented 1 month ago

This issue still persists in the new release 1.19.0 on my Windows machine (it works on my Mac), and also on 1.19.0b9, but switching to 1.19.0b11 fixed it. Same setup, code, versions: the text key is expected to hold a string value.

Did a full env re-install to be sure it wasn't a version issue. Only reverting to 1.19.0b11 fixed it.

@LorenzoPaleari

1.19.0 is behind 1.19.0-beta11 in content, so if it works in beta11 that is fine. It should also work in beta10, where the fix was merged.
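The version phrasing is worth unpacking: under PEP 440 ordering, 1.19.0b11 is a pre-release of 1.19.0, so pip considers 1.19.0 the newer version even though the final was cut before the beta10/11 fix was merged; a plain upgrade would therefore move users off the fixed build. A quick check with the packaging library (assuming it is installed):

```python
from packaging.version import Version

# a bNN pre-release always sorts before its final release
assert Version("1.19.0b11") < Version("1.19.0")

print(sorted(["1.19.0", "1.19.0b9", "1.19.0b11"], key=Version))
# ['1.19.0b9', '1.19.0b11', '1.19.0']
```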