Closed bezineb5 closed 1 month ago
@rjbks
Output from above with gemini 1.5 pro (openai 4o-mini still causes the original error, where it expects a string but gets a pydantic model):
Thank you! Can you also share the full output of gpt-4o-mini?
@LorenzoPaleari
Certainly:
--- Executing SearchInternet Node ---
Search Query: Centro Universitário Franciscano UNIFRA medical school information 2014-2018
--- Executing GraphIterator Node with batchsize 1 ---
Running graph instance for https://caper.ca/sites/default/files/pdf/CAPER_MedicalSchools_September_2022.xlsx
--- Executing Fetch Node ---
--- (Fetching HTML from: https://caper.ca/sites/default/files/pdf/CAPER_MedicalSchools_September_2022.xlsx) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
Running graph instance for https://ufn.academia.edu/ErickKaderCallegaro/CurriculumVitae
--- Executing Fetch Node ---
--- (Fetching HTML from: https://ufn.academia.edu/ErickKaderCallegaro/CurriculumVitae) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
Traceback (most recent call last):
File "/Users/rb/PycharmProjects/med_device_crawler/scripts/search_med_schools.py", line 59, in <module>
res = sg.run()
^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/search_graph.py", line 120, in run
self.final_state, self.execution_info = self.graph.execute(inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 253, in execute
return self._execute_standard(initial_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 175, in _execute_standard
raise e
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 159, in _execute_standard
result = current_node.execute(state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/nodes/graph_iterator_node.py", line 64, in execute
state = self._async_execute(state, batchsize)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/nodes/graph_iterator_node.py", line 115, in _async_execute
futures.append(graph.run())
^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 114, in run
self.final_state, self.execution_info = self.graph.execute(inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 253, in execute
return self._execute_standard(initial_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 175, in _execute_standard
raise e
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 159, in _execute_standard
result = current_node.execute(state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/nodes/generate_answer_node.py", line 136, in execute
answer = chain.invoke({"question": user_prompt})
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 2878, in invoke
input = context.run(step.invoke, input, config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/base.py", line 192, in invoke
return self._call_with_config(
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 1785, in _call_with_config
context.run(
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/runnables/config.py", line 398, in call_func_with_variable_args
return func(input, **kwargs) # type: ignore[call-arg]
^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/base.py", line 193, in <lambda>
lambda inner_input: self.parse_result([Generation(text=inner_input)]),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/load/serializable.py", line 113, in __init__
super().__init__(*args, **kwargs)
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/pydantic/v1/main.py", line 341, in __init__
raise validation_error
pydantic.v1.error_wrappers.ValidationError: 1 validation error for Generation
text
str type expected (type=type_error.str)
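The final `ValidationError` above can be reproduced without langchain at all. A minimal stand-alone sketch of the failure mode: the structured-output LLM already returns a parsed object, but a downstream step wraps whatever it receives in `Generation(text=...)`, which validates that `text` is a `str`. `Generation` and `ParsedAnswer` below are simplified stand-ins, not the real langchain/pydantic classes.

```python
class ValidationError(Exception):
    pass

class Generation:
    """Simplified stand-in for langchain_core's Generation."""
    def __init__(self, text):
        if not isinstance(text, str):
            raise ValidationError("1 validation error for Generation: "
                                  "str type expected (type=type_error.str)")
        self.text = text

class ParsedAnswer:
    """Stand-in for an instance of the user's pydantic schema."""

# Raw LLM text is a str, so wrapping it is fine:
Generation(text='{"school": "UNIFRA"}')

# An already-parsed model instance is not a str, so validation rejects it:
try:
    Generation(text=ParsedAnswer())
except ValidationError as exc:
    print(exc)
```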
@rjbks @VinciGit00 Perfect! Thank you for the log!
OpenAI error: it breaks on the second iteration, confirming my theory that the problem is in the LLM instantiation.
--- Executing SearchInternet Node ---
Search Query: Centro Universitário Franciscano UNIFRA medical school information 2014-2018
### First iteration started ###
--- Executing GraphIterator Node with batchsize 1 ---
Running graph instance for https://caper.ca/sites/default/files/pdf/CAPER_MedicalSchools_September_2022.xlsx
--- Executing Fetch Node ---
--- (Fetching HTML from: https://caper.ca/sites/default/files/pdf/CAPER_MedicalSchools_September_2022.xlsx) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
### First iteration ended ###
### Second iteration started ###
Running graph instance for https://ufn.academia.edu/ErickKaderCallegaro/CurriculumVitae
--- Executing Fetch Node ---
--- (Fetching HTML from: https://ufn.academia.edu/ErickKaderCallegaro/CurriculumVitae) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
Traceback (most recent call last):
File "/Users/rb/PycharmProjects/med_device_crawler/scripts/search_med_schools.py", line 59, in <module>
res = sg.run()
^^^^^^^^
[Not relevant]
In GenerateAnswerNode we have a piece of code that chooses the correct parser for the LLMs (it needs to be updated, since most models now support these methods, but that's not the issue here):
```python
if self.node_config.get("schema", None) is not None:
    print("Schema is not None")
    if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI)):
        self.llm_model = self.llm_model.with_structured_output(
            schema=self.node_config["schema"],
            method="function_calling")  # json schema works only on specific models

        # default parser to empty function
        def output_parser(x):
            return x

        if is_basemodel_subclass(self.node_config["schema"]):
            print("Schema is a pydantic model")
            output_parser = dict
        format_instructions = "NA"
    else:
        print("llm is not Openai")
        output_parser = JsonOutputParser(pydantic_object=self.node_config["schema"])
        format_instructions = output_parser.get_format_instructions()
else:
    output_parser = JsonOutputParser()
    format_instructions = output_parser.get_format_instructions()
```
I added the prints.
What happens here: on the first iteration of SmartScraperGraph, the LLM is correctly initialised and I get the expected output:

Schema is not None
Schema is a pydantic model

But the second time SmartScraperGraph is called (and subsequently also GenerateAnswerNode), I observe:

Schema is not None
llm is not Openai

That means there is a problem with the initialisation of the LLM, which gets lost along the way (since the SmartScraperGraph class is initialised just once). The wrong parser is selected, and this ends up in errors and weird behaviour.
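The "LLM gets lost along the way" behaviour can be sketched without langchain. Note that in the snippet above the node assigns `self.llm_model = self.llm_model.with_structured_output(...)`, so after the first call the attribute no longer holds a `ChatOpenAI` instance and the `isinstance()` branch flips. All class names below are stand-ins for the real langchain/scrapegraphai ones, to illustrate the mechanism only.

```python
class ChatOpenAI:
    """Stand-in for langchain_openai.ChatOpenAI."""

class StructuredRunnable:
    """Stand-in for the runnable returned by with_structured_output()."""

class GenerateAnswerNode:
    def __init__(self, llm_model):
        self.llm_model = llm_model
        self.branches = []          # records which parser branch ran

    def execute(self):
        if isinstance(self.llm_model, ChatOpenAI):
            # BUG: the node's shared llm_model is replaced, so the next
            # iteration no longer sees a ChatOpenAI instance.
            self.llm_model = StructuredRunnable()
            self.branches.append("structured_output")
        else:
            self.branches.append("json_output_parser")

node = GenerateAnswerNode(ChatOpenAI())
node.execute()   # first iteration: isinstance() is True
node.execute()   # second iteration: llm_model is now StructuredRunnable
print(node.branches)
```

A straightforward fix for this particular mechanism would be to keep the structured runnable in a local variable instead of overwriting `self.llm_model`.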
@VinciGit00
I can add all the other models that should be able to use with_structured_output() (although I cannot test them all, as I do not have keys for every provider; I'll rely on the documentation). But the initialisation issue is deeply rooted in the code and it would be difficult for me to fix.

@rjbks I'm guessing that with some specific pydantic version the "wrong parser" is able to parse all the OpenAI output, while the correct one should work for Pydantic, langchain Pydantic and dict.
For the Google error, it is possible that adding ChatGoogleGenerativeAI here:

if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI)):

and fixing the initialisation will be enough to fix all the errors.
@LorenzoPaleari so this does not cover the openai error where it expects a string but gets a pydantic model then right?
@rjbks
I do not have keys for Gemini-Pro.
I have one fix in mind that may work, if you are willing to try it for me.
In file scrapegraphai/nodes/generate_answer_node.py, on line 93:

if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI)):

modify it to:

if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI, ChatGoogleGenerativeAI)):

The import is:

from langchain_google_genai import ChatGoogleGenerativeAI

It should now break only at the second iteration.
FIX: why should this be better?
JsonOutputParser, which is currently used for most of the models apart from ChatOpenAI and ChatMistralAI, is actually not recommended by langchain; with_structured_output() is instead the way to go for correctly parsing the result.
I think that by switching to the better parser we should see an improvement as soon as we also fix the initialisation error.
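The difference between the two approaches can be illustrated with plain Python (these are simplified stand-ins, not the real langchain parsers): prompt-based JSON parsing must extract valid JSON from free-form model text and fails on any stray prose, while provider-enforced structured output hands back an already-validated object.

```python
import json

# What a chatty model might actually return when merely *asked* for JSON:
raw_llm_text = 'Sure! Here is the JSON: {"school": "UNIFRA"}'

def naive_json_parser(text):
    # Mimics the fragile path: expects the text itself to be valid JSON.
    # (The real JsonOutputParser is more forgiving, but the failure mode
    # is the same in spirit.)
    return json.loads(text)

try:
    naive_json_parser(raw_llm_text)
except json.JSONDecodeError:
    print("prompt-based JSON parsing is fragile")

# With structured output, the provider enforces the schema server-side and
# the client receives an already-parsed object:
structured_result = {"school": "UNIFRA"}
print(structured_result["school"])
```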
> @LorenzoPaleari so this does not cover the openai error where it expects a string but gets a pydantic model then right?
It actually covers that error as well.
Since the LLM on the second iteration is not correctly instantiated, the check if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI)): will fail and we end up using the wrong parser for ChatOpenAI, the same wrong parser that was also causing the error at the beginning.
@LorenzoPaleari OpenAI still throws the same errors for both 4o-mini and 4o. For Google genai (pro and flash), function calling is now an issue:
Model google_genai/gemini-1.5-pro not found, using default token size (8192)
Model google_genai/gemini-1.5-pro not found, using default token size (8192)
--- Executing SearchInternet Node ---
Search Query: "Centro Universitário Franciscano" medical school UNIFRA 2014..2018
--- Executing GraphIterator Node with batchsize 16 ---
processing graph instances: 0%| | 0/3 [00:00<?, ?it/s]--- Executing Fetch Node ---
--- (Fetching HTML from: https://www.scielo.br/j/reben/a/GZ9xDNvrHZNL3XjScnXHWmD/?format=pdf&lang=en) ---
--- Executing Fetch Node ---
--- (Fetching HTML from: https://pubmed.ncbi.nlm.nih.gov/25590204/) ---
--- Executing Fetch Node ---
--- (Fetching HTML from: https://unifra.academia.edu/SolangeFagan) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
processing graph instances: 0%| | 0/3 [00:00<?, ?it/s]
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
Traceback (most recent call last):
File "/Users/rb/PycharmProjects/med_device_crawler/scripts/search_med_schools.py", line 59, in <module>
res = sg.run()
^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/search_graph.py", line 120, in run
self.final_state, self.execution_info = self.graph.execute(inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 253, in execute
return self._execute_standard(initial_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 175, in _execute_standard
raise e
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 159, in _execute_standard
result = current_node.execute(state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/nodes/graph_iterator_node.py", line 72, in execute
state = asyncio.run(self._async_execute(state, batchsize))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/nodes/graph_iterator_node.py", line 128, in _async_execute
answers = await tqdm.gather(
^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/tqdm/asyncio.py", line 79, in gather
res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/asyncio/tasks.py", line 631, in _wait_for_one
return f.result() # May raise f.exception().
^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/tqdm/asyncio.py", line 76, in wrap_awaitable
return i, await f
^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/nodes/graph_iterator_node.py", line 117, in _async_run
return await asyncio.to_thread(graph.run)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/asyncio/threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 114, in run
self.final_state, self.execution_info = self.graph.execute(inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 253, in execute
return self._execute_standard(initial_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 175, in _execute_standard
raise e
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 159, in _execute_standard
result = current_node.execute(state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/nodes/generate_answer_node.py", line 95, in execute
self.llm_model = self.llm_model.with_structured_output(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_google_genai/chat_models.py", line 1186, in with_structured_output
raise ValueError(f"Received unsupported arguments {kwargs}")
ValueError: Received unsupported arguments {'method': 'function_calling'}
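The traceback above shows Gemini's wrapper rejecting the `method="function_calling"` kwarg. One hedged way around it would be to pass that kwarg only to the wrappers known to accept it. Everything below is a hypothetical stand-in sketch (neither scrapegraphai nor langchain code) that mimics the observed behaviour:

```python
class ChatOpenAI:
    """Stand-in: accepts an explicit structured-output method."""
    def with_structured_output(self, schema, method=None):
        return ("openai-structured", schema, method)

class ChatGoogleGenerativeAI:
    """Stand-in: mimics langchain_google_genai rejecting extra kwargs."""
    def with_structured_output(self, schema, **kwargs):
        if kwargs:
            raise ValueError(f"Received unsupported arguments {kwargs}")
        return ("gemini-structured", schema)

def structured_llm_for(llm, schema):
    # Hypothetical helper: build per-provider kwargs instead of always
    # passing method="function_calling".
    kwargs = {}
    if isinstance(llm, ChatOpenAI):
        kwargs["method"] = "function_calling"
    return llm.with_structured_output(schema, **kwargs)

schema = {"title": "Answer"}
print(structured_llm_for(ChatOpenAI(), schema))
print(structured_llm_for(ChatGoogleGenerativeAI(), schema))  # no ValueError
```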
@rjbks @VinciGit00 I was able to fix the error for both ChatOpenAI and GoogleGenAI (tested on gemini-1.5-pro).
I will open a pull request for this error. At the end it was related with the initialisation problem.
@LorenzoPaleari
Is this https://github.com/ScrapeGraphAI/Scrapegraph-ai/pull/662
the PR?
@rjbks
https://github.com/ScrapeGraphAI/Scrapegraph-ai/pull/664
ok please guys update to the new beta
This issue still persists in the new release 1.19.0 on my Windows machine (it works on my Mac), and also on 1.19.0b9, but switching to 1.19.0b11 fixed it. Same setup, code, versions. The text key is expected to hold a string value.
I did a full env re-install to be sure it wasn't a version issue. Only reverting to 1.19.0b11 fixed it.
@LorenzoPaleari
It should work on the beta
> This issue still persists in the new release 1.19.0, on my windows machine (works on my mac), also on 1.19.0b9, but switching to 1.19.0b11 fixed it. Same setup, code, versions. text key is expected to hold string value.
> Did a full env re-install to be sure it wasn't a version issue. Only reverting to 1.19.0b11 fixed it.
@LorenzoPaleari
1.19.0 is behind 1.19.0-beta11, so if it works in beta11 that's fine. It should also work in beta10, where the fix was merged.
Describe the bug
Since version 1.14.0 (tested here on 1.15.0), SmartScraperGraph on OpenAI stopped working with a pydantic-related error.

To Reproduce
Run the example: https://github.com/ScrapeGraphAI/Scrapegraph-ai/blob/main/examples/openai/smart_scraper_schema_openai.py

I also tried updating to the latest langchain, but that didn't help. It could be langchain-related, as I can see similar (but not exactly the same) issues on their tracker. Or it's just a mix of pydantic versions.

Expected behavior
It should succeed.