Closed bezineb5 closed 6 days ago
Thanks for reporting this. I've had similar issues myself while working on a side project.
I managed to fix my problem by tinkering with the system prompt. I'll see if I can do the same here, and let you know.
Can you share the code, please? Are you using a schema for the input?
I just used the example: https://github.com/ScrapeGraphAI/Scrapegraph-ai/blob/main/examples/openai/smart_scraper_schema_openai.py
In my own code, I'm only using a schema for the output, nothing special for the input:
SmartScraperGraph(
    prompt="This page contains a list of items, return the urls of the individual detail pages.",
    source=url,
    config=base_graph_config,
    schema=bronze.UrlsList,
)
Can you use langchain's pydantic (from langchain_core.pydantic_v1 import BaseModel, Field, validator) instead of pydantic?
I get a similar error:
--- Executing GenerateAnswer Node ---
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/llm_scraper/scrape.py", line 182, in <module>
main()
File "/llm_scraper/scrape.py", line 177, in main
for url in _iterate_provider_list(provider_id, base_graph_config):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/llm_scraper/scrape.py", line 125, in _iterate_provider_list
result = script_creator_graph.run()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 114, in run
self.final_state, self.execution_info = self.graph.execute(inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 263, in execute
return self._execute_standard(initial_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 185, in _execute_standard
raise e
File "/venv/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 169, in _execute_standard
result = current_node.execute(state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/scrapegraphai/nodes/generate_answer_node.py", line 129, in execute
answer = chain.invoke({"question": user_prompt})
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 2878, in invoke
input = context.run(step.invoke, input, config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 5092, in invoke
return self.bound.invoke(
^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/langchain_core/language_models/chat_models.py", line 276, in invoke
self.generate_prompt(
File "/venv/lib/python3.12/site-packages/langchain_core/language_models/chat_models.py", line 776, in generate_prompt
return self.generate(prompt_messages, stop=stop, callbacks=callbacks, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/langchain_core/language_models/chat_models.py", line 633, in generate
raise e
File "/venv/lib/python3.12/site-packages/langchain_core/language_models/chat_models.py", line 623, in generate
self._generate_with_cache(
File "/venv/lib/python3.12/site-packages/langchain_core/language_models/chat_models.py", line 845, in _generate_with_cache
result = self._generate(
^^^^^^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/langchain_openai/chat_models/base.py", line 629, in _generate
response = self.root_client.beta.chat.completions.parse(**payload)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/openai/resources/beta/chat/completions.py", line 118, in parse
response_format=_type_to_response_format(response_format),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/openai/lib/_parsing/_completions.py", line 245, in type_to_response_format_param
raise TypeError(f"Unsupported response_format type - {response_format}")
TypeError: Unsupported response_format type - <class 'llm_scraper.models.bronze.UrlsList'>
Did you manage to reproduce using the example?
can I have the code?
please try to use from langchain_core.pydantic_v1 import BaseModel, Field, validator instead of from pydantic import BaseModel, Field
I did, the result is here: https://github.com/ScrapeGraphAI/Scrapegraph-ai/issues/598#issuecomment-2318667354
can I have the code?
Yes, it is available here: https://github.com/ScrapeGraphAI/Scrapegraph-ai/blob/main/examples/openai/smart_scraper_schema_openai.py
It really looks like a langchain issue, or a misconfiguration. Did anything change in the chain/output parsing?
We changed to langchain pydantic
Yeah, but langchain is fine with pydantic v2, and I tested pydantic v1 without success.
However, I investigated and found that there is an issue in GenerateAnswerNode: both self.llm_model.with_structured_output and JsonOutputParser are used, but with_structured_output already adds an output parser: https://github.com/langchain-ai/langchain/blob/fabd3295fabb4c79fedb4dbbe725a308658ef8d8/libs/partners/openai/langchain_openai/chat_models/base.py#L1414C25-L1414C36
So it's effectively trying to parse a Pydantic object.
So you can either ask the LLM class to return a structured output: https://github.com/ScrapeGraphAI/Scrapegraph-ai/blame/a96617d6f88f7370ecb7c58d0a62d3bdc0d80b31/scrapegraphai/nodes/generate_answer_node.py#L103 or provide an output parser: https://github.com/ScrapeGraphAI/Scrapegraph-ai/blame/a96617d6f88f7370ecb7c58d0a62d3bdc0d80b31/scrapegraphai/nodes/generate_answer_node.py#L133C47-L133C47
(Side note: it seems that json_schema mode is only implemented for OpenAI chats, not for Mistral or others, but langchain will silently ignore it: https://github.com/ScrapeGraphAI/Scrapegraph-ai/blame/a96617d6f88f7370ecb7c58d0a62d3bdc0d80b31/scrapegraphai/nodes/generate_answer_node.py#L100)
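The double-parsing problem described above can be sketched without LangChain at all. This is a toy reproduction (StructuredModel and json_output_parser are made-up stand-ins, not the real LangChain classes), showing why feeding an already-structured result into a second JSON parser fails:

```python
import json

# Toy stand-ins for the two stages chained in GenerateAnswerNode:
# a model wrapper that already parses its output into an object,
# followed by a separate JSON parsing step.
class StructuredModel:
    """Mimics llm_model.with_structured_output: returns a parsed object, not text."""
    def invoke(self, prompt):
        return {"urls": ["https://example.com/item/1"]}  # already a dict

def json_output_parser(text):
    """Mimics JsonOutputParser: expects the raw model output as a str."""
    return json.loads(text)

answer = StructuredModel().invoke("question")
try:
    json_output_parser(answer)  # second parse receives an object, not a string
except TypeError as exc:
    print("double parsing failed:", exc)
```

Either stage alone works; it is the combination that raises, which matches the TypeError seen in the tracebacks here.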
I confirm that commenting out the block here https://github.com/ScrapeGraphAI/Scrapegraph-ai/blame/a96617d6f88f7370ecb7c58d0a62d3bdc0d80b31/scrapegraphai/nodes/generate_answer_node.py#L103 fixed the issue.
I can confirm that using from langchain_core.pydantic_v1 import BaseModel, Field, validator and running the above example
https://github.com/ScrapeGraphAI/Scrapegraph-ai/blob/pre/beta/examples/openai/smart_scraper_schema_openai.py
results in
--- Executing Fetch Node ---
--- (Fetching HTML from: https://perinim.github.io/projects/) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
Traceback (most recent call last):
File "/Users/gareth/Code/Python/citify/test.py", line 48, in <module>
result = smart_scraper_graph.run()
File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 114, in run
self.final_state, self.execution_info = self.graph.execute(inputs)
File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/scrapegraphai/graphs/base_graph.py", line 263, in execute
return self._execute_standard(initial_state)
File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/scrapegraphai/graphs/base_graph.py", line 185, in _execute_standard
raise e
File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/scrapegraphai/graphs/base_graph.py", line 169, in _execute_standard
result = current_node.execute(state)
File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/scrapegraphai/nodes/generate_answer_node.py", line 134, in execute
answer = chain.invoke({"question": user_prompt})
File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/langchain_core/runnables/base.py", line 2878, in invoke
input = context.run(step.invoke, input, config)
File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/langchain_core/runnables/base.py", line 5092, in invoke
return self.bound.invoke(
File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/langchain_core/language_models/chat_models.py", line 276, in invoke
self.generate_prompt(
File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/langchain_core/language_models/chat_models.py", line 776, in generate_prompt
return self.generate(prompt_messages, stop=stop, callbacks=callbacks, **kwargs)
File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/langchain_core/language_models/chat_models.py", line 633, in generate
raise e
File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/langchain_core/language_models/chat_models.py", line 623, in generate
self._generate_with_cache(
File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/langchain_core/language_models/chat_models.py", line 845, in _generate_with_cache
result = self._generate(
File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/langchain_openai/chat_models/base.py", line 629, in _generate
response = self.root_client.beta.chat.completions.parse(**payload)
File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/openai/resources/beta/chat/completions.py", line 118, in parse
response_format=_type_to_response_format(response_format),
File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/openai/lib/_parsing/_completions.py", line 255, in type_to_response_format_param
raise TypeError(f"Unsupported response_format type - {response_format}")
TypeError: Unsupported response_format type - <class '__main__.Projects'>
Hi, we will fix it in the next few hours.
It is still broken in version 1.16.0b2.
There are two possible fixes for this error; I tried them locally and they both work.
FIX 1 GenerateAnswerNode - Line 89
if self.node_config.get("schema", None) is not None:
    # if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI)):
    #     self.llm_model = self.llm_model.with_structured_output(
    #         schema=self.node_config["schema"],
    #         method="json_schema")
    # else:
    output_parser = JsonOutputParser(pydantic_object=self.node_config["schema"])
else:
    output_parser = JsonOutputParser()
The fix is commenting out the with_structured_output part. Commenting out this part, I tested smart_scraper_schema_openai both with pydantic and langchain_core.pydantic_v1 and it works.
FIX 2 GenerateAnswerNode L89
if self.node_config.get("schema", None) is not None:
    if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI)):
        self.llm_model = self.llm_model.with_structured_output(
            schema=self.node_config["schema"],
            method="json_schema")
    else:
        output_parser = JsonOutputParser(pydantic_object=self.node_config["schema"])
        format_instructions = output_parser.get_format_instructions()
else:
    output_parser = JsonOutputParser()
    format_instructions = output_parser.get_format_instructions()
Firstly, format instructions are no longer necessary when using with_structured_output, so the format-instructions line must be moved inside the else branch. Then every invocation of the llm should be modified as follows:
prompt = PromptTemplate(
    template=template_no_chunks_prompt,
    input_variables=["question"],
    partial_variables={"context": doc})
chain = prompt | self.llm_model  # | output_parser
answer = chain.invoke({"question": user_prompt})
i.e., with the output parser and the format instructions removed.
With this solution there are 3 problems:
- Doubling all the prompts to have versions with and without format instructions.
- It needs a custom parser at the end to use the answer; the output parser from langchain cannot be used with this mode, so the answer is a well-formatted string that is missing all the JSON annotations:
projects=[Project(title='Rotary Pendulum RL', description='Open Source project aimed at controlling a real life rotary pendulum using RL algorithms.'), Project(title='DQN Implementation from scratch', description='Developed a Deep Q-Network algorithm to train a simple and double pendulum.'), Project(title='Multi Agents HAED', description='University project which focuses on simulating a multi-agent system to perform environment mapping. Agents, equipped with sensors, explore and record their surroundings, considering uncertainties in their readings.'), Project(title='Wireless ESC for Modular Drones', description='Modular drone architecture proposal and proof of concept. The project received maximum grade.')]
- It only works when creating the schema with pydantic; langchain_core.pydantic_v1 does not work here.
Hi, please update to the new version.
Can confirm the same error persists in 1.16.0:
Traceback (most recent call last):
File "/workspaces/Marsala/marsala.py", line 47, in <module>
result = smart_scraper_graph.run()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 114, in run
self.final_state, self.execution_info = self.graph.execute(inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 263, in execute
return self._execute_standard(initial_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 184, in _execute_standard
raise e
File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 168, in _execute_standard
result = current_node.execute(state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/scrapegraphai/nodes/generate_answer_node.py", line 134, in execute
answer = chain.invoke({"question": user_prompt})
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 2878, in invoke
input = context.run(step.invoke, input, config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 5092, in invoke
return self.bound.invoke(
^^^^^^^^^^^^^^^^^^
File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/langchain_core/language_models/chat_models.py", line 277, in invoke
self.generate_prompt(
File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/langchain_core/language_models/chat_models.py", line 777, in generate_prompt
return self.generate(prompt_messages, stop=stop, callbacks=callbacks, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/langchain_core/language_models/chat_models.py", line 634, in generate
raise e
File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/langchain_core/language_models/chat_models.py", line 624, in generate
self._generate_with_cache(
File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/langchain_core/language_models/chat_models.py", line 846, in _generate_with_cache
result = self._generate(
^^^^^^^^^^^^^^^
File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/langchain_openai/chat_models/base.py", line 652, in _generate
response = self.root_client.beta.chat.completions.parse(**payload)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/openai/resources/beta/chat/completions.py", line 118, in parse
response_format=_type_to_response_format(response_format),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/openai/lib/_parsing/_completions.py", line 255, in type_to_response_format_param
raise TypeError(f"Unsupported response_format type - {response_format}")
TypeError: Unsupported response_format type - <class '__main__.Companies'>
Code:
import scrapegraphai
import json
from typing import List
#from pydantic import BaseModel, Field
from langchain_core.pydantic_v1 import BaseModel, Field
from pydantic import SecretStr
from pydantic_settings import BaseSettings, SettingsConfigDict
from scrapegraphai.utils import prettify_exec_info
from scrapegraphai.graphs import SmartScraperGraph, OmniScraperGraph

class Settings(BaseSettings):
    openai_api_key: SecretStr
    model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8')

class Company(BaseModel):
    company: str = Field(description="Company name")
    description: str = Field(description="Company description")
    email: str = Field(description="Company email")

class Companies(BaseModel):
    companies: List[Company]

if __name__ == "__main__":
    # Define the configuration for the scraping pipeline
    settings = Settings()
    graph_config = {
        "llm": {
            "api_key": settings.openai_api_key.get_secret_value(),
            "model": "openai/gpt-4o-mini",
        },
        "verbose": True,
        "headless": True,
    }

    # Create the SmartScraperGraph instance
    smart_scraper_graph = SmartScraperGraph(
        prompt="Find some information about what the company does, the name and a contact email.",
        source="https://scrapegraphai.com/",
        schema=Companies,
        config=graph_config
    )

    # Run the pipeline
    result = smart_scraper_graph.run()
    print(result)
    #print(json.dumps(result, indent=4))

    graph_exec_info = smart_scraper_graph.get_execution_info()
    print(prettify_exec_info(graph_exec_info))
Hi, I wanted to let you know that this comment is still valid for version 1.17.0, the latest one on branch pre/beta.
I would like the second one, can you implement it please?
I'm implementing it.
Ok, thanks. Is it ok now?
I tested it on my end with and without schema and it works.
The schema can be implemented with pydantic, langchain_core.pydantic_v1, or TypedDict.
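For illustration, here is a minimal sketch of the TypedDict variant (stdlib only; the Project/Projects names mirror the example schema from this thread, and the pydantic / langchain_core.pydantic_v1 variants would subclass BaseModel with Field descriptions instead):

```python
from typing import List, TypedDict

# Plain TypedDict schema — one of the three supported styles.
class Project(TypedDict):
    title: str
    description: str

class Projects(TypedDict):
    projects: List[Project]

# Hypothetical usage: SmartScraperGraph(..., schema=Projects)
sample: Projects = {"projects": [{"title": "Rotary Pendulum RL",
                                  "description": "RL control demo"}]}
print(sample["projects"][0]["title"])
```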
Perfect, so I will close the issue. If someone still has problems I will update it.
Hi @VinciGit00, the issue persists
Hi, did you test with the 1.17.0b5 version?
I was using 1.16. It is working fine on 1.17.0b5, thanks!
I'm happy to hear that
This issue persists in 1.18.1
Relevant packages: langchain-openai==0.1.23 openai==1.44.1 pydantic==2.9.1 pydantic_core==2.23.3 scrapegraphai==1.18.1
Error:
File /opt/anaconda3/envs/med_device/lib/python3.12/site-packages/pydantic/v1/main.py:341, in BaseModel.__init__(__pydantic_self__, **data)
    339 values, fields_set, validation_error = validate_model(__pydantic_self__.__class__, data)
    340 if validation_error:
--> 341     raise validation_error
    342 try:
    343     object_setattr(__pydantic_self__, '__dict__', values)
ValidationError: 1 validation error for Generation
text
  str type expected (type=type_error.str)
Using from langchain_core.pydantic_v1 import BaseModel, Field raises an error from the openai module. Some digging there shows the openai client no longer supports older versions of pydantic.
EDIT1:
The problem seems to arise from langchain_core/load/serializable.py:113, where it initializes the langchain_core.outputs.generation.Generation pydantic model with {'text': MyScrapeGraphAIPydanticModel(...)}, and this call in the BaseModel __init__ function (pydantic/v1/main.py:339):
values, fields_set, validation_error = validate_model(__pydantic_self__.__class__, data)
yields:
{'generation_info': None, 'type': 'Generation'}, {'text'}, Error
Except the 'text' field is expected to be a str, but it is my pydantic model passed in as the schema arg.
EDIT2:
Seems to be fixed in v1.19.0-beta.2
v1.18.1 does not contain the fix.
It is present in v1.19.0-beta.1+; if it works there it should be fine.
While it seems solved in the JSONScraperGraph with v1.19.0-beta.1+, the SearchGraph (and presumably the underlying SmartScraperGraph) still has this issue. Same setup and error as my previous post above, except updated to scrapegraphai-1.19.0b7.
EDIT1:
Just tried the SearchGraph example and Google models do not work; I used the code from the examples: https://github.com/ScrapeGraphAI/Scrapegraph-ai/blob/main/examples/google_genai/search_graph_schema_gemini.py. It returns this error:
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/utils/tokenizer.py", line 26, in num_tokens_calculus raise NotImplementedError(f"There is no tokenization implementation for model '{llm_model}'")
While this works for JSONScraperGraph, it does not work for SearchGraph.
Hi, please update to the new beta.
Just updated to the beta you released an hour ago. The Google genai model is now getting:
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/json.py", line 90, in parse_result
raise OutputParserException(msg, llm_output=text) from e
langchain_core.exceptions.OutputParserException: Invalid json output:
Here is the text string it failed on (escapes are mine):
\`\`\`json
{
"matches": [
{
"name": "JSS Medical College - India",
"is_med_school": true,
"city": "Mysore",
"state": "Karnataka",
"country": "India"
}
]
}
\`\`\`
This looks correct to me; the langchain parse_json_markdown function seems to have something to strip the markdown, but it still throws a JSON decode error.
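For reference, stripping the fence before parsing can be sketched in a few lines (parse_fenced_json is a hypothetical re-implementation for illustration, not LangChain's actual parse_json_markdown):

```python
import json
import re

def parse_fenced_json(text: str):
    """Strip an optional ```json ... ``` fence, then parse the payload."""
    match = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)

raw = '```json\n{"matches": [{"name": "JSS Medical College"}]}\n```'
print(parse_fenced_json(raw)["matches"][0]["name"])  # JSS Medical College
```

With well-formed JSON inside the fence this succeeds, which supports the suspicion that the decode error comes from the payload itself rather than the markdown stripping.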
The OpenAI model still throws the same error as before:
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/pydantic/v1/main.py", line 341, in __init__
raise validation_error
pydantic.v1.error_wrappers.ValidationError: 1 validation error for Generation
text
str type expected (type=type_error.str)
EDIT1: The OpenAI issue only happens when I set max results to a high number on a certain query, so it may be a specific source page that is causing this. It works well when using a different prompt or setting max pages to a lower number. I cannot check whether this holds for google genai, as I have exceeded my quota there.
EDIT2: reducing the max results and trying google genai again shows the same error as I originally posted on this thread. Here is the error:
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/json.py", line 87, in parse_result
raise OutputParserException(msg, llm_output=text) from e
langchain_core.exceptions.OutputParserException: Invalid json output: {
"matches": [
{
"name": "JSS Medical College",
"is_med_school": True,
"city": "Mysuru",
"state": "Karnataka",
"country": "India"
}
]
}
Logging the following values (__pydantic_self__.__class__, data, values, fields_set) in the BaseModel __init__ method where one of the errors happens shows this:
<class 'langchain_core.outputs.chat_generation.ChatGeneration'> {'message': AIMessage(content='{\n "matches": [\n {\n "name": "JSS Medical College",\n "is_med_school": True,\n "city": "Mysuru",\n "state": "Karnataka",\n "country": "India"\n }\n ]\n}', response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'safety_ratings': [{'category': 'HARM_CATEGORY_SEXUALLY_EXPLICIT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HATE_SPEECH', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HARASSMENT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_DANGEROUS_CONTENT', 'probability': 'NEGLIGIBLE', 'blocked': False}]}, id='run-3c18bf1d-241a-4ec3-872e-14531811602b-0', usage_metadata={'input_tokens': 277, 'output_tokens': 64, 'total_tokens': 341})} {'text': '{\n "matches": [\n {\n "name": "JSS Medical College",\n "is_med_school": True,\n "city": "Mysuru",\n "state": "Karnataka",\n "country": "India"\n }\n ]\n}', 'generation_info': None, 'type': 'ChatGeneration', 'message': AIMessage(content='{\n "matches": [\n {\n "name": "JSS Medical College",\n "is_med_school": True,\n "city": "Mysuru",\n "state": "Karnataka",\n "country": "India"\n }\n ]\n}', response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'safety_ratings': [{'category': 'HARM_CATEGORY_SEXUALLY_EXPLICIT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HATE_SPEECH', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HARASSMENT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_DANGEROUS_CONTENT', 'probability': 'NEGLIGIBLE', 'blocked': False}]}, id='run-3c18bf1d-241a-4ec3-872e-14531811602b-0', usage_metadata={'input_tokens': 277, 'output_tokens': 64, 'total_tokens': 341})} {'message'}
Looks like it is expecting the message value to be a string and it is getting AIMessage instead.
Then, looking in output_parsers/json.py from langchain, it logs these as the result[0].text strings:
Result text: \```json
{
"matches": [
{
"name": "JSS Medical College",
"is_med_school": true,
"city": "Mysuru",
"state": "Karnataka",
"country": "India"
}
]
}
\```
Result text: {
"matches": [
{
"name": "JSS Medical College",
"is_med_school": True,
"city": "Mysuru",
"state": "Karnataka",
"country": "India"
}
]
} <class 'str'>
The final version does not have the surrounding markdown ticks, but the json parser looks like it has this covered in the try/except block.
EDIT3:
Seems like "True" is not properly JSON-encoded. Maybe instead of a bool type I should use a literal of "true" or "false".
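That matches what the standard library does: valid JSON requires lowercase booleans, while Python's repr() emits title case. A quick stdlib check (the is_med_school key is taken from the output above):

```python
import json

python_style = '{"is_med_school": True}'   # what the model emitted
json_style = '{"is_med_school": true}'     # what JSON actually requires

try:
    json.loads(python_style)
except json.JSONDecodeError as exc:
    print("python-style bool rejected:", exc.msg)

print(json.loads(json_style)["is_med_school"])  # True
```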
EDIT1: OpenAI issue only happens when I set max results to a high number on a certain query, so it may be a specific source page that is causing this. Works well when using a different prompt or setting max pages to a lower number. Cannot check to see if this holds for google genai as I have exceeded my quote there.
Can you share the code you used for this? I want to dig deeper into the error.
class MatchedSchool(BaseModel):
    name: str = Field(description="The name of the input candidate medical school.")
    is_med_school: bool = Field(description="Whether or not the input school is actually a medical school or has a medical program.")
    city: Optional[str] = Field(description="The city where the matched medical school campus/program facility is located, if available.")
    state: Optional[str] = Field(description="The state (or if international, the geographic/political region within the country) where the matched medical school campus/program facility is located, if available.")
    country: Optional[str] = Field(description="The country where the matched medical school campus/program facility is located, if available.")

class Matches(BaseModel):
    matches: List[MatchedSchool]

sg = SearchGraph(
    prompt=f'Verify that "JSS Medical College - India" is a medical school and provide location information if possible.',
    config={
        "llm": {
            "api_key": os.getenv('GOOGLE_GENAI_API_KEY'),  #os.getenv('OPENAI_APIKEY'),
            "model": "google_genai/gemini-pro",  #"openai/gpt-4o-2024-08-06",
            "temperature": 0,
        },
        # "cache_path": "./scrapegraph_cache",
        "verbose": False,
        "max_results": 3,
    },
    schema=Matches
)

res = sg.run()
print(json.dumps(res, indent=2))
print(json.dumps(sg.get_execution_info(), indent=2))
It seems like reducing the max results and fixing a type hint solves this. Apparently bool types get converted to the Python representation, title-cased True or False, which causes JSON parsing to fail. Switching the typing to Literal["true", "false"] or Literal[0, 1] solves this. Reducing max_results is clearly not the answer here, though; there must be something else happening.
After all you have discovered, is the error still present in both the OpenAI and Google versions?
It is working for me in 1.19beta with OpenAI
There seem to be 2, possibly unrelated, issues here. First, boolean JSON representations throw an error because they appear in Python syntax (title case, like True instead of true). Second, there is something most likely specific either to the max_results number of pages searched, or to a specific chunk of data extracted from a specific page in those search results in my code above. These 2 errors, and not the specific error the OP started the thread with, are the issues here. I suppose this belongs in another issue/thread. Change the max_results config value to 20 and see what happens with my specific query/code above. It throws for me consistently.
hi please update to the new beta
Just updated to beta you released an hour ago. Google genai model now getting:
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/json.py", line 90, in parse_result raise OutputParserException(msg, llm_output=text) from e langchain_core.exceptions.OutputParserException: Invalid json output:
Here is the text string if failed on (escapes are mine):
\`\`\`json { "matches": [ { "name": "JSS Medical College - India", "is_med_school": true, "city": "Mysore", "state": "Karnataka", "country": "India" } ] } \`\`\`
This looks correct to me; the langchain parse_json_markdown function seems to have something to strip the markdown, but it still throws a json decode error.
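For reference, stripping the fences by hand does make the fenced string parseable, so the decode error must come from the payload itself rather than the markdown wrapper. A rough sketch of fence stripping (not langchain's actual implementation):

```python
import json
import re

raw = '```json\n{"matches": [{"is_med_school": true}]}\n```'
# Remove a leading ```json fence and a trailing ``` fence, if present:
inner = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
data = json.loads(inner)  # parses fine once the fences are gone
```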
OpenAI model still throws same error as before:
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/pydantic/v1/main.py", line 341, in __init__ raise validation_error pydantic.v1.error_wrappers.ValidationError: 1 validation error for Generation text str type expected (type=type_error.str)
EDIT1: The OpenAI issue only happens when I set max results to a high number on a certain query, so it may be a specific source page that is causing this. It works well when using a different prompt or setting max pages to a lower number. I cannot check whether this holds for google genai, as I have exceeded my quota there.
EDIT2: reducing the max results and trying google genai again shows the same error as I originally posted on this thread. Here is the error:
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/json.py", line 87, in parse_result raise OutputParserException(msg, llm_output=text) from e langchain_core.exceptions.OutputParserException: Invalid json output: { "matches": [ { "name": "JSS Medical College", "is_med_school": True, "city": "Mysuru", "state": "Karnataka", "country": "India" } ] }
Logging the following values (pydantic_self.class, data, values, fields_set) in the BaseModel class init method where one of the errors happens shows this:
<class 'langchain_core.outputs.chat_generation.ChatGeneration'> {'message': AIMessage(content='{\n "matches": [\n {\n "name": "JSS Medical College",\n "is_med_school": True,\n "city": "Mysuru",\n "state": "Karnataka",\n "country": "India"\n }\n ]\n}', response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'safety_ratings': [{'category': 'HARM_CATEGORY_SEXUALLY_EXPLICIT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HATE_SPEECH', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HARASSMENT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_DANGEROUS_CONTENT', 'probability': 'NEGLIGIBLE', 'blocked': False}]}, id='run-3c18bf1d-241a-4ec3-872e-14531811602b-0', usage_metadata={'input_tokens': 277, 'output_tokens': 64, 'total_tokens': 341})} {'text': '{\n "matches": [\n {\n "name": "JSS Medical College",\n "is_med_school": True,\n "city": "Mysuru",\n "state": "Karnataka",\n "country": "India"\n }\n ]\n}', 'generation_info': None, 'type': 'ChatGeneration', 'message': AIMessage(content='{\n "matches": [\n {\n "name": "JSS Medical College",\n "is_med_school": True,\n "city": "Mysuru",\n "state": "Karnataka",\n "country": "India"\n }\n ]\n}', response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'safety_ratings': [{'category': 'HARM_CATEGORY_SEXUALLY_EXPLICIT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HATE_SPEECH', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HARASSMENT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_DANGEROUS_CONTENT', 'probability': 'NEGLIGIBLE', 'blocked': False}]}, id='run-3c18bf1d-241a-4ec3-872e-14531811602b-0', usage_metadata={'input_tokens': 277, 'output_tokens': 64, 'total_tokens': 341})} {'message'}
Looks like it is expecting the message value to be a string and it is getting AIMessage instead.
Then looking in the output_parser/json.py from langchain, it is logging this as the result[0].text strings:
Result text: \```json { "matches": [ { "name": "JSS Medical College", "is_med_school": true, "city": "Mysuru", "state": "Karnataka", "country": "India" } ] } \``` Result text: { "matches": [ { "name": "JSS Medical College", "is_med_school": True, "city": "Mysuru", "state": "Karnataka", "country": "India" } ] } <class 'str'>
Where the final version does not have the surrounding markdown ticks, but the json parser looks like it has this covered in the try/except block.
EDIT3:
Seems like "True" is not properly JSON encoded. Maybe instead of a bool type I should use a Literal of "true" or "false".
In the last example here you can see some funky stuff happening. Looks like the original markdown wrapped JSON is correct with boolean value as "true". Then the final result is a python dict object, but a string representation (you can see the <class 'str'> after the logged text).
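One possible stopgap (purely a sketch, not what the library does): since the failing text is a valid Python literal rather than JSON, ast.literal_eval can recover it when json.loads refuses:

```python
import ast
import json

text = "{'matches': [{'name': 'JSS Medical College', 'is_med_school': True}]}"
try:
    data = json.loads(text)        # fails: single quotes, title-case True
except json.JSONDecodeError:
    data = ast.literal_eval(text)  # Python literal syntax parses fine
```

This only patches the symptom, of course; the real fix is to stop the dict from being stringified with Python repr in the first place.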
Some odd behavior being observed here. Using openai (4o and 4o-mini). I am seeing it not adhere to the pydantic model at all and no validation errors are occurring. For example:
class MatchedSchool(BaseModel):
    name: str = Field(description="The name of the input candidate medical school.")
    alternate_names: List[str] = Field(description="A list of alternate names referencing this school. Could be abbreviations, or fully spelled out names, as well as names of individual departments responsible for the Medical Curriculum within the school.")
    is_med_school: Literal["true", "false"] = Field(description="Whether or not the input school is actually a medical school or has a medical program.")
    city: Optional[str] = Field(description="The city where the matched medical school campus/program facility is located, if available.")
    state: Optional[str] = Field(description="The state (or if international, the geographic/political region within the country) where the matched medical school campus/program facility is located, if available.")
    country: Optional[str] = Field(description="The country where the matched medical school campus/program facility is located, if available.")
    source: str = Field(description="Source URL where this match was found.")

class Matches(BaseModel):
    matches: List[MatchedSchool] = Field(description="A list of matched medical schools.")
I am getting this json back with no validation errors:
{
"is_medical_school": true,
"location": {
"city": "New York",
"state": "New York",
"country": "United States"
},
"source": "https://en.wikipedia.org/w/index.php?title=New_York_College_of_Podiatric_Medicine&oldid=1219437275",
"sources": [
"https://en.wikipedia.org/wiki/New_York_College_of_Podiatric_Medicine"
]
}
Then other times, there is a validation error, but I cannot reproduce, and temp is set to 0. 🤷‍♂️
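To confirm the schema really is being bypassed, validating that returned JSON against the model by hand should raise immediately (trimmed-down model below; field names taken from the snippets above):

```python
from typing import List, Literal

from pydantic import BaseModel, Field, ValidationError

class MatchedSchool(BaseModel):
    name: str
    is_med_school: Literal["true", "false"]
    source: str

class Matches(BaseModel):
    matches: List[MatchedSchool]

# The JSON actually returned uses entirely different field names
# ("is_medical_school", "location", "sources") and lacks "matches",
# so direct validation fails:
returned = {"is_medical_school": True, "location": {"city": "New York"}}
try:
    Matches(**returned)
    bypassed = True   # would mean validation really is being skipped
except ValidationError:
    bypassed = False  # the model itself rejects this output
```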
Same setup as the last post, but with GPT-4o-mini. This time it is the original issue, in pydantic/v1/main.py:
pydantic.v1.error_wrappers.ValidationError: 1 validation error for Generation
text
str type expected (type=type_error.str)
Where pydantic validation results are:
<class 'langchain_core.outputs.generation.Generation'> {'text': Matches(matches=[MatchedSchool(name='Centro Universitário Franciscano', alternate_names=['UNIFRA'], is_med_school='true', city='Santa Maria', state='Rio Grande do Sul', country='Brazil', source='https://www.unifra.br')])} {'generation_info': None, 'type': 'Generation'} {'text'}
This time, we can see it is expecting the 'text' key to be a string but it is receiving the pydantic model instead.
Ok write me the code please
v1.19.0b8
class MatchedSchool(BaseModel):
    name: str = Field(description="The name of the input candidate medical school.")
    alternate_names: List[str] = Field(description="A list of alternate names referencing this school. Could be abbreviations, or fully spelled out names, as well as names of individual departments responsible for the Medical Curriculum within the school.")
    is_med_school: Literal["true", "false"] = Field(description="Whether or not the input school is actually a medical school or has a medical program.")
    city: Optional[str] = Field(description="The city where the matched medical school campus/program facility is located, if available.")
    state: Optional[str] = Field(description="The state (or if international, the geographic/political region within the country) where the matched medical school campus/program facility is located, if available.")
    country: Optional[str] = Field(description="The country where the matched medical school campus/program facility is located, if available.")
    source: str = Field(description="Source URL where this match was found.")

class Matches(BaseModel):
    matches: List[MatchedSchool] = Field(description="A list of matched medical schools.")
sg = SearchGraph(
    prompt=f'Search for information for the following medical school: "Centro Universitário Franciscano (UNIFRA) (2014 - 2018)"',
    config={
        "llm": {
            "api_key": os.getenv('OPENAI_APIKEY'),
            "model": "openai/gpt-4o-mini-2024-07-18",
            "temperature": 0,
        },
        # "cache_path": "./scrapegraph_cache",
        "verbose": False,
        "max_results": 3,
    },
    schema=Matches,
)
res = sg.run()
print(json.dumps(res, indent=2))
print(json.dumps(sg.get_execution_info(), indent=2))
@VinciGit00
Which version are you testing on? I'm using OpenAI to test.
I discovered a weird bug in 1.19.0-beta8 that may be connected to this issue.
In GenerateAnswerNode we have a piece of code that chooses the correct parser for the LLMs (it needs to be updated, since most of the models support this method, but that's not the issue now):
if self.node_config.get("schema", None) is not None:
    print("Schema is not None")
    if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI)):
        self.llm_model = self.llm_model.with_structured_output(
            schema=self.node_config["schema"],
            method="function_calling")  # json schema works only on specific models

        # default parser to empty function
        def output_parser(x):
            return x

        if is_basemodel_subclass(self.node_config["schema"]):
            print("Schema is a pydantic model")
            output_parser = dict
        format_instructions = "NA"
    else:
        print("llm is not Openai")
        output_parser = JsonOutputParser(pydantic_object=self.node_config["schema"])
        format_instructions = output_parser.get_format_instructions()
else:
    output_parser = JsonOutputParser()
    format_instructions = output_parser.get_format_instructions()
I added the prints
What happens here is: on the first iteration of SmartScraperGraph the LLM is correctly initialised, and I get the expected output
Schema is not None
Schema is a pydantic model
But the second time SmartScraperGraph is called, and subsequently also GenerateAnswerNode, I observe
Schema is not None
llm is not Openai
That means there is a problem with the initialisation of the llm that gets lost along the way (since the SmartScraperGraph class is initialised just once). The wrong parser is selected, and this ends up causing errors and weird stuff happening.
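One hypothesis consistent with those prints (stand-in classes below, not the real langchain ones): with_structured_output() returns a Runnable wrapper rather than a ChatOpenAI, so if the node reassigns self.llm_model on the first run, the isinstance check takes the wrong branch on every run after it:

```python
class ChatOpenAI:                     # stand-in for langchain's ChatOpenAI
    def with_structured_output(self, schema):
        return StructuredWrapper(self, schema)

class StructuredWrapper:              # stand-in for the Runnable wrapper
    def __init__(self, llm, schema):
        self.llm, self.schema = llm, schema

llm_model = ChatOpenAI()
first_run = isinstance(llm_model, ChatOpenAI)       # True: correct branch taken
llm_model = llm_model.with_structured_output(dict)  # node reassigns the model
second_run = isinstance(llm_model, ChatOpenAI)      # False: "llm is not Openai"
```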
@VinciGit00
I can add all the other models that should be able to use with_structured_output()
(although I cannot test them, I do not have keys for everyone, I'll just rely on documentation).
But the initialisation issue is rooted much deeper in the code, and it would be difficult for me to fix that.
@rjbks I'm guessing that, with some specific pydantic version being used, the "wrong parser" is able to parse all the OpenAI output, while the correct one should work for Pydantic, langchain Pydantic and dict.
For the Google error, it is possible that adding ChatGoogleGenerativeAI
here:
if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI)):
and fixing the initialisation will be enough to fix all errors.
@LorenzoPaleari
So I am using v1.19.0b8, the version you mention. It looks like it is getting the correct JSON from the LLM, parsing it into a python dict, then stringifying that dict and wrapping it in markdown. Here is one of the validation errors:
langchain_core.exceptions.OutputParserException: Invalid json output: \```json
{'matches': [{'name': 'Centro Universitário Franciscano (UNIFRA)', 'alternate_names': ['UNIFRA'], 'is_med_school': 'false', 'city': 'NA', 'state': 'NA', 'country': 'NA', 'source': 'https://unifra.academia.edu/'}]}
\```
(I escaped the first markdown tick so it wouldn't format it)
Can you tell me in which iteration this is happening?
(I removed all the asyncio code to understand better the sequence)
"verbose": True
to the config of the SearchGraph.scrapegraphai/nodes/graph_iterator_node.py
with:
"""
GraphIterator Module
"""
import asyncio
import copy
from typing import List, Optional
from tqdm.asyncio import tqdm
from ..utils.logging import get_logger
from .base_node import BaseNode
DEFAULT_BATCHSIZE = 1
class GraphIteratorNode(BaseNode):
    """
    A node responsible for instantiating and running multiple graph instances in parallel.
    It creates as many graph instances as the number of elements in the input list.

    Attributes:
        verbose (bool): A flag indicating whether to show print statements during execution.

    Args:
        input (str): Boolean expression defining the input keys needed from the state.
        output (List[str]): List of output keys to be updated in the state.
        node_config (dict): Additional configuration for the node.
        node_name (str): The unique identifier name for the node, defaulting to "Parse".
    """

    def __init__(
        self,
        input: str,
        output: List[str],
        node_config: Optional[dict] = None,
        node_name: str = "GraphIterator",
    ):
        super().__init__(node_name, "node", input, output, 2, node_config)
        self.verbose = (
            False if node_config is None else node_config.get("verbose", False)
        )

    def execute(self, state: dict) -> dict:
        """
        Executes the node's logic to instantiate and run multiple graph instances in parallel.

        Args:
            state (dict): The current state of the graph. The input keys will be used to fetch
                the correct data from the state.

        Returns:
            dict: The updated state with the output key containing the results of the graph instances.

        Raises:
            KeyError: If the input keys are not found in the state,
                indicating that the necessary information for running
                the graph instances is missing.
        """
        batchsize = self.node_config.get("batchsize", DEFAULT_BATCHSIZE)

        self.logger.info(
            f"--- Executing {self.node_name} Node with batchsize {batchsize} ---"
        )

        state = self._async_execute(state, batchsize)

        return state

    def _async_execute(self, state: dict, batchsize: int) -> dict:
        """asynchronously executes the node's logic with multiple graph instances
        running in parallel, using a semaphore of some size for concurrency regulation

        Args:
            state: The current state of the graph.
            batchsize: The maximum number of concurrent instances allowed.

        Returns:
            The updated state with the output key containing the results
            aggregated out of all parallel graph instances.

        Raises:
            KeyError: If the input keys are not found in the state.
        """
        input_keys = self.get_input_keys(state)
        input_data = [state[key] for key in input_keys]
        user_prompt = input_data[0]
        urls = input_data[1]

        graph_instance = self.node_config.get("graph_instance", None)
        if graph_instance is None:
            raise ValueError("graph instance is required for concurrent execution")

        if "graph_depth" in graph_instance.config:
            graph_instance.config["graph_depth"] += 1
        else:
            graph_instance.config["graph_depth"] = 1

        graph_instance.prompt = user_prompt

        participants = []
        for url in urls:
            instance = copy.copy(graph_instance)
            instance.source = url
            if url.startswith("http"):
                instance.input_key = "url"
            participants.append(instance)

        futures = []
        for graph in participants:
            print(f"Running graph instance for {graph.source}")
            futures.append(graph.run())

        state.update({self.output[0]: futures})

        return state
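Worth noting while debugging this: copy.copy() is a shallow copy, so every participant graph shares nested mutable state (the config dict, and presumably also the llm model) with the original instance. A minimal sketch of that sharing:

```python
import copy

class Graph:                          # stand-in for the graph instance
    def __init__(self):
        self.config = {"graph_depth": 0}

original = Graph()
clone = copy.copy(original)           # shallow: clone.config IS original.config
clone.config["graph_depth"] += 1
shared = original.config["graph_depth"]  # mutation through the clone leaks back
```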
So this time, with the same settings (openai gpt-4o-mini), I am getting the original error:
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/pydantic/v1/main.py", line 342, in __init__
raise validation_error
pydantic.v1.error_wrappers.ValidationError: 1 validation error for Generation
text
str type expected (type=type_error.str)
Where text value is a pydantic model instead of str.
Then with gemini-1.5-pro (but not flash), the previous error where it looked like a python dict stringified and wrapped in markdown (where there were single quotes):
<class 'langchain_core.outputs.chat_generation.ChatGeneration'> {'message': AIMessage(content="```json\n{'matches': [{'name': 'Centro Universitário Franciscano (UNIFRA)', 'alternate_names': ['UNIFRA'], 'is_med_school': 'NA', 'city': 'NA', 'state': 'NA', 'country': 'NA', 'source': 'https://unifra.academia.edu/'}]}\n```\n", response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'safety_ratings': [{'category': 'HARM_CATEGORY_SEXUALLY_EXPLICIT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HATE_SPEECH', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HARASSMENT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_DANGEROUS_CONTENT', 'probability': 'NEGLIGIBLE', 'blocked': False}]}, id='run-1e086766-9ec8-4569-9490-ab1d6c6559ab-0', usage_metadata={'input_tokens': 239, 'output_tokens': 76, 'total_tokens': 315})} {'text': "```json\n{'matches': [{'name': 'Centro Universitário Franciscano (UNIFRA)', 'alternate_names': ['UNIFRA'], 'is_med_school': 'NA', 'city': 'NA', 'state': 'NA', 'country': 'NA', 'source': 'https://unifra.academia.edu/'}]}\n```\n", 'generation_info': None, 'type': 'ChatGeneration', 'message': AIMessage(content="\```json\n{'matches': [{'name': 'Centro Universitário Franciscano (UNIFRA)', 'alternate_names': ['UNIFRA'], 'is_med_school': 'NA', 'city': 'NA', 'state': 'NA', 'country': 'NA', 'source': 'https://unifra.academia.edu/'}]}\n\```\n", response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'safety_ratings': [{'category': 'HARM_CATEGORY_SEXUALLY_EXPLICIT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HATE_SPEECH', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HARASSMENT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_DANGEROUS_CONTENT', 'probability': 'NEGLIGIBLE', 'blocked': 
False}]}, id='run-1e086766-9ec8-4569-9490-ab1d6c6559ab-0', usage_metadata={'input_tokens': 239, 'output_tokens': 76, 'total_tokens': 315})} {'message'}
Traceback (most recent call last):
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/json.py", line 84, in parse_result
return parse_json_markdown(text)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/utils/json.py", line 147, in parse_json_markdown
return _parse_json(json_str, parser=parser)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/utils/json.py", line 163, in _parse_json
return parser(json_str)
^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/utils/json.py", line 118, in parse_partial_json
return json.loads(s, strict=strict)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/json/__init__.py", line 359, in loads
return cls(**kw).decode(s)
^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
^^^^^^^^^^^^^^^^^^^^^^
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/rb/PycharmProjects/med_device_crawler/scripts/search_med_schools.py", line 59, in <module>
res = sg.run()
^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/search_graph.py", line 120, in run
self.final_state, self.execution_info = self.graph.execute(inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 253, in execute
return self._execute_standard(initial_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 175, in _execute_standard
raise e
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 159, in _execute_standard
result = current_node.execute(state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/nodes/merge_answers_node.py", line 102, in execute
answer = merge_chain.invoke({"user_prompt": user_prompt})
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 2878, in invoke
input = context.run(step.invoke, input, config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/base.py", line 183, in invoke
return self._call_with_config(
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 1785, in _call_with_config
context.run(
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/runnables/config.py", line 398, in call_func_with_variable_args
return func(input, **kwargs) # type: ignore[call-arg]
^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/base.py", line 184, in <lambda>
lambda inner_input: self.parse_result(
^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/json.py", line 87, in parse_result
raise OutputParserException(msg, llm_output=text) from e
langchain_core.exceptions.OutputParserException: Invalid json output: \```json
{'matches': [{'name': 'Centro Universitário Franciscano (UNIFRA)', 'alternate_names': ['UNIFRA'], 'is_med_school': 'NA', 'city': 'NA', 'state': 'NA', 'country': 'NA', 'source': 'https://unifra.academia.edu/'}]}
\```
You can see an earlier iteration with correct formatting:
{'message': AIMessage(content='\```json\n{"matches": [{"name": "Centro Universitário Franciscano (UNIFRA)", "alternate_names": ["UNIFRA"], "is_med_school": "NA", "city": "NA", "state": "NA", "country": "NA", "source": "https://unifra.academia.edu/"}]}\n\```', response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'safety_ratings': [{'category': 'HARM_CATEGORY_SEXUALLY_EXPLICIT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HATE_SPEECH', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HARASSMENT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_DANGEROUS_CONTENT', 'probability': 'NEGLIGIBLE', 'blocked': False}]}, id='run-b0038c3f-b4d4-4d32-91bb-db9b3ecf1f1d-0', usage_metadata={'input_tokens': 8087, 'output_tokens': 76, 'total_tokens': 8163})} {'text': '\```json\n{"matches": [{"name": "Centro Universitário Franciscano (UNIFRA)", "alternate_names": ["UNIFRA"], "is_med_school": "NA", "city": "NA", "state": "NA", "country": "NA", "source": "https://unifra.academia.edu/"}]}\n\```', 'generation_info': None, 'type': 'ChatGeneration', 'message': AIMessage(content='\```json\n{"matches": [{"name": "Centro Universitário Franciscano (UNIFRA)", "alternate_names": ["UNIFRA"], "is_med_school": "NA", "city": "NA", "state": "NA", "country": "NA", "source": "https://unifra.academia.edu/"}]}\n\```', response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'safety_ratings': [{'category': 'HARM_CATEGORY_SEXUALLY_EXPLICIT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HATE_SPEECH', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HARASSMENT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_DANGEROUS_CONTENT', 'probability': 'NEGLIGIBLE', 'blocked': False}]}, id='run-b0038c3f-b4d4-4d32-91bb-db9b3ecf1f1d-0', 
usage_metadata={'input_tokens': 8087, 'output_tokens': 76, 'total_tokens': 8163})}
Then later that gets turned into a python dict then stringified from python into markdown
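That round-trip is easy to reproduce: formatting the parsed dict back into a fenced block with str() (or an f-string) instead of json.dumps() yields exactly this single-quoted, unparseable text:

```python
import json

parsed = {"matches": [{"is_med_school": "NA"}]}
bad = f"```json\n{parsed}\n```"           # interpolates Python repr: single quotes
good = f"```json\n{json.dumps(parsed)}\n```"  # valid JSON inside the fences
```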
@LorenzoPaleari
Output from above with gemini 1.5 pro (openai 4o mini still causes original error where expects string but gets pydantic model):
Model google_genai/gemini-1.5-pro not found,
using default token size (8192)
Model google_genai/gemini-1.5-pro not found,
using default token size (8192)
--- Executing SearchInternet Node ---
Search Query: "Centro Universitário Franciscano (UNIFRA)" medical school
--- Executing GraphIterator Node with batchsize 1 ---
Running graph instance for https://www.nanoorbit.com/nanotechnology-companies/centro-universitario-franciscano-unifra.html
--- Executing Fetch Node ---
--- (Fetching HTML from: https://www.nanoorbit.com/nanotechnology-companies/centro-universitario-franciscano-unifra.html) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
Running graph instance for https://unifra.academia.edu/Departments/Potosi/Documents
--- Executing Fetch Node ---
--- (Fetching HTML from: https://unifra.academia.edu/Departments/Potosi/Documents) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
Running graph instance for https://www.scielo.br/j/ean/a/JFLxF47FBSykSPfJGY8DBgj/?format=pdf&lang=en
--- Executing Fetch Node ---
--- (Fetching HTML from: https://www.scielo.br/j/ean/a/JFLxF47FBSykSPfJGY8DBgj/?format=pdf&lang=en) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
--- Executing MergeAnswers Node ---
Traceback (most recent call last):
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/json.py", line 84, in parse_result
return parse_json_markdown(text)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/utils/json.py", line 147, in parse_json_markdown
return _parse_json(json_str, parser=parser)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/utils/json.py", line 163, in _parse_json
return parser(json_str)
^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/utils/json.py", line 118, in parse_partial_json
return json.loads(s, strict=strict)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/json/__init__.py", line 359, in loads
return cls(**kw).decode(s)
^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
^^^^^^^^^^^^^^^^^^^^^^
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/rb/PycharmProjects/med_device_crawler/scripts/search_med_schools.py", line 59, in <module>
res = sg.run()
^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/search_graph.py", line 120, in run
self.final_state, self.execution_info = self.graph.execute(inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 253, in execute
return self._execute_standard(initial_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 175, in _execute_standard
raise e
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 159, in _execute_standard
result = current_node.execute(state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/nodes/merge_answers_node.py", line 102, in execute
answer = merge_chain.invoke({"user_prompt": user_prompt})
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 2878, in invoke
input = context.run(step.invoke, input, config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/base.py", line 183, in invoke
return self._call_with_config(
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 1785, in _call_with_config
context.run(
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/runnables/config.py", line 398, in call_func_with_variable_args
return func(input, **kwargs) # type: ignore[call-arg]
^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/base.py", line 184, in <lambda>
lambda inner_input: self.parse_result(
^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/json.py", line 87, in parse_result
raise OutputParserException(msg, llm_output=text) from e
langchain_core.exceptions.OutputParserException: Invalid json output: ```json
{'matches': [{'name': 'Centro Universitário Franciscano (UNIFRA)', 'alternate_names': ['UNIFRA'], 'is_med_school': 'NA', 'city': 'NA', 'state': 'NA', 'country': 'NA', 'source': 'https://unifra.academia.edu/'}]}
\```
@rjbks
Thank you! Can you also share full output of gpt-4o-mini?
Describe the bug: Since version 1.14.0 (tested here on 1.15.0), SmartScraperGraph on OpenAI stopped working with a pydantic-related error:
To Reproduce: Run the example: https://github.com/ScrapeGraphAI/Scrapegraph-ai/blob/main/examples/openai/smart_scraper_schema_openai.py
I also tried to update to the latest langchain, but that didn't help. It could be langchain-related, as I can see similar issues (but not exactly the same) on their issues list. Or it's just a mix of pydantic versions.
Expected behavior: It should succeed.