ScrapeGraphAI / Scrapegraph-ai

Python scraper based on AI
https://scrapegraphai.com
MIT License

[1.14.0+] pydantic ValidationError with SmartScraperGraph #598

Closed: bezineb5 closed this issue 6 days ago

bezineb5 commented 3 weeks ago

Describe the bug
Since version 1.14.0 (tested here on 1.15.0), SmartScraperGraph on OpenAI stopped working with a pydantic-related error:

/venv/lib/python3.12/site-packages/google_crc32c/__init__.py:29: RuntimeWarning: As the c extension couldn't be imported, `google-crc32c` is using a pure python implementation that is significantly slower. If possible, please configure a c build environment and compile the extension
  warnings.warn(_SLOW_CRC32C_WARNING, RuntimeWarning)
--- Executing Fetch Node ---
--- (Fetching HTML from: https://perinim.github.io/projects/) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
Traceback (most recent call last):
  File "/llm_scraper/test_bug.py", line 50, in <module>                                                                                                        result = smart_scraper_graph.run()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 114, in run                                                      self.final_state, self.execution_info = self.graph.execute(inputs)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 263, in execute                                                           return self._execute_standard(initial_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 185, in _execute_standard                                                 raise e
  File "/venv/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 169, in _execute_standard                                                 result = current_node.execute(state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/scrapegraphai/nodes/generate_answer_node.py", line 129, in execute                                                  answer = chain.invoke({"question": user_prompt})
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 2878, in invoke                                                             input = context.run(step.invoke, input, config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/langchain_core/output_parsers/base.py", line 192, in invoke                                                         return self._call_with_config(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 1785, in _call_with_config                                                  context.run(
  File "/venv/lib/python3.12/site-packages/langchain_core/runnables/config.py", line 397, in call_func_with_variable_args                                      return func(input, **kwargs)  # type: ignore[call-arg]
           ^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/langchain_core/output_parsers/base.py", line 193, in <lambda>                                                       lambda inner_input: self.parse_result([Generation(text=inner_input)]),
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/langchain_core/load/serializable.py", line 113, in __init__                                                         super().__init__(*args, **kwargs)
  File "/venv/lib/python3.12/site-packages/pydantic/v1/main.py", line 341, in __init__                                                                         raise validation_error
pydantic.v1.error_wrappers.ValidationError: 1 validation error for Generation
text
  str type expected (type=type_error.str)

To Reproduce
Run the example: https://github.com/ScrapeGraphAI/Scrapegraph-ai/blob/main/examples/openai/smart_scraper_schema_openai.py

I also tried to update to the latest langchain, but that didn't help. It could be langchain-related, as I can see similar issues (but not exactly the same) on their issues list. Or it's just a mix of pydantic versions.

Expected behavior
It should succeed.

f-aguzzi commented 3 weeks ago

Thanks for reporting this. I've had similar issues myself while working on a side project.

I managed to fix my problem by tinkering with the system prompt. I'll see if I can do the same here, and let you know.

VinciGit00 commented 3 weeks ago

can you share the code please? are you using a schema for the input?

bezineb5 commented 3 weeks ago

I just used the example: https://github.com/ScrapeGraphAI/Scrapegraph-ai/blob/main/examples/openai/smart_scraper_schema_openai.py

In my own code, I'm only using a schema for the output, nothing special for the input:

SmartScraperGraph(
        prompt="This page contains a list of items, return the urls of the individual detail pages.",
        source=url,
        config=base_graph_config,
        schema=bronze.UrlsList,
    )
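
For reference, a minimal output schema along the lines of bronze.UrlsList might look like the sketch below (the real model isn't shown in the thread, so the field is illustrative):

from typing import List

from pydantic import BaseModel, Field

class UrlsList(BaseModel):
    # Hypothetical field: the actual bronze.UrlsList definition is not in the thread
    urls: List[str] = Field(description="URLs of the individual detail pages")
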
VinciGit00 commented 3 weeks ago

can you use langchain pydantic (from langchain_core.pydantic_v1 import BaseModel, Field, validator) instead of pydantic?

bezineb5 commented 3 weeks ago

I get a similar error:

--- Executing GenerateAnswer Node ---
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/llm_scraper/scrape.py", line 182, in <module>
    main()
  File "/llm_scraper/scrape.py", line 177, in main
    for url in _iterate_provider_list(provider_id, base_graph_config):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/llm_scraper/scrape.py", line 125, in _iterate_provider_list
    result = script_creator_graph.run()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 114, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 263, in execute
    return self._execute_standard(initial_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 185, in _execute_standard
    raise e
  File "/venv/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 169, in _execute_standard
    result = current_node.execute(state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/scrapegraphai/nodes/generate_answer_node.py", line 129, in execute
    answer = chain.invoke({"question": user_prompt})
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 2878, in invoke
    input = context.run(step.invoke, input, config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 5092, in invoke
    return self.bound.invoke(
           ^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/langchain_core/language_models/chat_models.py", line 276, in invoke
    self.generate_prompt(
  File "/venv/lib/python3.12/site-packages/langchain_core/language_models/chat_models.py", line 776, in generate_prompt
    return self.generate(prompt_messages, stop=stop, callbacks=callbacks, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/langchain_core/language_models/chat_models.py", line 633, in generate
    raise e
  File "/venv/lib/python3.12/site-packages/langchain_core/language_models/chat_models.py", line 623, in generate
    self._generate_with_cache(
  File "/venv/lib/python3.12/site-packages/langchain_core/language_models/chat_models.py", line 845, in _generate_with_cache
    result = self._generate(
             ^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/langchain_openai/chat_models/base.py", line 629, in _generate
    response = self.root_client.beta.chat.completions.parse(**payload)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/openai/resources/beta/chat/completions.py", line 118, in parse
    response_format=_type_to_response_format(response_format),
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/openai/lib/_parsing/_completions.py", line 245, in type_to_response_format_param
    raise TypeError(f"Unsupported response_format type - {response_format}")
TypeError: Unsupported response_format type - <class 'llm_scraper.models.bronze.UrlsList'>

Did you manage to reproduce using the example?

VinciGit00 commented 3 weeks ago

can I have the code?

VinciGit00 commented 3 weeks ago

please try to use from langchain_core.pydantic_v1 import BaseModel, Field, validator instead of from pydantic import BaseModel, Field

bezineb5 commented 3 weeks ago

please try to use from langchain_core.pydantic_v1 import BaseModel, Field, validator instead of from pydantic import BaseModel, Field

I did, the result is here: https://github.com/ScrapeGraphAI/Scrapegraph-ai/issues/598#issuecomment-2318667354

bezineb5 commented 3 weeks ago

can I have the code?

Yes, it is available here: https://github.com/ScrapeGraphAI/Scrapegraph-ai/blob/main/examples/openai/smart_scraper_schema_openai.py

bezineb5 commented 3 weeks ago

It really looks like a langchain issue, or a misconfiguration. Did anything change in the chain/output parsing?

VinciGit00 commented 3 weeks ago

https://github.com/ScrapeGraphAI/Scrapegraph-ai/blob/pre/beta/examples/openai/smart_scraper_schema_openai.py

VinciGit00 commented 3 weeks ago

We changed to langchain pydantic

bezineb5 commented 3 weeks ago

Yeah, but langchain is fine with pydantic v2, and I tested pydantic v1 without success.

However, I investigated and found that there is an issue in GenerateAnswerNode: both self.llm_model.with_structured_output and JsonOutputParser are used, but with_structured_output already adds an output parser: https://github.com/langchain-ai/langchain/blob/fabd3295fabb4c79fedb4dbbe725a308658ef8d8/libs/partners/openai/langchain_openai/chat_models/base.py#L1414C25-L1414C36

So it's effectively trying to parse a Pydantic object.

So you can either ask the llm class to return a structured output: https://github.com/ScrapeGraphAI/Scrapegraph-ai/blame/a96617d6f88f7370ecb7c58d0a62d3bdc0d80b31/scrapegraphai/nodes/generate_answer_node.py#L103 Or provide an output parser: https://github.com/ScrapeGraphAI/Scrapegraph-ai/blame/a96617d6f88f7370ecb7c58d0a62d3bdc0d80b31/scrapegraphai/nodes/generate_answer_node.py#L133C47-L133C47

(side note: it seems that json_schema mode is only implemented for openai chats, not for mistral or others - but langchain will ignore it silently: https://github.com/ScrapeGraphAI/Scrapegraph-ai/blame/a96617d6f88f7370ecb7c58d0a62d3bdc0d80b31/scrapegraphai/nodes/generate_answer_node.py#L100)
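To illustrate the conflict described above (a minimal sketch, not the node's actual code): with_structured_output already returns parsed objects, so piping a JsonOutputParser after it hands the parser a Pydantic instance instead of text:

from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel

class UrlsList(BaseModel):  # illustrative schema
    urls: list = []

prompt = PromptTemplate.from_template("Extract the URLs: {context}")
llm = ChatOpenAI(model="gpt-4o-mini").with_structured_output(schema=UrlsList)

# llm.invoke(...) now yields a UrlsList instance, not a string/AIMessage, so the
# extra parser receives a non-string and Generation(text=...) validation fails:
chain = prompt | llm | JsonOutputParser()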

bezineb5 commented 3 weeks ago

I confirm that commenting out the block here https://github.com/ScrapeGraphAI/Scrapegraph-ai/blame/a96617d6f88f7370ecb7c58d0a62d3bdc0d80b31/scrapegraphai/nodes/generate_answer_node.py#L103 fixed the issue

lucidlogic commented 2 weeks ago

I can confirm that using from langchain_core.pydantic_v1 import BaseModel, Field, validator and running the above example https://github.com/ScrapeGraphAI/Scrapegraph-ai/blob/pre/beta/examples/openai/smart_scraper_schema_openai.py results in:

--- Executing Fetch Node ---
--- (Fetching HTML from: https://perinim.github.io/projects/) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
Traceback (most recent call last):
  File "/Users/gareth/Code/Python/citify/test.py", line 48, in <module>
    result = smart_scraper_graph.run()
  File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 114, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
  File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/scrapegraphai/graphs/base_graph.py", line 263, in execute
    return self._execute_standard(initial_state)
  File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/scrapegraphai/graphs/base_graph.py", line 185, in _execute_standard
    raise e
  File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/scrapegraphai/graphs/base_graph.py", line 169, in _execute_standard
    result = current_node.execute(state)
  File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/scrapegraphai/nodes/generate_answer_node.py", line 134, in execute
    answer = chain.invoke({"question": user_prompt})
  File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/langchain_core/runnables/base.py", line 2878, in invoke
    input = context.run(step.invoke, input, config)
  File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/langchain_core/runnables/base.py", line 5092, in invoke
    return self.bound.invoke(
  File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/langchain_core/language_models/chat_models.py", line 276, in invoke
    self.generate_prompt(
  File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/langchain_core/language_models/chat_models.py", line 776, in generate_prompt
    return self.generate(prompt_messages, stop=stop, callbacks=callbacks, **kwargs)
  File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/langchain_core/language_models/chat_models.py", line 633, in generate
    raise e
  File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/langchain_core/language_models/chat_models.py", line 623, in generate
    self._generate_with_cache(
  File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/langchain_core/language_models/chat_models.py", line 845, in _generate_with_cache
    result = self._generate(
  File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/langchain_openai/chat_models/base.py", line 629, in _generate
    response = self.root_client.beta.chat.completions.parse(**payload)
  File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/openai/resources/beta/chat/completions.py", line 118, in parse
    response_format=_type_to_response_format(response_format),
  File "/Users/gareth/.pyenv/versions/3.9.16/lib/python3.9/site-packages/openai/lib/_parsing/_completions.py", line 255, in type_to_response_format_param
    raise TypeError(f"Unsupported response_format type - {response_format}")
TypeError: Unsupported response_format type - <class '__main__.Projects'>

VinciGit00 commented 2 weeks ago

hi, we will fix it in the next few hours

LorenzoPaleari commented 2 weeks ago

Version 1.16.0b2 is still broken.

There are two possible fixes for this error; I tried them locally and they both work.

FIX 1 GenerateAnswerNode - Line 89

 if self.node_config.get("schema", None) is not None:

            # if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI)):
            #     self.llm_model = self.llm_model.with_structured_output(
            #         schema = self.node_config["schema"],
            #         method="json_schema")
            # else: 
            output_parser = JsonOutputParser(pydantic_object=self.node_config["schema"])

        else:
            output_parser = JsonOutputParser()

The fix is commenting out the with_structured_output part. With this part commented out, I tested smart_scraper_schema_openai both with pydantic and langchain_core.pydantic_v1, and it works.

FIX 2 GenerateAnswerNode L89

 if self.node_config.get("schema", None) is not None:

            if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI)):
                self.llm_model = self.llm_model.with_structured_output(
                    schema = self.node_config["schema"],
                    method="json_schema")
            else: 
                output_parser = JsonOutputParser(pydantic_object=self.node_config["schema"])
                format_instructions = output_parser.get_format_instructions()

        else:
            output_parser = JsonOutputParser()
            format_instructions = output_parser.get_format_instructions()

Firstly, format instructions are not necessary anymore when using with_structured_output, so the instruction must be moved inside the else branch.

Then every invocation of the llm should be modified as follows:

prompt = PromptTemplate(
                template=template_no_chunks_prompt ,
                input_variables=["question"],
                partial_variables={"context": doc})
chain =  prompt | self.llm_model # | output_parser
answer = chain.invoke({"question": user_prompt})

This removes the output parser and the format instructions.

With this solution there are 3 problems:

  • Doubling all the prompts to have versions with and without format instructions
  • It needs a custom parser at the end to use the answer; langchain's output parsers cannot be used with this mode, so the answer is a well-formatted string that is missing all the JSON annotation:
projects=[Project(title='Rotary Pendulum RL', description='Open Source project aimed at controlling a real life rotary pendulum using RL algorithms.'), Project(title='DQN Implementation from scratch', description='Developed a Deep Q-Network algorithm to train a simple and double pendulum.'), Project(title='Multi Agents HAED', description='University project which focuses on simulating a multi-agent system to perform environment mapping. Agents, equipped with sensors, explore and record their surroundings, considering uncertainties in their readings.'), Project(title='Wireless ESC for Modular Drones', description='Modular drone architecture proposal and proof of concept. The project received maximum grade.')]
  • It only works when creating the schema with pydantic; langchain_core.pydantic_v1 does not work here.
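
For the second problem, a minimal custom parser (a sketch, assuming the chain now yields a Pydantic instance as in the FIX 2 snippet above) can be as small as:

def pydantic_to_dict(answer):
    # with_structured_output returns a Pydantic model instance; .dict()
    # (pydantic v1-style API) recovers plain JSON-serialisable data
    return answer.dict() if hasattr(answer, "dict") else answer

# plain callables piped into a chain are coerced to RunnableLambda
chain = prompt | self.llm_model | pydantic_to_dict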

VinciGit00 commented 2 weeks ago

Hi please update to the new version

byt3bl33d3r commented 2 weeks ago

Can confirm the same error persists in 1.16.0:

Traceback (most recent call last):
  File "/workspaces/Marsala/marsala.py", line 47, in <module>
    result = smart_scraper_graph.run()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 114, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 263, in execute
    return self._execute_standard(initial_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 184, in _execute_standard
    raise e
  File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 168, in _execute_standard
    result = current_node.execute(state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/scrapegraphai/nodes/generate_answer_node.py", line 134, in execute
    answer = chain.invoke({"question": user_prompt})
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 2878, in invoke
    input = context.run(step.invoke, input, config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 5092, in invoke
    return self.bound.invoke(
           ^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/langchain_core/language_models/chat_models.py", line 277, in invoke
    self.generate_prompt(
  File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/langchain_core/language_models/chat_models.py", line 777, in generate_prompt
    return self.generate(prompt_messages, stop=stop, callbacks=callbacks, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/langchain_core/language_models/chat_models.py", line 634, in generate
    raise e
  File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/langchain_core/language_models/chat_models.py", line 624, in generate
    self._generate_with_cache(
  File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/langchain_core/language_models/chat_models.py", line 846, in _generate_with_cache
    result = self._generate(
             ^^^^^^^^^^^^^^^
  File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/langchain_openai/chat_models/base.py", line 652, in _generate
    response = self.root_client.beta.chat.completions.parse(**payload)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/openai/resources/beta/chat/completions.py", line 118, in parse
    response_format=_type_to_response_format(response_format),
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.cache/pypoetry/virtualenvs/marsala-7QkPqEIU-py3.12/lib/python3.12/site-packages/openai/lib/_parsing/_completions.py", line 255, in type_to_response_format_param
    raise TypeError(f"Unsupported response_format type - {response_format}")
TypeError: Unsupported response_format type - <class '__main__.Companies'>

Code:

import scrapegraphai
import json
from typing import List
#from pydantic import BaseModel, Field
from langchain_core.pydantic_v1 import BaseModel, Field
from pydantic import SecretStr
from pydantic_settings import BaseSettings, SettingsConfigDict
from scrapegraphai.utils import prettify_exec_info
from scrapegraphai.graphs import SmartScraperGraph, OmniScraperGraph

class Settings(BaseSettings):
    openai_api_key: SecretStr

    model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8')

class Company(BaseModel):
    company: str = Field(description="Company name")
    description: str = Field(description="Company description")
    email: str = Field(description="Company email")

class Companies(BaseModel):
    companies: List[Company]

if __name__ == "__main__":
    # Define the configuration for the scraping pipeline
    settings = Settings()

    graph_config = {
        "llm": {
            "api_key": settings.openai_api_key.get_secret_value(),
            "model": "openai/gpt-4o-mini",
        },
        "verbose": True,
        "headless": True,
    }

    # Create the SmartScraperGraph instance
    smart_scraper_graph = SmartScraperGraph(
        prompt="Find some information about what the company does, the name and a contact email.",
        source="https://scrapegraphai.com/",
        schema=Companies,
        config=graph_config
    )

    # Run the pipeline
    result = smart_scraper_graph.run()
    print(result)
    #print(json.dumps(result, indent=4))

    graph_exec_info = smart_scraper_graph.get_execution_info()
    print(prettify_exec_info(graph_exec_info))

LorenzoPaleari commented 2 weeks ago

Version 1.16.0b2 is still broken. There are two possible fixes for this error […]

Hi, I wanted to let you know that this comment is still valid for version 1.17.0, the latest one on branch pre/beta.

VinciGit00 commented 2 weeks ago

I would like the second one, can you implement it please?

LorenzoPaleari commented 2 weeks ago

I'm implementing it.

VinciGit00 commented 2 weeks ago

Ok thx. Is it ok now?

LorenzoPaleari commented 2 weeks ago

I tested it on my end with and without schema and it works. The schema can be implemented with pydantic, langchain_core.pydantic_v1, or TypedDict.

VinciGit00 commented 2 weeks ago

Perfect, so I will close the issue. If someone still has problems I will update it.

manu2022 commented 2 weeks ago

Hi @VinciGit00, the issue persists

LorenzoPaleari commented 2 weeks ago

Hi, did you test with the 1.17.0b5 version?

manu2022 commented 2 weeks ago

Hi, did you test with the 1.17.0b5 version?

I was using 1.16. It is working fine on 1.17.0b5, thanks!

LorenzoPaleari commented 2 weeks ago

I'm happy to hear that

rjbks commented 1 week ago

This issue persists in 1.18.1

Relevant packages: langchain-openai==0.1.23 openai==1.44.1 pydantic==2.9.1 pydantic_core==2.23.3 scrapegraphai==1.18.1

Error:

File /opt/anaconda3/envs/med_device/lib/python3.12/site-packages/pydantic/v1/main.py:341, in BaseModel.__init__(__pydantic_self__, **data)
    339 values, fields_set, validation_error = validate_model(__pydantic_self__.__class__, data)
    340 if validation_error:
--> 341     raise validation_error
    342 try:
    343     object_setattr(__pydantic_self__, '__dict__', values)

ValidationError: 1 validation error for Generation
text
  str type expected (type=type_error.str)

Using from langchain_core.pydantic_v1 import BaseModel, Field raises an error from the openai module. Some digging there shows the openai client no longer supports older versions of pydantic.

EDIT1: The problem seems to arise from langchain_core/load/serializable.py:113, where it initializes the 'langchain_core.outputs.generation.Generation' pydantic model with {'text': MyScrapeGraphAIPydanticModel(...)}; this call in the BaseModel __init__ function (pydantic/v1/main.py:339): values, fields_set, validation_error = validate_model(__pydantic_self__.__class__, data) yields: {'generation_info': None, 'type': 'Generation'}, {'text'}, Error

Except the 'text' field is expected to be a str, but it is my pydantic model passed in as the schema arg.
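That failure can be reproduced in isolation (a sketch; MyModel stands in for the user-supplied schema):

from langchain_core.outputs import Generation
from pydantic import BaseModel

class MyModel(BaseModel):  # stand-in for the schema passed to SmartScraperGraph
    urls: list = []

# Generation.text is declared as str, so passing a model instance raises
# pydantic.v1.error_wrappers.ValidationError: "str type expected"
Generation(text=MyModel())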

EDIT2

Seems to be fixed in v1.19.0-beta.2

LorenzoPaleari commented 1 week ago

v1.18.1 does not contain the fixed version.

It is present in v1.19.0-beta.1+. If it works there, it should be fine.

rjbks commented 1 week ago

v1.18.1 does not contain the fixed version.

It is present in v1.19.0-beta.1+. If it works there, it should be fine.

While it seems solved in JSONScraperGraph with v1.19.0-beta.1+, SearchGraph (and presumably the underlying SmartScraperGraph) still has this issue. Same setup and error as my previous post above, except updated to scrapegraphai-1.19.0b7.

EDIT1:

Just tried the SearchGraph example and Google models do not work; I used the code from the examples: https://github.com/ScrapeGraphAI/Scrapegraph-ai/blob/main/examples/google_genai/search_graph_schema_gemini.py. It returns this error:

File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/utils/tokenizer.py", line 26, in num_tokens_calculus raise NotImplementedError(f"There is no tokenization implementation for model '{llm_model}'")

While this works for JSONScraperGraph, it does not work for SearchGraph.

VinciGit00 commented 1 week ago

hi please update to the new beta

rjbks commented 1 week ago

hi please update to the new beta

Just updated to the beta you released an hour ago. The Google genai model is now getting:

File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/json.py", line 90, in parse_result
    raise OutputParserException(msg, llm_output=text) from e
langchain_core.exceptions.OutputParserException: Invalid json output: 

Here is the text string it failed on (escapes are mine):

\`\`\`json
{
 "matches": [
  {
   "name": "JSS Medical College - India",
   "is_med_school": true,
   "city": "Mysore",
   "state": "Karnataka",
   "country": "India"
  }
 ]
}
\`\`\`

This looks correct to me; the langchain parse_json_markdown function seems to have logic to strip the markdown, but it still throws a JSON decode error.

OpenAI model still throws same error as before:

File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/pydantic/v1/main.py", line 341, in __init__
    raise validation_error
pydantic.v1.error_wrappers.ValidationError: 1 validation error for Generation
text
  str type expected (type=type_error.str)

EDIT1: The OpenAI issue only happens when I set max results to a high number on a certain query, so it may be a specific source page that is causing this. It works well when using a different prompt or setting max pages to a lower number. I cannot check whether this holds for google genai as I have exceeded my quota there.

EDIT2: reducing the max results and trying google genai again shows the same error as I originally posted on this thread. Here is the error:

 File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/json.py", line 87, in parse_result
    raise OutputParserException(msg, llm_output=text) from e
langchain_core.exceptions.OutputParserException: Invalid json output: {
 "matches": [
  {
   "name": "JSS Medical College",
   "is_med_school": True,
   "city": "Mysuru",
   "state": "Karnataka",
   "country": "India"
  }
 ]
}

Logging the following values (__pydantic_self__.__class__, data, values, fields_set) in the BaseModel __init__ method where one of the errors happens shows this:

<class 'langchain_core.outputs.chat_generation.ChatGeneration'> {'message': AIMessage(content='{\n "matches": [\n  {\n   "name": "JSS Medical College",\n   "is_med_school": True,\n   "city": "Mysuru",\n   "state": "Karnataka",\n   "country": "India"\n  }\n ]\n}', response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'safety_ratings': [{'category': 'HARM_CATEGORY_SEXUALLY_EXPLICIT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HATE_SPEECH', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HARASSMENT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_DANGEROUS_CONTENT', 'probability': 'NEGLIGIBLE', 'blocked': False}]}, id='run-3c18bf1d-241a-4ec3-872e-14531811602b-0', usage_metadata={'input_tokens': 277, 'output_tokens': 64, 'total_tokens': 341})} {'text': '{\n "matches": [\n  {\n   "name": "JSS Medical College",\n   "is_med_school": True,\n   "city": "Mysuru",\n   "state": "Karnataka",\n   "country": "India"\n  }\n ]\n}', 'generation_info': None, 'type': 'ChatGeneration', 'message': AIMessage(content='{\n "matches": [\n  {\n   "name": "JSS Medical College",\n   "is_med_school": True,\n   "city": "Mysuru",\n   "state": "Karnataka",\n   "country": "India"\n  }\n ]\n}', response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'safety_ratings': [{'category': 'HARM_CATEGORY_SEXUALLY_EXPLICIT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HATE_SPEECH', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HARASSMENT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_DANGEROUS_CONTENT', 'probability': 'NEGLIGIBLE', 'blocked': False}]}, id='run-3c18bf1d-241a-4ec3-872e-14531811602b-0', usage_metadata={'input_tokens': 277, 'output_tokens': 64, 'total_tokens': 341})} {'message'}

Looks like it is expecting the message value to be a string and it is getting AIMessage instead.

Then looking in the output_parser/json.py from langchain, it is logging this as the result[0].text strings:

Result text:  \```json
{
 "matches": [
  {
   "name": "JSS Medical College",
   "is_med_school": true,
   "city": "Mysuru",
   "state": "Karnataka",
   "country": "India"
  }
 ]
}
\```
Result text:  {
 "matches": [
  {
   "name": "JSS Medical College",
   "is_med_school": True,
   "city": "Mysuru",
   "state": "Karnataka",
   "country": "India"
  }
 ]
} <class 'str'>

Where the final version does not have the surrounding markdown ticks, but the json parser looks like it has this covered in the try/except block.

EDIT3:

Seems like "True" is not properly json encoded. Maybe instead of a bool type I should use a literal of "true" or "false"

LorenzoPaleari commented 1 week ago

EDIT1: The OpenAI issue only happens when I set max results to a high number on a certain query, so it may be a specific source page that is causing this. It works well when using a different prompt or setting max pages to a lower number. I cannot check whether this holds for google genai as I have exceeded my quota there.

Can you share the code you used for this? I want to dig deeper into the error.

rjbks commented 1 week ago

EDIT1: The OpenAI issue only happens when I set max results to a high number on a certain query, so it may be a specific source page that is causing this. It works well when using a different prompt or setting max pages to a lower number. I cannot check whether this holds for google genai as I have exceeded my quota there.

Can you share the code you used for this? I want to dig deeper into the error.

class MatchedSchool(BaseModel):
    name: str = Field(description="The name of the input candidate medical school.")
    is_med_school: bool = Field(description="Whether or not the input school is actually a medical school or has a medical program.")
    city: Optional[str] = Field(description="The city where the matched medical school campus/program facility is located, if available.")
    state: Optional[str] = Field(description="The state (or if international, the geographic/political region within the country) where the matched medical school campus/program facility is located, if available.")
    country: Optional[str] = Field(description="The country where the matched medical school campus/program facility is located, if available.")

class Matches(BaseModel):
    matches: List[MatchedSchool]

sg = SearchGraph(
    prompt=f'Verify that "JSS Medical College - India" is a medical school and provide location information if possible.',
    config={
        "llm": {
            "api_key": os.getenv('GOOGLE_GENAI_API_KEY'), #os.getenv('OPENAI_APIKEY'),
            "model": "google_genai/gemini-pro", #"openai/gpt-4o-2024-08-06",
            "temperature": 0,
        },
        # "cache_path": "./scrapegraph_cache",
        "verbose": False,
        "max_results": 3,
    },
    schema=Matches
)

res = sg.run()

print(json.dumps(res, indent=2))
print(json.dumps(sg.get_execution_info(), indent=2))

Seems like reducing the max results and fixing a type hint solves this. Apparently bool values get converted to the Python-style title-cased True or False, which causes JSON parsing to fail.

Switching the typing to Literal["true", "false"] or Literal[0, 1] solves this. Reducing max_results is clearly not the answer here, though; there must be something else happening.

LorenzoPaleari commented 1 week ago

After all you have discovered, is the error still present in both the OpenAI and Google versions?

manu2022 commented 1 week ago

After all you have discovered, is the error still present in both the OpenAI and Google versions?

It is working for me in 1.19beta with OpenAI

rjbks commented 1 week ago

After all you have discovered, is the error still present in both the OpenAI and Google versions?

There seem to be 2, possibly unrelated, issues here. First, boolean JSON representations throw an error because they are rendered in Python syntax (title-cased True instead of true). Second, something most likely specific to either the max_results number of pages searched, or to a specific chunk of data extracted from a specific page in those search results in my code above. These 2 errors, and not the specific error the OP started the thread with, are the issues here. I suppose this belongs in another issue/thread. Change the max_results config value to 20 and see what happens with my specific query/code above; it throws for me consistently.

rjbks commented 1 week ago

hi please update to the new beta

Just updated to the beta you released an hour ago. […]

In the last example of my comment above (the two "Result text" dumps) you can see some funky stuff happening. The original markdown-wrapped JSON is correct, with the boolean value as "true". But the final result is a Python dict object as a string representation (you can see the <class 'str'> after the logged text).
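
The single quotes are the tell: str() applied to a Python dict is not valid JSON. A quick sketch of that failure mode:

import json

s = str({"matches": [{"name": "UNIFRA"}]})
print(s)       # {'matches': [{'name': 'UNIFRA'}]} - single quotes
json.loads(s)  # raises json.JSONDecodeError: JSON needs double-quoted strings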

rjbks commented 1 week ago

Some odd behavior being observed here, using openai (4o and 4o-mini): I am seeing it not adhere to the pydantic model at all, with no validation errors occurring. For example:

class MatchedSchool(BaseModel):
    name: str = Field(description="The name of the input candidate medical school.")
    alternate_names: List[str] = Field(description="A list of alternate names referencing this school. Could be abbreviations, or fully spelled out names, as well as names of individual departments responsible for the Medical Curriculum within the school.")
    is_med_school: Literal["true", "false"] = Field(description="Whether or not the input school is actually a medical school or has a medical program.")
    city: Optional[str] = Field(description="The city where the matched medical school campus/program facility is located, if available.")
    state: Optional[str] = Field(description="The state (or if international, the geographic/political region within the country) where the matched medical school campus/program facility is located, if available.")
    country: Optional[str] = Field(description="The country where the matched medical school campus/program facility is located, if available.")
    source: str = Field(description="Source URL where this match was found.")

class Matches(BaseModel):
    matches: List[MatchedSchool] = Field(description="A list of matched medical schools.")

I am getting this json back with no validation errors:

{
  "is_medical_school": true,
  "location": {
    "city": "New York",
    "state": "New York",
    "country": "United States"
  },
  "source": "https://en.wikipedia.org/w/index.php?title=New_York_College_of_Podiatric_Medicine&oldid=1219437275",
  "sources": [
    "https://en.wikipedia.org/wiki/New_York_College_of_Podiatric_Medicine"
  ]
}

Then other times, there is a validation error, but I cannot reproduce, and temp is set to 0. 🤷‍♂️

rjbks commented 1 week ago

Same setup as the last post, but with GPT-4o-mini. This time it is the original issue in pydantic/v1/main.py:

pydantic.v1.error_wrappers.ValidationError: 1 validation error for Generation
text
  str type expected (type=type_error.str)

Where pydantic validation results are:

<class 'langchain_core.outputs.generation.Generation'> {'text': Matches(matches=[MatchedSchool(name='Centro Universitário Franciscano', alternate_names=['UNIFRA'], is_med_school='true', city='Santa Maria', state='Rio Grande do Sul', country='Brazil', source='https://www.unifra.br')])} {'generation_info': None, 'type': 'Generation'} {'text'}

This time, we can see it is expecting the 'text' key to be a string but it is receiving the pydantic model instead.

VinciGit00 commented 1 week ago

Ok write me the code please

rjbks commented 1 week ago

Ok write me the code please

v1.19.0b8

class MatchedSchool(BaseModel):
    name: str = Field(description="The name of the input candidate medical school.")
    alternate_names: List[str] = Field(description="A list of alternate names referencing this school. Could be abbreviations, or fully spelled out names, as well as names of individual departments responsible for the Medical Curriculum within the school.")
    is_med_school: Literal["true", "false"] = Field(description="Whether or not the input school is actually a medical school or has a medical program.")
    city: Optional[str] = Field(description="The city where the matched medical school campus/program facility is located, if available.")
    state: Optional[str] = Field(description="The state (or if international, the geographic/political region within the country) where the matched medical school campus/program facility is located, if available.")
    country: Optional[str] = Field(description="The country where the matched medical school campus/program facility is located, if available.")
    source: str = Field(description="Source URL where this match was found.")

class Matches(BaseModel):
    matches: List[MatchedSchool] = Field(description="A list of matched medical schools.")

sg = SearchGraph(
    prompt=f'Search for information for the following medical school: "Centro Universitário Franciscano (UNIFRA) (2014 - 2018)"',
    config={
        "llm": {
            "api_key": os.getenv('OPENAI_APIKEY'), 
            "model": "openai/gpt-4o-mini-2024-07-18", 
            "temperature": 0,
        },
        # "cache_path": "./scrapegraph_cache",
        "verbose": False,
        "max_results": 3,
    },
    schema=Matches
)

res = sg.run()

print(json.dumps(res, indent=2))
print(json.dumps(sg.get_execution_info(), indent=2))

@VinciGit00

LorenzoPaleari commented 1 week ago

Which version are you testing on? I'm using OpenAI to test.

I discovered a weird bug in 1.19.0-beta8 that can be connected to this issue.

In GenerateAnswerNode we have a piece of code that chooses the correct parser for the LLMs (it needs to be updated since most of the models support this method, but that's not the issue now):

if self.node_config.get("schema", None) is not None:
            print("Schema is not None")
            if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI)):
                self.llm_model = self.llm_model.with_structured_output(
                    schema = self.node_config["schema"],
                    method="function_calling") # json schema works only on specific models

                # default parser to empty function
                def output_parser(x):
                    return x
                if is_basemodel_subclass(self.node_config["schema"]):
                    print("Schema is a pydantic model")
                    output_parser = dict
                format_instructions = "NA"
            else:
                print("llm is not Openai")
                output_parser = JsonOutputParser(pydantic_object=self.node_config["schema"])
                format_instructions = output_parser.get_format_instructions()

        else:
            output_parser = JsonOutputParser()
            format_instructions = output_parser.get_format_instructions()

I added the prints

What happens here is: on the first iteration of SmartScraperGraph, the LLM is correctly initialised and I get the expected output:

Schema is not None
Schema is a pydantic model

But the second time SmartScraperGraph is called, and subsequently also GenerateAnswerNode, I observe:

Schema is not None
llm is not Openai

That means there is a problem with the initialisation of the llm that gets lost along the way (since the SmartScraperGraph class is initialised just once). The wrong parser is selected, and this ends up in errors and weird stuff happening.
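
One plausible mechanism (my reading; the thread doesn't confirm it): with_structured_output returns a Runnable rather than a chat model, and the node rebinds self.llm_model to it, so the isinstance check fails on every later execution. Sketch:

from langchain_openai import ChatOpenAI
from pydantic import BaseModel

class Schema(BaseModel):  # illustrative schema
    field: str = ""

llm = ChatOpenAI(model="gpt-4o-mini")
llm = llm.with_structured_output(schema=Schema)  # mirrors the node's rebinding
print(isinstance(llm, ChatOpenAI))  # False: llm is now a Runnable, so the next
                                    # run takes the "llm is not Openai" branch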

@VinciGit00

I can add all the other models that should be able to use with_structured_output() (although I cannot test them, I do not have keys for everyone; I'll just rely on documentation). But the initialisation issue is deeply rooted in the code and it would be difficult for me to fix that.

@rjbks I'm guessing that with some specific pydantic version the "wrong parser" is able to parse all the OpenAI output, while the correct one should work for Pydantic, langchain Pydantic and Dict.

For the Google error, it is possible that adding ChatGoogleGenerativeAI here:

if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI)):

and fixing the initialisation will be enough to fix all errors
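
That amended check would look roughly like this (a sketch mirroring the node snippet above, assuming langchain_google_genai is installed):

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_mistralai import ChatMistralAI
from langchain_openai import ChatOpenAI

# inside GenerateAnswerNode, as in the pasted snippet
if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI, ChatGoogleGenerativeAI)):
    self.llm_model = self.llm_model.with_structured_output(
        schema=self.node_config["schema"],
        method="function_calling")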

rjbks commented 1 week ago

Which version are you testing on? I'm using OpenAI to test. […]

@LorenzoPaleari

So I am using v1.19.0b8, the version you mention. It looks like it is getting the correct JSON from the LLM, parsing it into a Python dict, then stringifying that dict and wrapping it in markdown. Here is one of the validation errors:

langchain_core.exceptions.OutputParserException: Invalid json output: \```json
{'matches': [{'name': 'Centro Universitário Franciscano (UNIFRA)', 'alternate_names': ['UNIFRA'], 'is_med_school': 'false', 'city': 'NA', 'state': 'NA', 'country': 'NA', 'source': 'https://unifra.academia.edu/'}]}
\```

(I escaped the first markdown tick so it wouldn't format it)

LorenzoPaleari commented 1 week ago

So I am using v1.19.0b8, the version you mention. […]

Can you tell me in which iteration this happens?

(I removed all the asyncio code to better understand the sequence)

import copy
from typing import List, Optional

from scrapegraphai.nodes.base_node import BaseNode

DEFAULT_BATCHSIZE = 1


class GraphIteratorNode(BaseNode):
    """
    A node responsible for instantiating and running multiple graph instances.
    It creates as many graph instances as the number of elements in the input list.

    Attributes:
        verbose (bool): A flag indicating whether to show print statements during execution.

    Args:
        input (str): Boolean expression defining the input keys needed from the state.
        output (List[str]): List of output keys to be updated in the state.
        node_config (dict): Additional configuration for the node.
        node_name (str): The unique identifier name for the node, defaulting to "GraphIterator".
    """

    def __init__(
        self,
        input: str,
        output: List[str],
        node_config: Optional[dict] = None,
        node_name: str = "GraphIterator",
    ):
        super().__init__(node_name, "node", input, output, 2, node_config)

        self.verbose = (
            False if node_config is None else node_config.get("verbose", False)
        )

    def execute(self, state: dict) -> dict:
        """
        Executes the node's logic to instantiate and run multiple graph instances.

        Args:
            state (dict): The current state of the graph. The input keys will be used
                to fetch the correct data from the state.

        Returns:
            dict: The updated state with the output key containing the results
                of the graph instances.

        Raises:
            KeyError: If the input keys are not found in the state, indicating that
                the necessary information for running the graph instances is missing.
        """
        batchsize = self.node_config.get("batchsize", DEFAULT_BATCHSIZE)

        self.logger.info(
            f"--- Executing {self.node_name} Node with batchsize {batchsize} ---"
        )

        state = self._async_execute(state, batchsize)

        return state

    def _async_execute(self, state: dict, batchsize: int) -> dict:
        """Executes the node's logic over multiple graph instances.

        The asyncio machinery (and the semaphore that regulated concurrency)
        has been stripped out for debugging, so the instances now run
        sequentially; the name is kept only for compatibility.

        Args:
            state: The current state of the graph.
            batchsize: The maximum number of concurrent instances allowed
                (unused in this sequential variant).

        Returns:
            The updated state with the output key containing the results
            aggregated from all graph instances.

        Raises:
            KeyError: If the input keys are not found in the state.
        """
        input_keys = self.get_input_keys(state)
        input_data = [state[key] for key in input_keys]

        user_prompt = input_data[0]
        urls = input_data[1]

        graph_instance = self.node_config.get("graph_instance", None)

        if graph_instance is None:
            raise ValueError("graph instance is required for concurrent execution")

        if "graph_depth" in graph_instance.config:
            graph_instance.config["graph_depth"] += 1
        else:
            graph_instance.config["graph_depth"] = 1

        graph_instance.prompt = user_prompt

        participants = []

        for url in urls:
            instance = copy.copy(graph_instance)
            instance.source = url
            if url.startswith("http"):
                instance.input_key = "url"
            participants.append(instance)

        # With asyncio removed, graph.run() executes synchronously, so
        # "futures" actually holds the finished results, one per URL.
        futures = []
        for graph in participants:
            print(f"Running graph instance for {graph.source}")
            futures.append(graph.run())

        state.update({self.output[0]: futures})

        return state
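(Side note on the snippet above: copy.copy() makes a shallow copy, so every participant shares the same config dict as the original graph_instance, and mutations like the graph_depth increment are visible through all of them. A minimal sketch, with Graph as a hypothetical stand-in for the real graph class:

import copy

class Graph:  # hypothetical stand-in for the real graph class
    def __init__(self, config: dict):
        self.config = config

base = Graph({"graph_depth": 1})
clone = copy.copy(base)            # shallow copy: clone.config is base.config
clone.config["graph_depth"] += 1

print(base.config["graph_depth"])  # 2 -- the mutation shows through the original
)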
rjbks commented 1 week ago

So this time, with the same settings (openai gpt-4o-mini), I am getting the original error:

File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/pydantic/v1/main.py", line 342, in __init__
    raise validation_error
pydantic.v1.error_wrappers.ValidationError: 1 validation error for Generation
text
  str type expected (type=type_error.str)

Here the text value is a pydantic model instead of a str.
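That failure mode is easy to reproduce directly: Generation (a pydantic v1 model under the hood, judging by the traceback) declares text: str, so handing it a model instance instead of a string raises exactly this error. Match below is a hypothetical stand-in for the structured-output schema:

from langchain_core.outputs import Generation
from pydantic import BaseModel

class Match(BaseModel):  # hypothetical stand-in for the structured-output schema
    name: str

# Generation.text is declared as str, so a model instance fails validation:
Generation(text=Match(name="UNIFRA"))
# pydantic.v1.error_wrappers.ValidationError: 1 validation error for Generation
# text
#   str type expected (type=type_error.str)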

Then with gemini-1.5-pro (but not flash), the previous error, where the output looks like a stringified Python dict wrapped in markdown (note the single quotes):

<class 'langchain_core.outputs.chat_generation.ChatGeneration'> {'message': AIMessage(content="```json\n{'matches': [{'name': 'Centro Universitário Franciscano (UNIFRA)', 'alternate_names': ['UNIFRA'], 'is_med_school': 'NA', 'city': 'NA', 'state': 'NA', 'country': 'NA', 'source': 'https://unifra.academia.edu/'}]}\n```\n", response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'safety_ratings': [{'category': 'HARM_CATEGORY_SEXUALLY_EXPLICIT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HATE_SPEECH', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HARASSMENT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_DANGEROUS_CONTENT', 'probability': 'NEGLIGIBLE', 'blocked': False}]}, id='run-1e086766-9ec8-4569-9490-ab1d6c6559ab-0', usage_metadata={'input_tokens': 239, 'output_tokens': 76, 'total_tokens': 315})} {'text': "```json\n{'matches': [{'name': 'Centro Universitário Franciscano (UNIFRA)', 'alternate_names': ['UNIFRA'], 'is_med_school': 'NA', 'city': 'NA', 'state': 'NA', 'country': 'NA', 'source': 'https://unifra.academia.edu/'}]}\n```\n", 'generation_info': None, 'type': 'ChatGeneration', 'message': AIMessage(content="\```json\n{'matches': [{'name': 'Centro Universitário Franciscano (UNIFRA)', 'alternate_names': ['UNIFRA'], 'is_med_school': 'NA', 'city': 'NA', 'state': 'NA', 'country': 'NA', 'source': 'https://unifra.academia.edu/'}]}\n\```\n", response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'safety_ratings': [{'category': 'HARM_CATEGORY_SEXUALLY_EXPLICIT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HATE_SPEECH', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HARASSMENT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_DANGEROUS_CONTENT', 'probability': 'NEGLIGIBLE', 'blocked': False}]}, id='run-1e086766-9ec8-4569-9490-ab1d6c6559ab-0', usage_metadata={'input_tokens': 239, 'output_tokens': 76, 'total_tokens': 315})} {'message'}
Traceback (most recent call last):
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/json.py", line 84, in parse_result
    return parse_json_markdown(text)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/utils/json.py", line 147, in parse_json_markdown
    return _parse_json(json_str, parser=parser)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/utils/json.py", line 163, in _parse_json
    return parser(json_str)
           ^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/utils/json.py", line 118, in parse_partial_json
    return json.loads(s, strict=strict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/json/__init__.py", line 359, in loads
    return cls(**kw).decode(s)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
               ^^^^^^^^^^^^^^^^^^^^^^
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/rb/PycharmProjects/med_device_crawler/scripts/search_med_schools.py", line 59, in <module>
    res = sg.run()
          ^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/search_graph.py", line 120, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 253, in execute
    return self._execute_standard(initial_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 175, in _execute_standard
    raise e
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 159, in _execute_standard
    result = current_node.execute(state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/nodes/merge_answers_node.py", line 102, in execute
    answer = merge_chain.invoke({"user_prompt": user_prompt})
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 2878, in invoke
    input = context.run(step.invoke, input, config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/base.py", line 183, in invoke
    return self._call_with_config(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 1785, in _call_with_config
    context.run(
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/runnables/config.py", line 398, in call_func_with_variable_args
    return func(input, **kwargs)  # type: ignore[call-arg]
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/base.py", line 184, in <lambda>
    lambda inner_input: self.parse_result(
                        ^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/json.py", line 87, in parse_result
    raise OutputParserException(msg, llm_output=text) from e
langchain_core.exceptions.OutputParserException: Invalid json output: \```json
{'matches': [{'name': 'Centro Universitário Franciscano (UNIFRA)', 'alternate_names': ['UNIFRA'], 'is_med_school': 'NA', 'city': 'NA', 'state': 'NA', 'country': 'NA', 'source': 'https://unifra.academia.edu/'}]}
\```

You can see an earlier iteration with correct formatting:

{'message': AIMessage(content='\```json\n{"matches": [{"name": "Centro Universitário Franciscano (UNIFRA)", "alternate_names": ["UNIFRA"], "is_med_school": "NA", "city": "NA", "state": "NA", "country": "NA", "source": "https://unifra.academia.edu/"}]}\n\```', response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'safety_ratings': [{'category': 'HARM_CATEGORY_SEXUALLY_EXPLICIT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HATE_SPEECH', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HARASSMENT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_DANGEROUS_CONTENT', 'probability': 'NEGLIGIBLE', 'blocked': False}]}, id='run-b0038c3f-b4d4-4d32-91bb-db9b3ecf1f1d-0', usage_metadata={'input_tokens': 8087, 'output_tokens': 76, 'total_tokens': 8163})} {'text': '\```json\n{"matches": [{"name": "Centro Universitário Franciscano (UNIFRA)", "alternate_names": ["UNIFRA"], "is_med_school": "NA", "city": "NA", "state": "NA", "country": "NA", "source": "https://unifra.academia.edu/"}]}\n\```', 'generation_info': None, 'type': 'ChatGeneration', 'message': AIMessage(content='\```json\n{"matches": [{"name": "Centro Universitário Franciscano (UNIFRA)", "alternate_names": ["UNIFRA"], "is_med_school": "NA", "city": "NA", "state": "NA", "country": "NA", "source": "https://unifra.academia.edu/"}]}\n\```', response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'safety_ratings': [{'category': 'HARM_CATEGORY_SEXUALLY_EXPLICIT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HATE_SPEECH', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HARASSMENT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_DANGEROUS_CONTENT', 'probability': 'NEGLIGIBLE', 'blocked': False}]}, id='run-b0038c3f-b4d4-4d32-91bb-db9b3ecf1f1d-0', usage_metadata={'input_tokens': 8087, 'output_tokens': 76, 'total_tokens': 8163})}

Later in the run, that correct JSON gets parsed into a Python dict, then stringified back out of Python (picking up single quotes) and wrapped in markdown.
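My guess (an assumption, but it matches the symptoms) is that the parsed dict is being re-embedded with str()/f-string interpolation instead of json.dumps(); the difference is exactly the single-vs-double quotes above:

import json

answer = json.loads('{"matches": [{"name": "UNIFRA"}]}')  # correct JSON -> dict

str(answer)         # "{'matches': [{'name': 'UNIFRA'}]}"  <- repr, single quotes
json.dumps(answer)  # '{"matches": [{"name": "UNIFRA"}]}'  <- still valid JSON

json.loads(str(answer))  # JSONDecodeError: Expecting property name enclosed in double quotes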

rjbks commented 1 week ago

@LorenzoPaleari

Output from above with gemini-1.5-pro (openai gpt-4o-mini still causes the original error, where a string is expected but a pydantic model is received):

Model google_genai/gemini-1.5-pro not found, 
                  using default token size (8192)
Model google_genai/gemini-1.5-pro not found, 
                  using default token size (8192)
--- Executing SearchInternet Node ---
Search Query: "Centro Universitário Franciscano (UNIFRA)" medical school
--- Executing GraphIterator Node with batchsize 1 ---
Running graph instance for https://www.nanoorbit.com/nanotechnology-companies/centro-universitario-franciscano-unifra.html
--- Executing Fetch Node ---
--- (Fetching HTML from: https://www.nanoorbit.com/nanotechnology-companies/centro-universitario-franciscano-unifra.html) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
Running graph instance for https://unifra.academia.edu/Departments/Potosi/Documents
--- Executing Fetch Node ---
--- (Fetching HTML from: https://unifra.academia.edu/Departments/Potosi/Documents) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
Running graph instance for https://www.scielo.br/j/ean/a/JFLxF47FBSykSPfJGY8DBgj/?format=pdf&lang=en
--- Executing Fetch Node ---
--- (Fetching HTML from: https://www.scielo.br/j/ean/a/JFLxF47FBSykSPfJGY8DBgj/?format=pdf&lang=en) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
--- Executing MergeAnswers Node ---
Traceback (most recent call last):
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/json.py", line 84, in parse_result
    return parse_json_markdown(text)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/utils/json.py", line 147, in parse_json_markdown
    return _parse_json(json_str, parser=parser)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/utils/json.py", line 163, in _parse_json
    return parser(json_str)
           ^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/utils/json.py", line 118, in parse_partial_json
    return json.loads(s, strict=strict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/json/__init__.py", line 359, in loads
    return cls(**kw).decode(s)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
               ^^^^^^^^^^^^^^^^^^^^^^
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/rb/PycharmProjects/med_device_crawler/scripts/search_med_schools.py", line 59, in <module>
    res = sg.run()
          ^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/search_graph.py", line 120, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 253, in execute
    return self._execute_standard(initial_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 175, in _execute_standard
    raise e
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 159, in _execute_standard
    result = current_node.execute(state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/scrapegraphai/nodes/merge_answers_node.py", line 102, in execute
    answer = merge_chain.invoke({"user_prompt": user_prompt})
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 2878, in invoke
    input = context.run(step.invoke, input, config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/base.py", line 183, in invoke
    return self._call_with_config(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 1785, in _call_with_config
    context.run(
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/runnables/config.py", line 398, in call_func_with_variable_args
    return func(input, **kwargs)  # type: ignore[call-arg]
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/base.py", line 184, in <lambda>
    lambda inner_input: self.parse_result(
                        ^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/med_device/lib/python3.12/site-packages/langchain_core/output_parsers/json.py", line 87, in parse_result
    raise OutputParserException(msg, llm_output=text) from e
langchain_core.exceptions.OutputParserException: Invalid json output: \```json
{'matches': [{'name': 'Centro Universitário Franciscano (UNIFRA)', 'alternate_names': ['UNIFRA'], 'is_med_school': 'NA', 'city': 'NA', 'state': 'NA', 'country': 'NA', 'source': 'https://unifra.academia.edu/'}]}
\```
LorenzoPaleari commented 1 week ago

@rjbks

> Output from above with gemini-1.5-pro (openai gpt-4o-mini still causes the original error, where a string is expected but a pydantic model is received):

Thank you! Can you also share the full output of gpt-4o-mini?