ScrapeGraphAI / Scrapegraph-ai

Python scraper based on AI
https://scrapegraphai.com
MIT License

The script smart_scraper_schema_azure.py from the example/azure directory cannot be executed because the 'SmartScraperGraph' object has no attribute 'model_token'. #434

Closed mingjun1120 closed 2 months ago

mingjun1120 commented 3 months ago

Describe the bug

I attempted to execute the smart_scraper_schema_azure.py script from the Scrapegraph-ai/example/azure GitHub directory, but encountered the following issue:

Traceback (most recent call last):
  File "C:\Users\GV631HJ\OneDrive - EY\Desktop\CIMB\scrape.py", line 53, in <module>
    smart_scraper_graph = SmartScraperGraph(
                          ^^^^^^^^^^^^^^^^^^
  File "C:\Users\GV631HJ\OneDrive - EY\Desktop\CIMB\venv\Lib\site-packages\scrapegraphai\graphs\smart_scraper_graph.py", line 53, in __init__
    super().__init__(prompt, config, source, schema)
  File "C:\Users\GV631HJ\OneDrive - EY\Desktop\CIMB\venv\Lib\site-packages\scrapegraphai\graphs\abstract_graph.py", line 84, in __init__
    self.graph = self._create_graph()
                 ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\GV631HJ\OneDrive - EY\Desktop\CIMB\venv\Lib\site-packages\scrapegraphai\graphs\smart_scraper_graph.py", line 75, in _create_graph
    "chunk_size": self.model_token
                  ^^^^^^^^^^^^^^^^
AttributeError: 'SmartScraperGraph' object has no attribute 'model_token'

To Reproduce (Code)

I am just using the sample Azure OpenAI example code.

""" 
Basic example of scraping pipeline using SmartScraper with schema
"""

import os, json
from typing import List
from pydantic import BaseModel, Field
from dotenv import load_dotenv
from langchain_openai import AzureChatOpenAI
from langchain_openai import AzureOpenAIEmbeddings
from scrapegraphai.graphs import SmartScraperGraph, SmartScraperMultiGraph

load_dotenv()

# ************************************************
# Define the output schema for the graph
# ************************************************

class Project(BaseModel):
    title: str = Field(description="The title of the project")
    description: str = Field(description="The description of the project")

class Projects(BaseModel):
    projects: List[Project]

# ************************************************
# Initialize the model instances
# ************************************************

llm_model_instance = AzureChatOpenAI(
    openai_api_key = os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint = os.environ["AZURE_OPENAI_ENDPOINT"],
    openai_api_version = os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment = os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"],

)

embedder_model_instance = AzureOpenAIEmbeddings(
    openai_api_key = os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint = os.environ["AZURE_OPENAI_ENDPOINT"],
    openai_api_version = os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment = os.environ["AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME"],
)

graph_config = {
    "llm": {"model_instance": llm_model_instance},
    "embeddings": {"model_instance": embedder_model_instance}
}
# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description",
    source="https://perinim.github.io/projects/",
    schema=Projects,
    config=graph_config
)

result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))

Expected behavior

It should print the JSON output for the extracted data.


marcantoinefortier commented 3 months ago

This seems to be a duplicate of https://github.com/ScrapeGraphAI/Scrapegraph-ai/issues/422

f-aguzzi commented 3 months ago

Try this example (I have no idea if it will work, because I don't have access to Azure to test it):

""" 
Basic example of scraping pipeline using SmartScraper using Azure OpenAI Key
"""

import os
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

# required environment variable in .env
# AZURE_OPENAI_KEY

graph_config = {
    "llm": {
        "api_key": os.environ["AZURE_OPENAI_KEY"],
        "model": "azure/gpt-3.5-turbo",
    },
    "verbose": True,
    "headless": False
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the titles",
    source="https://sport.sky.it/nba?gr=www",
    config=graph_config
)

smart_scraper_graph = SmartScraperGraph(
    prompt="""List me all the events, with the following fields: company_name, event_name, event_start_date, event_start_time, 
    event_end_date, event_end_time, location, event_mode, event_category, 
    third_party_redirect, no_of_days, 
    time_in_hours, hosted_or_attending, refreshments_type, 
    registration_available, registration_link""",
    # also accepts a string with the already downloaded HTML code
    source="https://www.hmhco.com/event",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))

If it works, we'll put it in place instead of the current example.

mingjun1120 commented 3 months ago

@f-aguzzi, I got this error: Model not supported

Traceback (most recent call last):
  File "C:\Users\GV631HJ\OneDrive - EY\Desktop\CIMB\venv\Lib\site-packages\scrapegraphai\graphs\abstract_graph.py", line 153, in _create_llm
    self.model_token = models_tokens["openai"][llm_params["model"]]
                       ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
KeyError: 'azure/gpt-3.5-turbo'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\GV631HJ\OneDrive - EY\Desktop\CIMB\scrapegraph.py", line 27, in <module>
    smart_scraper_graph = SmartScraperGraph(
                          ^^^^^^^^^^^^^^^^^^
  File "C:\Users\GV631HJ\OneDrive - EY\Desktop\CIMB\venv\Lib\site-packages\scrapegraphai\graphs\smart_scraper_graph.py", line 53, in __init__
    super().__init__(prompt, config, source, schema)
  File "C:\Users\GV631HJ\OneDrive - EY\Desktop\CIMB\venv\Lib\site-packages\scrapegraphai\graphs\abstract_graph.py", line 73, in __init__
    self.llm_model = self._create_llm(config["llm"], chat=True)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\GV631HJ\OneDrive - EY\Desktop\CIMB\venv\Lib\site-packages\scrapegraphai\graphs\abstract_graph.py", line 155, in _create_llm
    raise KeyError("Model not supported") from exc
KeyError: 'Model not supported'
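
For context, the failing lookup can be sketched as follows. This is a minimal, hypothetical sketch: only the failing line is taken from the traceback above, and the contents of the models_tokens table are assumed, not copied from the library.

```python
# Minimal sketch of the lookup that raises "Model not supported".
# The table contents below are assumed for illustration.
models_tokens = {
    "openai": {
        "gpt-3.5-turbo": 4096,
        "gpt-4o": 128000,
    },
}

def lookup_model_token(llm_params: dict) -> int:
    # Mirrors the failing line from the traceback: the whole model string
    # (including any "azure/" prefix) is used as the dictionary key.
    try:
        return models_tokens["openai"][llm_params["model"]]
    except KeyError as exc:
        raise KeyError("Model not supported") from exc

# A plain model name resolves, but the prefixed name is not a key:
print(lookup_model_token({"model": "gpt-3.5-turbo"}))   # 4096
try:
    lookup_model_token({"model": "azure/gpt-3.5-turbo"})
except KeyError as err:
    print(err)                                          # 'Model not supported'
```

This is why the "azure/gpt-3.5-turbo" string surfaced as KeyError: 'Model not supported' rather than an authentication or endpoint error.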
mingjun1120 commented 3 months ago

@f-aguzzi, I made some changes to the code (marked below), but it's still not working. I did deploy gpt-4o under exactly that name, gpt-4o, in my Azure OpenAI environment.

""" 
Basic example of scraping pipeline using SmartScraper using Azure OpenAI Key
"""
import os
import requests
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

# required environment variable in .env
# AZURE_OPENAI_KEY

graph_config = {
    "llm": {
        "api_key": os.getenv("AZURE_OPENAI_API_KEY"),  # <-- updated
        "model": "gpt-4o",  # <-- updated
    },
    "verbose": True,
    "headless": False
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the titles",
    source="https://sport.sky.it/nba?gr=www",
    config=graph_config
)

smart_scraper_graph = SmartScraperGraph(
    prompt="""List me all the events, with the following fields: company_name, event_name, event_start_date, event_start_time, 
    event_end_date, event_end_time, location, event_mode, event_category, 
    third_party_redirect, no_of_days, 
    time_in_hours, hosted_or_attending, refreshments_type, 
    registration_available, registration_link""",
    # also accepts a string with the already downloaded HTML code
    source="https://www.hmhco.com/event",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))

The error: Did not find openai_api_key

Traceback (most recent call last):
  File "/teamspace/studios/this_studio/test.py", line 27, in <module>
    smart_scraper_graph = SmartScraperGraph(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 53, in __init__
    super().__init__(prompt, config, source, schema)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/scrapegraphai/graphs/abstract_graph.py", line 73, in __init__
    self.llm_model = self._create_llm(config["llm"], chat=True)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/scrapegraphai/graphs/abstract_graph.py", line 156, in _create_llm
    return OpenAI(llm_params)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/scrapegraphai/models/openai.py", line 17, in __init__
    super().__init__(**llm_config)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pydantic/v1/main.py", line 341, in __init__
    raise validation_error
pydantic.v1.error_wrappers.ValidationError: 1 validation error for OpenAI
__root__
  Did not find openai_api_key, please add an environment variable `OPENAI_API_KEY` which contains it, or pass `openai_api_key` as a named parameter. (type=value_error)
VinciGit00 commented 2 months ago

you should have done:

graph_config = {
    "llm": {
        "api_key": os.environ["AZURE_OPENAI_KEY"],
        "model": "azure/gpt-4o",
    },
    "verbose": True,
    "headless": False
}

haizadtarik commented 2 months ago

There is a bug in the implementation of passing a model instance directly instead of the model details: the code does not assign self.model_token when a model instance is passed. https://github.com/ScrapeGraphAI/Scrapegraph-ai/blob/208ab267ceda30b4527222d9dfd61e5c5ed243c3/scrapegraphai/graphs/abstract_graph.py#L149-L151
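
The failure mode can be reproduced in isolation. This is a hypothetical minimal sketch, not the real scrapegraphai code: the class and method names are stand-ins, but the pattern matches the bug described above, where the model-instance branch returns early without ever setting self.model_token.

```python
# Hypothetical stand-in reproducing the bug pattern: _create_llm returns
# the instance without assigning self.model_token, so a later read of it
# in _create_graph raises AttributeError.
class BuggyGraph:
    def _create_llm(self, llm_params: dict):
        if "model_instance" in llm_params:
            # returns early: self.model_token is never assigned
            return llm_params["model_instance"]

    def _create_graph(self):
        return {"chunk_size": self.model_token}  # AttributeError here

g = BuggyGraph()
g._create_llm({"model_instance": object()})
try:
    g._create_graph()
except AttributeError as err:
    print(err)  # 'BuggyGraph' object has no attribute 'model_token'
```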

Quick Fix

1. Add the model_token assignment before returning the instantiated model:

    # If model instance is passed directly instead of the model details
    if "model_instance" in llm_params:
        try:
            self.model_token = llm_params["model_tokens"]
        except KeyError as exc:
            raise KeyError("model_tokens not specified") from exc
        return llm_params["model_instance"]

2. Manually pass the model token in graph_config:

    graph_config = {
        "llm": {
            "model_instance": model_instance,
            "model_tokens": <YOUR_MODEL_TOKEN>,
        }
    }
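
Put together, the patched branch behaves as follows. This is a self-contained sketch of the fix's assumed shape, with a plain object standing in for the AzureChatOpenAI instance; it is not the library's actual class.

```python
# Self-contained sketch of the patched branch: with "model_tokens"
# supplied in the config, model_token is set before the instance is
# returned, so downstream nodes can read it.
class PatchedGraph:
    def _create_llm(self, llm_params: dict):
        if "model_instance" in llm_params:
            try:
                self.model_token = llm_params["model_tokens"]
            except KeyError as exc:
                raise KeyError("model_tokens not specified") from exc
            return llm_params["model_instance"]

g = PatchedGraph()
instance = object()  # stand-in for AzureChatOpenAI(...)
llm = g._create_llm({"model_instance": instance, "model_tokens": 128000})
assert llm is instance
assert g.model_token == 128000
```

Omitting "model_tokens" from the config now fails fast with KeyError: 'model_tokens not specified' instead of the later AttributeError.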
VinciGit00 commented 2 months ago

thank you for the tip, please update to the new version