ScrapeGraphAI / Scrapegraph-ai

Python scraper based on AI
https://scrapegraphai.com
MIT License
15.91k stars, 1.3k forks

Scraping and look for information #830

Open silgon opened 1 day ago

silgon commented 1 day ago

The following is the script from the README with two parameters of SmartScraperGraph changed: prompt and source. I ask it to recover the classes that the ScrapeGraph-AI library uses for searching the web. What I want is a search in which the script can crawl into some of the tabs, since I know the information should be on one of those pages.

import json
from scrapegraphai.graphs import SmartScraperGraph

# Define the configuration for the scraping pipeline
graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_APIKEY",
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="Find some information the classes used in the library for searching on the web",
    source="https://scrapegraph-ai.readthedocs.io/en/latest",
    config=graph_config
)

# Run the pipeline
result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))

Result

--- Executing Fetch Node ---
--- (Fetching HTML from: https://scrapegraph-ai.readthedocs.io/en/latest) ---
--- Executing ParseNode Node ---
--- Executing GenerateAnswer Node ---
{
    "classes": "NA"
}

Of course it does not get the information: as the logs show, it only fetches the URL I gave it. Is there any way to do this within the library? Thanks!

VinciGit00 commented 13 hours ago

Iterate it
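
One possible reading of "iterate it", as a minimal sketch rather than a built-in feature of the library: collect the links on the start page that stay under the same documentation root, then run the same SmartScraperGraph prompt on each page. The helpers same_site_links, run_smart_scraper, scrape_pages and crawl_one_level, and the max_pages parameter, are hypothetical names introduced here for illustration. Only the SmartScraperGraph(prompt=..., source=..., config=...).run() call is taken from the script above.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(html, base_url):
    """Return absolute, de-duplicated links that live under base_url."""
    collector = _LinkCollector()
    collector.feed(html)
    prefix = base_url.rstrip("/") + "/"
    seen, links = set(), []
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(prefix, href))
        if absolute.startswith(prefix) and absolute != prefix and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


def run_smart_scraper(prompt, url, config):
    """Run SmartScraperGraph on a single page, as in the script above."""
    # Imported lazily so the link helpers work without the library installed.
    from scrapegraphai.graphs import SmartScraperGraph

    return SmartScraperGraph(prompt=prompt, source=url, config=config).run()


def scrape_pages(prompt, urls, config, run_one=run_smart_scraper):
    """Ask the same prompt on every URL and keep one result per page."""
    results = {}
    for url in urls:
        try:
            results[url] = run_one(prompt, url, config)
        except Exception as exc:  # keep going if a single page fails
            results[url] = {"error": str(exc)}
    return results


def crawl_one_level(prompt, start_url, config, max_pages=10):
    """Scrape the start page plus the pages it links to (one level deep)."""
    with urlopen(start_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    urls = [start_url] + same_site_links(html, start_url)
    # max_pages caps the number of LLM calls (hypothetical safety limit).
    return scrape_pages(prompt, urls[:max_pages], config)
```

With the graph_config from the script above, crawl_one_level(prompt, "https://scrapegraph-ai.readthedocs.io/en/latest", graph_config) would return a dictionary keyed by page URL. Every page costs one LLM call, so the max_pages cap is worth keeping small while experimenting.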

silgon commented 2 hours ago

Well, that was my first thought too. I asked because you have a nice node structure, and I thought this task might fit naturally into it. I guess not, then. Thanks for the reply.