ScrapeGraphAI / Scrapegraph-ai

Python scraper based on AI
https://scrapegraphai.com
MIT License

Issue with Extracting URLs Using ScrapeGraphAI in Flask Application #451

Closed deejay99 closed 1 month ago

deejay99 commented 1 month ago

Discussed in https://github.com/ScrapeGraphAI/Scrapegraph-ai/discussions/450

Originally posted by **deejay99** July 13, 2024

Hi everyone,

I'm currently working on a Flask application that uses ScrapeGraphAI to scrape web pages. While my application successfully scrapes pages and extracts content, I only get NAs back for the URLs of listings on scraped pages. Interestingly, the official demo of ScrapeGraphAI can extract listing URLs without any issues. I'm hoping someone here might be able to help me identify and resolve the problem.

For example, with `{"source":"https://park-immobilien.ch/kaufen", "prompt":"List me the first 10 listings incl. urls"}`.

**My Setup**

Here's a brief overview of my setup:

- Framework: Flask
- Deployment platform: local and Heroku (both have the same problem)

**Code Snippet**

Below is the relevant part of my `app.py`:

```python
from flask import Flask, request, jsonify
from scrapegraphai.graphs import SmartScraperGraph
import os
from dotenv import load_dotenv

load_dotenv()

app = Flask(__name__)

@app.route('/scrape', methods=['POST'])
def scrape():
    data = request.json
    prompt = data.get('prompt', 'List me all the articles')
    source = data.get('source', 'https://perinim.github.io/projects')

    graph_config = {
        "llm": {
            "api_key": os.getenv("OPENAI_API_KEY"),
            "model": "gpt-4o",
        },
    }

    smart_scraper_graph = SmartScraperGraph(
        prompt=prompt,
        source=source,
        config=graph_config
    )

    result = smart_scraper_graph.run()
    return jsonify(result)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```

and then:

```shell
curl -X POST http://localhost:5000/scrape \
  -H "Content-Type: application/json" \
  -d '{"source":"https://park-immobilien.ch/kaufen", "prompt":"List me the first 10 listings incl. their URL"}'
```
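One common cause of NA or unusable URLs with LLM-based scrapers is that listing pages often contain only relative hrefs, which the model may drop or return as-is. A small post-processing step in the Flask route can normalize them against the scraped page. This is a hypothetical helper sketch, not part of the ScrapeGraphAI API; the `url` key name is an assumption about the result shape:

```python
from urllib.parse import urljoin

def absolutize_urls(result, source):
    """Recursively resolve relative 'url' fields in a scrape result
    against the page the data came from (hypothetical helper; assumes
    URLs live under a key named 'url')."""
    if isinstance(result, dict):
        return {
            k: urljoin(source, v) if k == "url" and isinstance(v, str)
            else absolutize_urls(v, source)
            for k, v in result.items()
        }
    if isinstance(result, list):
        return [absolutize_urls(item, source) for item in result]
    return result
```

In the route above, this would be applied as `jsonify(absolutize_urls(result, source))` before returning the response.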
f-aguzzi commented 1 month ago

There's a specific type of node that might be useful for your use case: the Search Link Node. It is specialized in finding links within a webpage that might be related to a specific user query.

If this route seems interesting to you, we could try to build a custom graph based on this node, specifically to find links. Otherwise, if you want to keep a one-size-fits-all solution for your web app, we could just try to debug the Smart Scraper to find out why it doesn't return links. Or both. Let us know and we'll be happy to collaborate 🤝
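For comparison, link discovery itself is deterministic and doesn't need an LLM: the hrefs can be pulled straight from the HTML and used as a sanity check against what the Smart Scraper returns. Below is a plain-Python sketch using only the standard library; it is not the ScrapeGraphAI Search Link Node, just an illustration of the same idea:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect absolute hrefs from <a> tags (stdlib sketch, not the
    ScrapeGraphAI Search Link Node)."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative hrefs against the page URL
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

Filtering the collected links (e.g. keeping only those under `/kaufen/`) would then approximate what a query-aware link node does with an LLM.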

VinciGit00 commented 1 month ago

Hi, please update to the new beta.

deejay99 commented 1 month ago

Thanks both, will do and test asap, traveling today/tmr!

deejay99 commented 4 weeks ago

works perfectly now, much appreciated!