ScrapeGraphAI / Scrapegraph-ai

Python scraper based on AI
https://scrapegraphai.com
MIT License

LLM-powered RSS Feed Generator with Full-Text Extraction and Auto-Updating Tags #586

Open berkbirkan opened 2 weeks ago

berkbirkan commented 2 weeks ago

I am developing a product that requires converting any webpage into an RSS feed (in XML or JSON format). If an RSS feed URL is already available (thus no need to create it from scratch), we would need a full-text RSS converter (similar to what packages like FiveFilters or Morss do). As you know, most RSS feeds only provide descriptions, and to get the full content, one needs to visit the URL.
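To make the full-text step concrete, here is a minimal sketch using only the Python standard library. It assumes a plain RSS 2.0 feed without XML namespaces (Atom would need different element names), and `to_full_text_feed` and `fetch_full_text` are hypothetical names chosen for illustration: the latter is whatever function downloads a post URL and returns its extracted article text.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def to_full_text_feed(rss_xml: str, fetch_full_text: Callable[[str], str]) -> str:
    """Return a copy of an RSS 2.0 feed whose <description> holds the full text.

    `fetch_full_text` is a hypothetical callable: it receives a post URL and
    returns the article body extracted from that page.
    """
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        link = item.findtext("link")
        if not link:
            continue  # no URL to visit, leave the entry unchanged
        description = item.find("description")
        if description is None:
            description = ET.SubElement(item, "description")
        # Swap the teaser for the content extracted from the post page.
        description.text = fetch_full_text(link.strip())
    return ET.tostring(root, encoding="unicode")
```

Passing the extractor in as a callable keeps the feed rewriting separate from the fetching and parsing logic, so the same function works whether the text comes from stored HTML tags or from an LLM call.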

Extracting article content with packages like Newspaper3k is quite challenging, because they often do not work well across different websites due to varying site structures. This is where an LLM (Large Language Model) could help, by understanding the HTML tag structure of each website. The system would then save the necessary HTML tags for the various data points (title, description, full content, post link, and tags) in a database (e.g., MySQL, PostgreSQL). This way, the LLM would not need to be run each time.

If the website's structure changes, causing the stored tags in the database to become invalid, the system should automatically detect this and update the tags using the LLM, ensuring seamless content extraction regardless of how often the site is updated.
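The store-then-refresh behaviour could look roughly like the sketch below. Everything here is an assumption for illustration: SQLite stands in for MySQL/PostgreSQL, a stored "tag" is simplified to a tag name plus one class name, a stored tag counts as invalid when it no longer yields any text, and `infer_selector` is a hypothetical callable that would wrap the LLM.

```python
import sqlite3
from html.parser import HTMLParser
from typing import Callable, Optional, Tuple


class _TextOf(HTMLParser):
    """Collect the text inside the first element matching a tag name and class."""

    def __init__(self, tag: str, css_class: str):
        super().__init__()
        self.tag, self.css_class = tag, css_class
        self.depth = 0  # greater than 0 while inside the matched element
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            if tag == self.tag:
                self.depth += 1  # nested element with the same tag name
        elif tag == self.tag and self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract(html: str, tag: str, css_class: str) -> Optional[str]:
    parser = _TextOf(tag, css_class)
    parser.feed(html)
    return "".join(parser.chunks).strip() or None


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS selectors ("
        "site TEXT, field TEXT, tag TEXT, css_class TEXT, PRIMARY KEY (site, field))"
    )


def extract_field(
    conn: sqlite3.Connection,
    site: str,
    field: str,
    html: str,
    infer_selector: Callable[[str, str], Tuple[str, str]],
) -> Optional[str]:
    row = conn.execute(
        "SELECT tag, css_class FROM selectors WHERE site = ? AND field = ?",
        (site, field),
    ).fetchone()
    if row:
        text = extract(html, *row)
        if text:
            return text  # stored tag still works, so no LLM call is needed
    # Missing or stale tag: ask the (hypothetical) LLM-backed callable for a new one.
    tag, css_class = infer_selector(html, field)
    conn.execute(
        "INSERT OR REPLACE INTO selectors VALUES (?, ?, ?, ?)",
        (site, field, tag, css_class),
    )
    conn.commit()
    return extract(html, tag, css_class)
```

With this shape, the LLM is only invoked on the first visit to a site and again whenever the stored tag stops matching. A real system would likely want a stricter validity check than "returned some text", for example a minimum length for the full content.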

Additionally, there is often an issue with IP bans when extracting content from the URLs for each post within an RSS feed. To solve this, we would also need proxy support.
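As a rough idea of what proxy support could mean here, the following sketch rotates through a list of proxies with the standard library's `urllib.request`, moving to the next proxy when a request fails. The function names, the retry count, and the timeout are assumptions for illustration, not part of any existing package.

```python
import itertools
import urllib.request
from typing import Iterable, Iterator


def proxy_openers(proxies: Iterable[str]) -> Iterator[urllib.request.OpenerDirector]:
    """Yield openers that cycle through the given proxy URLs indefinitely."""
    for proxy in itertools.cycle(list(proxies)):
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        yield urllib.request.build_opener(handler)


def fetch(url: str, openers: Iterator, retries: int = 3, timeout: float = 10.0) -> bytes:
    """Download `url`, switching to the next proxy after each failed attempt."""
    last_error = None
    for _ in range(retries):
        opener = next(openers)
        try:
            with opener.open(url, timeout=timeout) as response:
                return response.read()
        except OSError as error:  # URLError, HTTPError and timeouts are OSError subclasses
            last_error = error
    raise RuntimeError(f"all {retries} attempts failed for {url}") from last_error
```

A call would look like `fetch(post_url, proxy_openers(["http://10.0.0.1:8080", "http://10.0.0.2:8080"]))`, where the proxy addresses are placeholders.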

Proposed Solution:

LLM-Powered HTML Tag Extraction: Utilize an LLM to analyze the HTML structure of various websites to determine the appropriate tags for content extraction (title, description, full content, post link, tags, etc.). Store these tags in a database for future use.

Auto-Update Tags on Structure Change: Implement a system that monitors websites for structural changes and automatically triggers the LLM to update the stored HTML tags in the database.

Full-Text RSS Converter: Develop a feature that can convert partial-content RSS feeds into full-text feeds by fetching the full content using the stored HTML tags.

Proxy Support: Integrate proxy support to prevent IP bans during content extraction.

This solution will allow for efficient RSS feed generation and content extraction without creating thousands of different Python scripts, which is not practical for scaling across thousands of websites.

Additional Context:

The goal is to create an effective, scalable solution for generating and maintaining RSS feeds with full content from a wide variety of websites, adapting to changes in site structure without constant manual intervention.

f-aguzzi commented 2 weeks ago

Thanks for reaching out and taking interest in our project.

Conceptually, this seems close to our ScriptCreatorGraph, a graph to automate the creation of self-updating BeautifulSoup scrapers. The idea was, just like in your project, to use the LLM to set up a "static" scraper and then update it when it breaks due to changes in the page structure. Unfortunately, the idea was never pursued to its full potential and the implementation is very basic. This was partially due to lack of interest from the users.

Your proposal, however, brings this idea back to the table. The team is a bit busy at the moment, and I'm just a maintainer with no authority to approve or deny collaborations, but we might be interested in reopening the path of hybrid / self-updating scrapers. I'll let the others know.