BuilderIO / gpt-crawler

Crawl a site to generate knowledge files to create your own custom GPT from a URL
https://www.builder.io/blog/custom-gpt
ISC License
18.59k stars, 1.97k forks

add exclude to the script (my first pull request) #101

Closed: Webstudio88 closed this pull request 8 months ago

Webstudio88 commented 10 months ago

Hello,

I am excited to submit this pull request, which adds a new feature to the GPT Crawler project: users can exclude specific HTML tags from scraping, so the extracted data is cleaner and more relevant.

Key Changes

Motivation

In many web scraping scenarios, it's crucial to focus only on relevant data while excluding unnecessary elements like headers, footers, and scripts. This feature addresses that need by allowing users to specify elements to exclude, thus streamlining the data extraction process for cleaner and more efficient results.
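To make the idea concrete, here is a minimal sketch of tag exclusion applied to a raw HTML string. It is not the code from this pull request. The helper name `stripTags` and its `excludeTags` parameter are hypothetical, and the regex approach is an assumption chosen to keep the example self-contained. It does not handle nested elements of the same tag name, so a real crawler would more likely remove the matching nodes from the DOM before extracting text.

```typescript
// Hypothetical helper for illustration only; not part of gpt-crawler.
// Removes every <tag ...>...</tag> element whose name is in excludeTags.
// Limitation: regex matching does not handle same-name nesting
// (e.g. a <div> inside a <div>), so it suits tags like header/footer/script.
function stripTags(html: string, excludeTags: string[]): string {
  let result = html;
  for (const tag of excludeTags) {
    // Escape regex metacharacters so the tag name is matched literally.
    const name = tag.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    // \b keeps "head" from matching "<header>"; [\s\S]*? matches lazily
    // across newlines up to the first closing tag.
    const element = new RegExp(
      `<${name}\\b[^>]*>[\\s\\S]*?<\\/${name}\\s*>`,
      "gi",
    );
    result = result.replace(element, "");
  }
  return result;
}

const page =
  "<header>Nav</header><main><p>Hello</p><script>var x=1;</script></main><footer>F</footer>";

console.log(stripTags(page, ["header", "footer", "script"]));
// prints "<main><p>Hello</p></main>"
```

The list of tag names is passed in rather than hard-coded, which mirrors the goal of letting users specify which elements to exclude.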

I believe this feature will be a valuable addition to the GPT Crawler project, offering users more control over the data they are scraping. I look forward to your feedback and hope to contribute further to the development of this project.

Best regards, Peter Goedhart

marcelovicentegc commented 8 months ago

Hey, @Webstudio88! Thanks for the PR! This feature was addressed in https://github.com/BuilderIO/gpt-crawler/pull/122 by leveraging Crawlee's built-in exclude feature, so I'm closing this PR :hugs: