I am excited to submit this pull request, which adds a new feature to the GPT Crawler project. It lets users exclude specific HTML elements from the scraping process, so the extracted data is cleaner and more relevant.
Key Changes
Added a selectorexcl option in the crawler configuration.
Updated the getPageHtml function to handle the exclusion of specified HTML elements.
Included examples and instructions in the README for utilizing this new feature.
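As a rough illustration of the idea (a sketch under assumptions, not the exact diff in this PR): only the selectorexcl option name comes from the changes listed above. The CrawlConfig shape, the RootLike and RemovableLike types, and the extractText helper are hypothetical names introduced for this example. The structural types stand in for DOM elements so the snippet runs without a browser; in the crawler, this logic would presumably run inside the page context.

```typescript
// Hypothetical config shape; only `selectorexcl` is named in this PR.
interface CrawlConfig {
  url: string;
  selector: string;
  // Optional CSS selector list of elements to drop, e.g. "header, footer, script".
  selectorexcl?: string;
}

// Minimal structural stand-ins for DOM types (assumptions for this sketch).
interface RemovableLike {
  remove(): void;
}

interface RootLike {
  readonly innerText: string;
  querySelectorAll(selector: string): {
    forEach(cb: (el: RemovableLike) => void): void;
  };
}

// Remove every element matching `selectorexcl` from the root,
// then return the text that is left.
function extractText(root: RootLike, config: CrawlConfig): string {
  if (config.selectorexcl) {
    root.querySelectorAll(config.selectorexcl).forEach((el) => el.remove());
  }
  return root.innerText;
}
```

When selectorexcl is omitted, nothing is removed and the text is returned as-is, so existing configurations would keep working unchanged.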
Motivation
In many web scraping scenarios, it's crucial to focus only on relevant data while excluding unnecessary elements like headers, footers, and scripts. This feature addresses that need by allowing users to specify elements to exclude, thus streamlining the data extraction process for cleaner and more efficient results.
I believe this feature will be a valuable addition to the GPT Crawler project, offering users more control over the data they are scraping. I look forward to your feedback and hope to contribute further to the development of this project.
Hey, @Webstudio88! Thanks for the PR! This feature was addressed on https://github.com/BuilderIO/gpt-crawler/pull/122 by leveraging Crawlee's built-in exclude feature, so I'm closing this PR :hugs:
Hello,
Best regards, Peter Goedhart