BuilderIO / gpt-crawler

Crawl a site to generate knowledge files to create your own custom GPT from a URL
https://www.builder.io/blog/custom-gpt
ISC License
18.16k stars, 1.88k forks

Refactor getPageHtml function to handle selector not found case, using body as fallback. Add support for downloading URLs from sitemap.xml. Update comments to note that sitemaps are supported #26

Closed guillermoscript closed 7 months ago

guillermoscript commented 7 months ago

This pull request includes several changes to improve the functionality of the code:

  1. Refactored the getPageHtml function to handle the case when the specified selector is not found on the page. In this case, the function now falls back to using the body selector to retrieve the page content.

  2. Added a try-catch block to handle the case when the specified selector is not found during the page crawl. If the selector is not found, a warning message is logged and the function falls back to using the body selector.

  3. Added support for downloading URLs from a sitemap.xml file. If the provided URL is a sitemap, all pages listed in the sitemap will be crawled.

  4. Updated comments in the code to indicate that sitemap support has been added.
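The fallback described in items 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `PageLike`, `textOf`, and `getPageText` are hypothetical names standing in for the real browser page object and the real `getPageHtml` function.

```typescript
// Hypothetical stand-in for the parts of a browser page this sketch needs.
// A real crawler would use the page object of its browser library instead.
interface PageLike {
  // Resolves to the text of the first element matching `selector`,
  // or null when nothing matches.
  textOf(selector: string): Promise<string | null>;
}

// Return the text for `selector`. If the selector matches nothing, or the
// lookup throws, log a warning and fall back to the "body" selector.
async function getPageText(
  page: PageLike,
  selector: string,
  warn: (msg: string) => void = console.warn,
): Promise<string> {
  try {
    const text = await page.textOf(selector);
    if (text !== null) {
      return text;
    }
    warn(`Selector "${selector}" not found, falling back to "body"`);
  } catch (err) {
    warn(`Lookup of "${selector}" failed (${String(err)}), falling back to "body"`);
  }
  // Last resort: an empty string if even "body" yields nothing.
  return (await page.textOf("body")) ?? "";
}
```

Handling both the "no match" and the "lookup threw" paths in one place means a single misconfigured selector produces a warning rather than aborting the crawl.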

These changes improve the robustness and flexibility of the code, allowing it to handle cases where the specified selector is not found and enabling the crawling of pages listed in a sitemap.
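To illustrate the sitemap handling, here is a small self-contained sketch of detecting a sitemap URL and pulling the page URLs out of it. It is an assumption-laden example, not the PR's implementation: `isSitemapUrl` and `extractSitemapUrls` are hypothetical helpers, the detection rule (URL ends in `sitemap.xml`) is a guess, and the regex-based parsing ignores cases such as CDATA sections and nested sitemap index files.

```typescript
// Heuristic used only in this sketch: a URL is treated as a sitemap when it
// ends in "sitemap.xml", optionally followed by a query string.
function isSitemapUrl(url: string): boolean {
  return /sitemap\.xml(\?.*)?$/i.test(url);
}

// Collect the contents of every <loc> element in a sitemap document.
function extractSitemapUrls(xml: string): string[] {
  const urls: string[] = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/gi;
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    // Sitemaps must entity-escape "&"; undo that so the URL is usable.
    urls.push(match[1].replace(/&amp;/g, "&"));
  }
  return urls;
}

// Usage: every URL listed in the sitemap becomes a page to crawl.
const exampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/docs/search?q=a&amp;page=2</loc></url>
</urlset>`;
const pagesToCrawl = extractSitemapUrls(exampleSitemap);
```

A production crawler would more likely rely on a real XML parser or a helper from its crawling library than on a regular expression.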

Fixes #16

vaibhavkumar-sf commented 7 months ago

Preparing review...

guillermoscript commented 7 months ago

> Very cool @guillermoscript! We just have a merge conflict and once resolved we can get this in

Thanks! I just updated the code, basically adding the sitemap support to this new version along with the block resource list prop, so users can skip images, for example. If you want to test those, I would recommend you use

let me know if any other change is required :D
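As a rough illustration of how a block list could let users skip images, consider the sketch below. Everything in it is an assumption made for the example: `blockedExtensions` and `shouldBlockResource` are hypothetical names, not the option or function added in this PR.

```typescript
// Hypothetical block list: file extensions the crawler should not download.
const blockedExtensions: string[] = ["png", "jpg", "jpeg", "gif", "svg", "webp"];

// Decide whether a requested resource should be skipped, based only on the
// file extension in the URL path.
function shouldBlockResource(
  url: string,
  extensions: string[] = blockedExtensions,
): boolean {
  // Drop the query string and fragment before looking for an extension.
  const path = url.split(/[?#]/)[0];
  const dot = path.lastIndexOf(".");
  // No dot at all, or the last dot is before the final "/": no extension.
  if (dot === -1 || dot < path.lastIndexOf("/")) {
    return false;
  }
  return extensions.indexOf(path.slice(dot + 1).toLowerCase()) !== -1;
}
```

In a browser-driven crawler, a check like this would typically sit in a request interception hook so that blocked resources are aborted before they are fetched.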

steve8708 commented 7 months ago

looks great, just a couple new merge conflicts then we're good to go

guillermoscript commented 7 months ago

> looks great, just a couple new merge conflicts then we're good to go

conflict resolved 👍

github-actions[bot] commented 7 months ago

:tada: This PR is included in version 1.0.0 :tada:

The release is available on:

Your semantic-release bot :package::rocket: