BuilderIO / gpt-crawler

Crawl a site to generate knowledge files to create your own custom GPT from a URL
https://www.builder.io/blog/custom-gpt
ISC License
18.14k stars 1.88k forks

Only one tag html for all the page #128

Open Th3Heavy opened 5 months ago

Th3Heavy commented 5 months ago

Hello, I only get a single html tag that contains all the content of the page. Is this normal? Shouldn't I get multiple tags, with the page information split up?

└─$ cat config.ts
import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://docs.ansible.com/ansible/latest",
  match: "https://docs.ansible.com/ansible/latest/collections/ansible/builtin/file_module.html",
  maxPagesToCrawl: 100000,
  outputFileName: "output.json",
};

└─$ cat output-1.json [ { "title": "Ansible Documentation — Ansible Documentation", "url": "https://docs.ansible.com/ansible/latest/", "html": "ANSIBLEFEST\nPRODUCTS\nCOMMUNITY\nWEBINARS & TRAINING\nBLOG\nDocumentation\n Ansible\n9\nSelect version:\nlatest\n2.9\ndevel\nSearch docs:\n\nANSIBLE GETTING STARTED\n\nGetting started with Ansible\nGetting started with Execution Environments\n\nINSTALLATION, UPGRADE & CONFIGURATION\n\nInstallation Guide\nAnsible Porting Guides\n\nUSING ANSIBLE\n\nBuilding Ansible inventories\nUsing Ansible command line tools\nUsing Ansible playbooks\nProtecting sensitive data with Ansible vault\nUsing Ansible modules and plugins\nUsing Ansible collections\nUsing Ansible on Windows and BSD\nAnsible tips and tricks\n\nCONTRIBUTING TO ANSIBLE\n\nAnsible Community Guide\nAnsible Collections Contributor Guide\nansible-core Contributors Guide\nAdvanced Contributor Guide\nAnsible documentation style guide\n\nEXTENDING ANSIBLE\n\nDeveloper Guide\n\nCOMMON ANSIBLE SCENARIOS\n\nLegacy Public Cloud Guides\nNetwork Technology Guides\nVirtualization and Containerization Guides\n\nNETWORK AUTOMATION\n\nNetwork Getting Started\nNetwork Advanced Topics\nNetwork Developer Guide\n\nANSIBLE GALAXY\n\nGalaxy User Guide\nGalaxy Developer Guide\n\nREFERENCE & APPENDICES\n\nCollection Index\nIndexes of all modules and plugins\nPlaybook Keywords\nReturn Values\nAnsible Configuration Settings\nControlling how Ansible behaves: precedence rules\nYAML Syntax\nPython 3 Support\nInterpreter Discovery\nReleases and maintenance\nTesting Strategies\nSanity Tests\nFrequently Asked Questions\nGlossary\nAnsible Reference: Module Utilities\nSpecial Variables\nRed Hat Ansible Automation Platform\nAnsible Automation Hub\nLogging Ansible output\n\nROADMAPS\n\nAnsible Roadmap\nansible-core Roadmaps\n\n\n\n\n\n\n Ansible Documentation\n\n\n\nDiscuss Ansible in the new Ansible Forum!\n\nThis is the latest (stable) community version of the Ansible documentation. 
For Red Hat customers, see the difference between Ansible community projects and Red Hat supported products or Ansible Automation Platform Life Cycle for subscriptions.\n\nAnsible Documentation\n\nWelcome to Ansible community documentation! This documentation covers the version of Ansible noted in the upper left corner of this page. We maintain multiple versions of Ansible and of the documentation, so please be sure you are using the version of the documentation that covers the version of Ansible you’re using. For recent features, we note the version of Ansible where the feature was added.\n\nAnsible releases a new major release approximately twice a year. The core application evolves somewhat conservatively, valuing simplicity in language design and setup. Contributors develop and change modules and plugins, hosted in collections, much more quickly.\n\nAnsible getting started\n\nGetting started with Ansible\nIntroduction to Ansible\nStart automating with Ansible\nBuilding an inventory\nCreating a playbook\nAnsible concepts\nGetting started with Execution Environments\nAnsible ecosystem\n\nInstallation, Upgrade & Configuration\n\nInstallation Guide\nInstalling Ansible\nInstalling Ansible on specific operating systems\nConfiguring Ansible\nAnsible Porting Guides\nAnsible 9 Porting Guide\nAnsible 8 Porting Guide\nAnsible 7 Porting Guide\nAnsible 6 Porting Guide\nAnsible 5 Porting Guide\nAnsible 4 Porting Guide\nAnsible 3 Porting Guide\nAnsible 2.10 Porting Guide\nAnsible 2.9 Porting Guide\nAnsible 2.8 Porting Guide\nAnsible 2.7 Porting Guide\nAnsible 2.6 Porting Guide\nAnsible 2.5 Porting Guide\nAnsible 2.4 Porting Guide\nAnsible 2.3 Porting Guide\nAnsible 2.0 Porting Guide\n\nUsing Ansible\n\nBuilding Ansible inventories\nHow to build your inventory\nWorking with dynamic inventory\nPatterns: targeting hosts and groups\nConnection methods and details\nUsing Ansible command line tools\nIntroduction to ad hoc commands\nWorking with command line tools\nAnsible CLI 
cheatsheet\nUsing Ansible playbooks\nAnsible playbooks\nWorking with playbooks\nExecuting playbooks\nAdvanced playbook syntax\nManipulating data\nProtecting sensitive data with Ansible vault\nAnsible Vault\nManaging vault passwords\nEncrypting content with Ansible Vault\nUsing encrypted variables and files\nConfiguring defaults for using encrypted content\nWhen are encrypted files made visible?\nFormat of files encrypted with Ansible Vault\nUsing Ansible modules and plugins\nIntroduction to modules\nModule maintenance and support\nRejecting modules\nWorking with plugins\nModules and plugins index\nUsing Ansible collections\nInstalling collections\nRemoving a collection\nDownloading collections\nListing collections\nVerifying collections\nUsing collections in a playbook\nCollections index\nUsing Ansible on Windows and BSD\nSetting up a Windows Host\nUsing Ansible and Windows\nWindows Remote Management\nDesired State Configuration\nWindows performance\nWindows Frequently Asked Questions\nManaging BSD hosts with Ansible\nAnsible tips and tricks\nGeneral tips\nPlaybook tips\nInventory tips\nExecution tricks\nSample Ansible setup\n\nContributing to Ansible\n\nAnsible Community Guide\nGetting started\nContributor path\nAnsible Collections Contributor Guide\nThe Ansible Collections Development Cycle\nRequesting changes to a collection\nCreating your first collection pull request\nTesting Collection Contributions\nReview checklist for collection PRs\nAnsible community package collections requirements\nGuidelines for collection maintainers\nContributing to Ansible-maintained Collections\nAnsible Community Steering Committee\nContributing to the Ansible Documentation\nOther Tools and Programs\nWorking with the Ansible collection repositories\nansible-core Contributors Guide\nReporting bugs and requesting features\nContributing to the Ansible Documentation\nThe Ansible Development Cycle\nOther Tools and Programs\nWorking with the Ansible repo\nAdvanced Contributor 
Guide\nCommitters Guidelines\nRelease Manager Guidelines\nGitHub Admins\nAnsible documentation style guide\nLinguistic guidelines\nreStructuredText guidelines\nMarkdown guidelines\nAccessibility guidelines\nMore resources\n\nExtending Ansible\n\nDeveloper Guide\nAdding modules and plugins locally\nShould you develop a module?\nDeveloping modules\nContributing your module to an existing Ansible collection\nConventions, tips, and pitfalls\nAnsible and Python 3\nDebugging modules\nModule format and documentation\nAdjacent YAML documentation files\nWindows module development walkthrough\nCreating a new collection\nTesting Ansible\nThe lifecycle of an Ansible module or plugin\nDeveloping plugins\nDeveloping dynamic inventory\nDeveloping ansible-core\nAnsible module architecture\nPython API\nRebasing a pull request\nUsing and developing module utilities\nAnsible collection creator path\nDeveloping collections\nMigrating Roles to Roles in Collections on Galaxy\nCollection Galaxy metadata structure\nAnsible architecture\n\nCommon Ansible Scenarios\n\nLegacy Public Cloud Guides\nNetwork Technology Guides\nVirtualization and Containerization Guides\n\nNetwork Automation\n\nNetwork Getting Started\nBasic Concepts\nHow Network Automation is Different\nRun Your First Command and Playbook\nBuild Your Inventory\nUse Ansible network roles\nBeyond the basics\nWorking with network connection options\nResources and next steps\nNetwork Advanced Topics\nNetwork Resource Modules\nAnsible Network Examples\nParsing semi-structured text with Ansible\nValidate data against set criteria with Ansible\nNetwork Debug and Troubleshooting Guide\nWorking with command output and prompts in network modules\nAnsible Network FAQ\nPlatform Options\nNetwork Developer Guide\nDeveloping network resource modules\nDeveloping network plugins\nDocumenting new network platforms\n\nAnsible Galaxy\n\nGalaxy User Guide\nFinding collections on Galaxy\nFinding roles on Galaxy\nInstalling roles from Galaxy\nGalaxy 
Developer Guide\nCreating collections for Galaxy\nCreating roles for Galaxy\n\nReference & Appendices\n\nCollection Index\nIndexes of all modules and plugins\nPlaybook Keywords\nReturn Values\nAnsible Configuration Settings\nControlling how Ansible behaves: precedence rules\nYAML Syntax\nPython 3 Support\nInterpreter Discovery\nReleases and maintenance\nTesting Strategies\nSanity Tests\nFrequently Asked Questions\nGlossary\nAnsible Reference: Module Utilities\nSpecial Variables\nRed Hat Ansible Automation Platform\nAnsible Automation Hub\nLogging Ansible output\n\nRoadmaps\n\nAnsible Roadmap\nAnsible project 9.0\nAnsible project 8.0\nAnsible project 7.0\nAnsible project 6.0\nAnsible project 5.0\nAnsible project 4.0\nAnsible project 3.0\nAnsible project 2.10\nOlder Roadmaps\nansible-core Roadmaps\nAnsible-core 2.17\nAnsible-core 2.16\nAnsible-core 2.15\nAnsible-core 2.14\nAnsible-core 2.13\nAnsible-core 2.12\nAnsible-core 2.11\nAnsible-base 2.10\nNext \n\n© Copyright Ansible project contributors. Last updated on Dec 12, 2023.\n\nSearch this site" } ]

supermario-ai commented 5 months ago

When our generation dies off, who will be around to tell you to RTFD? 🤣🤣

https://github.com/BuilderIO/gpt-crawler/blob/788001f96e7858c440c1078c92de3eb75fa76374/README.md?plain=1#L52-L93

~30% of the issues in this repo are people not RTFDing.

supermario-ai commented 5 months ago

@steve8708 not a code issue. Match != Exclude. Most peeps ain't gonna know that unless they parse the README, and I see the rest of the configs in there. I saw a few issues related to Login, but I suspect that's enablement related. Lastly, one run hit a record yesterday: ~9700 pages. ❤️ builder.io
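
For anyone landing here later, a rough sketch of the "Match != Exclude" point. This is not the crawler's real code: `SketchConfig`, `globToRegExp`, `matchesAny` and `shouldCrawl` are hypothetical names, the glob handling is a simplified assumption that only understands `*` and `**`, and the `exclude` URL is invented for illustration. The idea is that `match` is an allow-list of link patterns to follow, so a `match` without a wildcard follows exactly one URL, while `exclude` is a separate deny-list applied on top.

```typescript
// Hypothetical, simplified stand-in for the crawler's Config type;
// only the fields discussed in this thread are modelled.
type SketchConfig = {
  url: string;
  match: string | string[];
  exclude?: string | string[];
  maxPagesToCrawl: number;
  outputFileName: string;
};

// Minimal glob-to-RegExp conversion (assumption: only `*` and `**` are supported).
// `**` matches anything, `*` matches anything except a slash.
function globToRegExp(glob: string): RegExp {
  const pattern = glob
    .split("**")
    .map((part) =>
      part.replace(/[.+?^${}()|[\]\\]/g, "\\$&").replace(/\*/g, "[^/]*"),
    )
    .join(".*");
  return new RegExp(`^${pattern}$`);
}

// True when the URL matches at least one of the given globs.
function matchesAny(url: string, globs?: string | string[]): boolean {
  if (globs === undefined) return false;
  const list = Array.isArray(globs) ? globs : [globs];
  return list.some((glob) => globToRegExp(glob).test(url));
}

// A link is followed only if it is matched AND not excluded.
function shouldCrawl(url: string, config: SketchConfig): boolean {
  return matchesAny(url, config.match) && !matchesAny(url, config.exclude);
}

const base = "https://docs.ansible.com/ansible/latest";

// The config from the original post: `match` is one exact URL, no wildcard.
const narrow: SketchConfig = {
  url: base,
  match: `${base}/collections/ansible/builtin/file_module.html`,
  maxPagesToCrawl: 100000,
  outputFileName: "output.json",
};

// A broader variant: follow everything under builtin/, minus one page.
const broad: SketchConfig = {
  url: base,
  match: `${base}/collections/ansible/builtin/**`,
  exclude: `${base}/collections/ansible/builtin/index.html`, // illustrative only
  maxPagesToCrawl: 100000,
  outputFileName: "output.json",
};

console.log(shouldCrawl(`${base}/collections/ansible/builtin/copy_module.html`, narrow));
console.log(shouldCrawl(`${base}/collections/ansible/builtin/copy_module.html`, broad));
console.log(shouldCrawl(`${base}/collections/ansible/builtin/index.html`, broad));
```

With the narrow config only the exact `file_module.html` link would be followed, which is consistent with a near-empty crawl. Widening `match` with a wildcard and trimming with `exclude` is the usual way to get more pages in the output.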