BuilderIO / gpt-crawler

Crawl a site to generate knowledge files to create your own custom GPT from a URL
https://www.builder.io/blog/custom-gpt
ISC License
18.16k stars, 1.88k forks

Add help file to crawl github repos #51

Open zackees opened 7 months ago

zackees commented 7 months ago

I would love to create a gpt out of a github repo. Can you please add this?

K thx bai

haydenwhayne commented 7 months ago

@zackees I found it very difficult to get this working with GitHub repos, so I created my own repo based on this one, focused on crawling GitHub repos through GitHub's API, if you want to check it out: https://github.com/phloai/gpt-github-crawler

nicholascross commented 7 months ago

I was after the same thing and had similar difficulty. I think my problem was that it wouldn't automatically traverse the subfolders, perhaps because the content is not loaded until it is interacted with 🤔🤷‍♂️

I experimented with these selectors: #repo-content-turbo-frame, #read-only-cursor-text-area, #repos-file-tree.

Thinking on it a bit more: since the repository is fully retrievable, perhaps this kind of thing could be done more effectively by cloning the repo and then traversing the file system. Perhaps a web crawler is not really required for this.
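A minimal sketch of the clone-then-traverse idea, assuming the repository has already been cloned to a local directory (for example with `git clone`). The function names `collect_files` and `write_knowledge_file` and the `path`/`content` record shape are hypothetical, chosen for illustration; they are not taken from gpt-crawler or any of the tools linked in this thread.

```python
import json
import os
from pathlib import Path


def collect_files(root):
    """Walk a local checkout and return one record per readable text file."""
    root = Path(root)
    records = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune version-control metadata in place so os.walk never descends into it.
        dirnames[:] = sorted(d for d in dirnames if d != ".git")
        for name in sorted(filenames):
            path = Path(dirpath) / name
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            records.append(
                {"path": path.relative_to(root).as_posix(), "content": text}
            )
    return records


def write_knowledge_file(root, output_path):
    """Dump the collected records into a single JSON file (hypothetical format)."""
    records = collect_files(root)
    Path(output_path).write_text(json.dumps(records, indent=2), encoding="utf-8")
    return len(records)
```

Calling `write_knowledge_file("path/to/checkout", "output.json")` would produce one JSON file that could be uploaded as a knowledge file. Because every file is read straight from disk, nothing depends on how GitHub's web UI lazily loads folder contents.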

haydenwhayne commented 7 months ago

@nicholascross Yep, I agree. The GitHub repo I linked above allows crawling both remote and local repos, so you can clone the repository and run it in local mode to traverse the file system. You can still add match patterns to specify which files and file types you want.
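As a rough illustration of match patterns applied to a local checkout, the sketch below filters repository-relative paths with shell-style globs from Python's standard `fnmatch` module. The `filter_paths` helper and its `include`/`exclude` parameters are assumptions made for this example and do not reflect the actual options of gpt-github-crawler or gpt-crawler.

```python
from fnmatch import fnmatchcase


def matches_any(rel_path, patterns):
    """Return True if the path matches at least one glob pattern."""
    # fnmatchcase is case-sensitive on every platform; note that "*" also
    # matches "/" here, so "*.py" matches files in any subfolder.
    return any(fnmatchcase(rel_path, pattern) for pattern in patterns)


def filter_paths(paths, include, exclude=()):
    """Keep paths that match an include pattern and no exclude pattern."""
    return [
        p for p in paths
        if matches_any(p, include) and not matches_any(p, exclude)
    ]


selected = filter_paths(
    ["src/a.py", "README.md", "node_modules/x.js", "docs/guide.md"],
    include=["*.py", "*.md"],
    exclude=["node_modules/*"],
)
print(selected)
```

Filtering on relative paths before any file is read keeps large dependency folders and binary assets out of the output.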

nicholascross commented 7 months ago

I found that local filesystem crawling has already been requested in the issue below, so maybe go upvote it if you want it.

https://github.com/BuilderIO/gpt-crawler/issues/92

Unlikely to be useful for anyone unless they are a Swift dev experimenting in this space, but I ended up going down the local checkout path myself.

https://github.com/nicholascross/SourceCrawler

I found it interesting that once I had the first version of this, which used heuristic regexes for type extraction, I was able to use the crawling output with a GPT agent to add AST-based type extraction using a "third party" library I had no experience with. 🤯
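To illustrate the difference between heuristic regex extraction and AST-based extraction, here is a small sketch in Python. It is not how SourceCrawler works (that tool targets Swift); the names `types_by_regex`, `types_by_ast`, and `SAMPLE` are hypothetical and the example only finds class declarations.

```python
import ast
import re

# Heuristic: any line that begins with "class Name" is treated as a type.
CLASS_RE = re.compile(r"^\s*class\s+([A-Za-z_]\w*)", re.MULTILINE)


def types_by_regex(source):
    """Find class names by pattern matching on raw text."""
    return CLASS_RE.findall(source)


def types_by_ast(source):
    """Find class names by parsing the source into a syntax tree."""
    tree = ast.parse(source)
    return [node.name for node in ast.walk(tree) if isinstance(node, ast.ClassDef)]


# The second "class" lives inside a string literal, not in real code.
SAMPLE = '''
class Real:
    pass

DOC = """
class Fake:
    not real code
"""
'''

print(types_by_regex(SAMPLE))
print(types_by_ast(SAMPLE))
```

The regex version reports both `Real` and `Fake`, because it cannot tell code from text inside a string, while the parser-based version reports only `Real`.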

granmoe commented 7 months ago

I ended up creating a repo crawler as well. Mine supports either crawling a public repo based on its URL, or crawling the locally checked out repo:

https://github.com/granmoe/github-repo-gpt-scraper