BuilderIO / gpt-crawler

Crawl a site to generate knowledge files to create your own custom GPT from a URL
https://www.builder.io/blog/custom-gpt
ISC License
18.15k stars, 1.88k forks

Comparison with well-established crawlers #75

Closed dandv closed 7 months ago

dandv commented 7 months ago

How exactly is this project different from an established crawler that would just dump the HTML text into the .html field of a JSON array?

It's got 12k stars, but it lacks basic features like canonicalizing links (see #73) or preserving links (#74).
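For readers unfamiliar with the term, link canonicalization means mapping different spellings of the same URL to one form so a crawler does not fetch or store a page twice. A minimal sketch using only the standard `URL` API follows. The helper name `canonicalizeUrl`, the choice to strip `utm_*` parameters, and the trailing-slash rule are assumptions for illustration, not code from gpt-crawler or Crawlee.

```typescript
// Hypothetical helper: canonicalizeUrl is not part of gpt-crawler or Crawlee.
function canonicalizeUrl(raw: string, base?: string): string {
  // The URL parser resolves relative links against `base`
  // and lowercases the scheme and host.
  const url = new URL(raw, base);

  // Fragments are never sent to the server, so drop them.
  url.hash = "";

  // Drop tracking parameters (assumed rule: anything starting with "utm_").
  for (const key of Array.from(url.searchParams.keys())) {
    if (key.startsWith("utm_")) {
      url.searchParams.delete(key);
    }
  }

  // Give the remaining parameters a stable order.
  url.searchParams.sort();

  // Assumption: "/docs/" and "/docs" are the same page. Some servers disagree.
  if (url.pathname.length > 1 && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.slice(0, -1);
  }

  return url.toString();
}

console.log(canonicalizeUrl("HTTPS://Example.com/docs/?utm_source=x&b=2&a=1#intro"));
console.log(canonicalizeUrl("../about/", "https://example.com/docs/page"));
```

A crawler could key its "already visited" set on the returned string, so that both spellings of a page count as one entry.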

FTAndy commented 7 months ago

Yeah, the idea of the project is great, but it lacks so many features to make it perfect.

steve8708 commented 7 months ago

Absolutely all good if this isn't the right project for you. Work is actively underway to keep improving the project for the custom GPTs use case, and specific feedback and PRs to improve things are always highly appreciated.

dandv commented 7 months ago

Thanks, Steve. I understand this is open source, and I know how it works. I've made several suggestions already.

I'm simply asking whether it wouldn't be more productive to create an output plugin for an established crawler than to reinvent the crawling wheel, with the only differentiating feature being rather trivial, if I understand correctly (outputting the bare text extracted from an HTML element to a JSON file).

steve8708 commented 7 months ago

Could you suggest some examples of well-established crawlers that you think would be better to integrate with?

This project is built on Crawlee, which is a pretty robust crawler, but I'm certainly open to better alternatives.

razaanstha commented 4 months ago

I guess instead of just returning plain text, maybe try something like turndown, so it preserves the links and converts HTML to Markdown in a well-formatted and highly customizable way? You can try the demo here
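To make the suggestion concrete, here is a dependency-free toy that shows what "preserving links" would mean for the extracted text. It is not turndown and does not use turndown's API. It is a regex pass that only handles simple, well-formed HTML with double-quoted `href` attributes, and the name `htmlToTextWithLinks` is hypothetical. A real implementation would use a proper HTML parser or a converter such as turndown.

```typescript
// Hypothetical helper, NOT turndown's API: a toy regex pass for simple HTML only.
function htmlToTextWithLinks(html: string): string {
  return html
    // Rewrite <a href="URL">label</a> as the Markdown link [label](URL).
    .replace(/<a\s[^>]*?href="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi, "[$2]($1)")
    // Drop every remaining tag, keeping its text content.
    .replace(/<[^>]+>/g, "")
    // Collapse runs of whitespace left behind by the removed markup.
    .replace(/\s+/g, " ")
    .trim();
}

const sample = '<p>See the <a href="https://example.com/docs">docs</a> for <b>details</b>.</p>';
console.log(htmlToTextWithLinks(sample));
```

Plain-text extraction of the same sample would yield only "See the docs for details." and lose the URL, which is the gap the comment above points at.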