BuilderIO / gpt-crawler

Crawl a site to generate knowledge files to create your own custom GPT from a URL
https://www.builder.io/blog/custom-gpt
ISC License
18.16k stars 1.88k forks

Turning a website into JSON data doesn't make the GPT more useful. #21

Closed sudo888samewick closed 7 months ago

sudo888samewick commented 7 months ago

There is a similar project whose general idea is to write all your local files (as a tree structure) into a single output JSON file, using the full path of each file as the JSON key and the file content as the value.
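For anyone unfamiliar with that layout, here is a minimal sketch of the idea, assuming plain-text files and Python 3.9+. The function names `tree_to_json` and `write_output` are made up for illustration and are not taken from the project mentioned above.

```python
import json
from pathlib import Path


def tree_to_json(root: str) -> dict[str, str]:
    """Map each file's full path to its text content."""
    result = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # errors="replace" keeps binary or oddly encoded files from aborting the walk
            result[str(path.resolve())] = path.read_text(encoding="utf-8", errors="replace")
    return result


def write_output(root: str, out_file: str = "output.json") -> None:
    """Dump the path -> content mapping to a single JSON file."""
    # Collect first, so the output file is not picked up if it lives under root.
    data = tree_to_json(root)
    with open(out_file, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
```

The resulting file grows with the total size of every file under `root`, which is exactly why it can get too large to hand to a model in one piece.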

However, I found that doing so did not make the GPT application any smarter. Because GPT's context length is limited, the model struggles to process the data once it gets relatively large (in fact, just a few HTML pages are enough to reach that point).

Generating an output.json for a website is fine, but output.json can be a large file, which is hard for GPT-4 to read.

FacadeCloud commented 7 months ago

Yes. GPT just uses Code Interpreter to read excerpts from the JSON file based on keywords. The workaround is to vectorize the crawled files and expose an embeddings API for GPT to query on the fly.
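As a rough illustration of that workaround, here is a self-contained sketch of chunking crawled pages, turning each chunk into a vector, and ranking chunks against a query. Two loud assumptions: the crawler output is taken to be a list of objects with `url` and `html` fields, and `embed` is a toy word-count stand-in for a real embedding model. All function names are hypothetical. A real setup would swap in an actual embedding model and put `search` behind an HTTP endpoint for the GPT to call.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def chunk(text: str, size: int = 200) -> list[str]:
    """Split text into pieces of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(pages: list[dict]) -> list[tuple[str, str, Counter]]:
    """Return (url, chunk, vector) triples for every chunk of every page."""
    return [(p["url"], c, embed(c)) for p in pages for c in chunk(p["html"])]


def search(index, query: str, k: int = 3) -> list[tuple[float, str, str]]:
    """Return the k best (score, url, chunk) matches for the query."""
    q = embed(query)
    scored = [(cosine(q, vec), url, text) for url, text, vec in index]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]
```

Usage would look like `search(build_index(pages), "how to install sdk", k=1)`, where `pages` is the parsed crawler output. Only the top-ranked chunks are returned, not the whole file.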

mike24dzy commented 7 months ago

> Yes. GPT just uses Code Interpreter to read excerpts from the JSON file based on keywords. The workaround is to vectorize the crawled files and expose an embeddings API for GPT to query on the fly.

But this can't be achieved when building and customizing a GPT in ChatGPT, right? I tried uploading the crawled JSON file while configuring a GPT in ChatGPT, but it doesn't seem to use the file at all except by invoking Code Interpreter. I had assumed that if you upload a file while configuring the GPT, it would automatically embed the file into vectors and retrieve from them whenever a question requires looking up the file. Sorry if I've misunderstood this; could somebody help explain?

steve8708 commented 7 months ago

The purpose of this project is to build custom GPTs or Assistants.

When asked questions whose answers exist in the knowledge files, the GPT searches the files for relevant information and feeds those tidbits into the prompt.

I don't know if they publicly document how this works under the hood anywhere, but I think it's safe to assume (given how this is commonly implemented) that they take the knowledge files, generate embeddings, use a vector search to pull the relevant info, and feed that into the prompt.

That way you don't overflow the token window, and the model sees only the specific information relevant to the question.
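To make the token-window point concrete, here is a small sketch of how retrieved snippets could be packed into a prompt under a budget. This is a guess at the general pattern, not a description of what OpenAI actually does. The names `estimate_tokens` and `build_prompt` are hypothetical, and the four-characters-per-token figure is a rough heuristic for English text rather than a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def build_prompt(question: str, ranked_snippets: list[str], budget: int = 1000) -> str:
    """Add snippets in ranked order until the estimated token budget is spent."""
    chosen = []
    used = estimate_tokens(question)
    for snippet in ranked_snippets:
        cost = estimate_tokens(snippet)
        if used + cost > budget:
            break  # stop at the first snippet that would overflow the budget
        chosen.append(snippet)
        used += cost
    context = "\n---\n".join(chosen)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

The snippets are assumed to arrive already sorted by relevance (for example from a vector search), so whatever gets cut is the least relevant material.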

In my experience, this has worked really well, as intended. ChatGPT knows very little about Builder.io, but when fed our docs and other info through these knowledge files, via the newly released custom GPTs or the Assistants API, it is considerably more capable at tasks related to the Builder.io platform.

FacadeCloud commented 7 months ago

No, you can't. You have to either configure a GPT that calls a document retrieval API you coded yourself, or code a ChatGPT-like chatbot that uses the OpenAI API to make the embeddings for you. You can't have your cake and eat it too. I believe it's a gap that OpenAI will fill in the future.

In reply to Zhengyang DU's message of Wednesday, November 22, 2023, 00:28.
