Open centminmod opened 1 year ago
Great work @enricoros on the latest updates! I'm trying to understand why the app's OpenAI token limit handling works the way it does. When we hit the 4097-token limit, we need to remove previous messages. But why doesn't the https://chat.openai.com/chat implementation of GPT3.5 need to do so? Is their method of implementation possible for your app?
Do you know what their method of implementation is? That's a very interesting concept.
Possibly the chat that's submitted includes the system message and then walks back from the last message until the context window is full. I could actually do this.
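A minimal sketch of that sliding-window idea, assuming a rough characters/4 token estimate in place of a real tokenizer (the function name and message shape are just illustrative):

```js
// Sketch: keep the system message, then add history messages from newest to oldest
// until the token budget (context window) is full.
const estimateTokens = (text) => Math.ceil(text.length / 4); // rough guess, not a real tokenizer

function buildContext(systemMessage, history, tokenBudget) {
  const context = [];
  let used = estimateTokens(systemMessage.text);

  // Walk back from the last message; stop once the next one would overflow the budget
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = estimateTokens(history[i].text);
    if (used + cost > tokenBudget) break;
    context.unshift(history[i]);
    used += cost;
  }

  return [systemMessage, ...context];
}
```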
Let me know if you can reverse engineer how they do it, before I take a stab.
Yeah, I'm unfamiliar with how OpenAI's ChatGPT 3.5 works, so I can't help there. But I asked ChatGPT Plus (GPT-4) the question and it gave me this :)
Q. How does https://chat.openai.com/chat GPT3.5 allow continuous chat conversations even after reaching the 4097 token limit, while other app implementations using the OpenAI API hit a limit at 4097 tokens and require deleting previous messages first?
A. The OpenAI API has a 4096 token limit per request, which means that any message longer than 4096 tokens will need to be broken up into multiple requests. However, the web-based chat interface at https://chat.openai.com/chat uses a technique called "streaming" to allow for continuous conversations that exceed the 4096 token limit.
Streaming involves breaking up the conversation into smaller "chunks" and sending them to the API one at a time, while keeping the connection between the client (i.e. the web browser) and the server (i.e. the OpenAI API) open. As new chunks are received from the server, they are displayed in the chat interface in real-time, allowing for a seamless conversation experience.
In the case of the OpenAI chat interface, each chunk is limited to 2048 tokens, which is half of the API limit. When a user types a message that exceeds 2048 tokens, it is broken up into multiple chunks and sent to the server one chunk at a time. The server processes each chunk and sends back a response, which is displayed in the chat interface. This process continues until the conversation is complete.
By using this streaming approach, the OpenAI chat interface is able to provide a seamless conversation experience, even for long conversations that exceed the 4096 token limit. However, it's worth noting that this approach requires a persistent connection between the client and server, which may not be feasible for all applications depending on the resources available.
Q. Can you provide more technical details of OpenAI Streaming method
A. Sure, here are some additional technical details about how OpenAI's streaming method works:
When a user types a message in the chat interface, the message is initially stored in a local buffer in the user's web browser. The buffer is then split into smaller "chunks" of text, with each chunk containing up to 2048 tokens.
The first chunk is then sent to the OpenAI API using a WebSocket connection. A WebSocket is a protocol for creating a persistent, bidirectional communication channel between a client and a server. The WebSocket allows the client (i.e. the user's web browser) to send data to the server (i.e. the OpenAI API) and receive data from the server in real-time.
When the OpenAI API receives the first chunk of the user's message, it processes it and sends a response back to the client over the same WebSocket connection. The response contains the first part of the AI's generated text, based on the input chunk.
The client then displays the response in the chat interface and sends the next chunk of the user's message to the API using the same WebSocket connection. This process repeats until the entire user message has been sent to the API and the AI has generated a complete response.
Throughout the conversation, the WebSocket connection between the client and the OpenAI API remains open, allowing for real-time communication without the need for multiple API requests. This approach minimizes the latency and overhead associated with traditional REST API requests and allows for a more seamless user experience.
It's worth noting that the use of WebSockets and streaming requires additional server-side infrastructure to handle the persistent connections and manage the flow of data. However, for applications that require real-time communication and support for long messages, this approach can provide significant benefits in terms of performance and user experience.
Q. Can such streaming be done in a NextJS/NodeJS web app on Vercel or Cloudflare Pages?
A. Yes, it's possible to implement streaming in a NextJS/NodeJS web app hosted on platforms like Vercel or Cloudflare Pages.
To implement streaming, you would need to use a WebSocket library like Socket.io or WebSocket-Node to create a persistent connection between the client and the server. You would then need to configure the server to handle incoming WebSocket requests and send responses back to the client in real-time.
In a NextJS/NodeJS web app, you can create a WebSocket server using the ws package, which is a popular WebSocket library for Node.js. To handle incoming WebSocket requests, you can define a WebSocket endpoint in your server code and use the ws library to create a WebSocket server instance. Here's an example of what this might look like:
```js
const WebSocket = require('ws');

const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', (ws) => {
  console.log('Client connected');

  ws.on('message', (message) => {
    console.log(`Received message: ${message}`);
    // Process the message and send a response back to the client
    const response = processMessage(message);
    ws.send(response);
  });

  ws.on('close', () => {
    console.log('Client disconnected');
  });
});
```
In this example, we create a WebSocket server instance on port 8080 and define an event handler for incoming connections. When a client connects, we log a message to the console and define event handlers for incoming messages and connection closures. When a message is received, we process it and send a response back to the client using the ws.send() method.
Once you have set up the WebSocket server, you can use a WebSocket client library like Socket.io or WebSocket-Node on the client side to establish a WebSocket connection and start sending and receiving messages in real-time.
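As a rough sketch of that client side, the browser's built-in WebSocket API can also be used directly (assuming the example server above is reachable at ws://localhost:8080):

```js
// Sketch: connect to the example server and exchange messages from the browser.
const socket = new WebSocket('ws://localhost:8080');

socket.addEventListener('open', () => {
  console.log('Connected');
  socket.send('Hello from the client');
});

socket.addEventListener('message', (event) => {
  // Handle each server message as it arrives
  console.log(`Received: ${event.data}`);
});

socket.addEventListener('close', () => {
  console.log('Disconnected');
});
```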
Overall, implementing streaming in a NextJS/NodeJS web app hosted on platforms like Vercel or Cloudflare Pages requires some additional setup and infrastructure compared to a traditional REST API. However, it can provide significant benefits in terms of performance and user experience, especially for applications that require real-time communication and support for long messages.
Q. What additional setup and infrastructure needed for Vercel and Cloudflare Pages for Streaming setup
A. To implement streaming in a NextJS/NodeJS web app hosted on Vercel or Cloudflare Pages, you would need to set up a WebSocket server and ensure that your hosting platform supports WebSocket connections.
Here are the steps you would need to follow:
1. Set up a WebSocket server: You can use a library like ws or Socket.io to create a WebSocket server in your NextJS/NodeJS app. You would define an endpoint that listens for WebSocket connections and handles incoming messages. The server would then use the WebSocket connection to send responses back to the client in real-time.
2. Configure your hosting platform to support WebSocket connections: Vercel and Cloudflare Pages both support WebSocket connections, but you may need to configure your app to use a WebSocket-compatible hosting plan. For example, on Vercel, you would need to use the "Pro" or "Enterprise" hosting plans to enable WebSocket connections. On Cloudflare Pages, you would need to configure your app to use a WebSocket route that maps to your WebSocket endpoint.
3. Use a WebSocket client library: On the client side, you would use a WebSocket client library like Socket.io or WebSocket-Node to establish a WebSocket connection and start sending and receiving messages in real-time. You would need to ensure that your client library is compatible with your server-side WebSocket library and that it can connect to your WebSocket endpoint.
Overall, implementing streaming in a NextJS/NodeJS web app hosted on Vercel or Cloudflare Pages requires some additional setup and configuration compared to a traditional REST API. However, with the right libraries and infrastructure in place, you can provide a seamless real-time experience for your users.
Thanks for digging. I believe this may not be fully correct because it's missing streaming with chunking. I'll inspect the network packets :)
This Twitter thread claims through experimentation that ChatGPT uses an 8k token sliding window, i.e. what @enricoros suggested: https://twitter.com/goodside/status/1598874674204618753
Thanks for looking into this @typpo. I will probably have it as an option, or figure out how to make it explicit for the user (e.g. by graying out).
Just had an (obvious) epiphany, @typpo @centminmod @fredliubojin. This should not be a sliding window; what fills the context should be the result of an attention operation. Maybe a way you can think about it is by recalling the pinecone feature #9, but better.
Assume we have a query ready to be submitted by the user: the messages in the history should be prioritized by the dot product of their embeddings with the embedding of the user message, which means the context will be filled with the most relevant content first. This would allow the model, for instance, to recall earlier parts of the conversation when the user refers back to them.
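A rough sketch of what that could look like, assuming the OpenAI embeddings endpoint with text-embedding-ada-002, a crude characters/4 token estimate instead of a real tokenizer, and Node 18+ for the global fetch (the function names are just illustrative):

```js
// Sketch: rank history messages by embedding similarity to the new user query,
// then fill the context with the most relevant ones until the token budget is spent.
async function embed(texts, apiKey) {
  const res = await fetch('https://api.openai.com/v1/embeddings', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify({ model: 'text-embedding-ada-002', input: texts }),
  });
  return (await res.json()).data.map((d) => d.embedding);
}

// ada-002 embeddings are normalized, so the dot product behaves like cosine similarity
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);
const estimateTokens = (text) => Math.ceil(text.length / 4); // rough guess, not a real tokenizer

async function fillContextByRelevance(history, userMessage, tokenBudget, apiKey) {
  const [queryEmb, ...historyEmbs] = await embed(
    [userMessage, ...history.map((m) => m.text)], apiKey);

  // Score each history message by its dot product with the query embedding
  const ranked = history
    .map((m, i) => ({ ...m, index: i, score: dot(historyEmbs[i], queryEmb) }))
    .sort((a, b) => b.score - a.score);

  // Greedily take the most relevant messages until the budget is exhausted
  const selected = [];
  let used = estimateTokens(userMessage);
  for (const m of ranked) {
    const cost = estimateTokens(m.text);
    if (used + cost > tokenBudget) break;
    selected.push(m);
    used += cost;
  }

  // Restore chronological order so the model still sees a coherent conversation
  return selected.sort((a, b) => a.index - b.index);
}
```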
@enricoros just found https://www.phind.com/. It has a unique way of consuming longer token/character prompts: the 1st prompt input takes 3000 characters max, so I split my text at 2900 characters using my tool at https://slicer.centminmod.com/, and then there is a 2nd input form for additional code. I loaded the rest of the split text into it and it managed to consume my script and provide a summary. Not sure how it works, but maybe another related idea :)
Thanks. We actually have an impressive feature in the codebase, made by @fredliubojin, that is only turned on for PDFs right now. It keeps using GPT to compress the useful information in the prompt. Could that be the way Phind is doing it?
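The general shape of such GPT-based compression might look roughly like this (a sketch only, not the actual feature from the codebase; the prompt wording, the gpt-3.5-turbo model choice, and the message shape are assumptions, and Node 18+ global fetch is assumed):

```js
// Sketch: ask the model to compress older messages into a short summary,
// then keep only the recent messages verbatim in the prompt.
async function compressHistory(olderMessages, apiKey) {
  const transcript = olderMessages.map((m) => `${m.role}: ${m.text}`).join('\n');

  const res = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify({
      model: 'gpt-3.5-turbo',
      messages: [
        { role: 'system', content: 'Compress this conversation into a short summary, keeping every fact needed to continue it.' },
        { role: 'user', content: transcript },
      ],
    }),
  });
  return (await res.json()).choices[0].message.content;
}

// The summary can then replace the older messages in the prompt, e.g.:
//   [systemMsg, { role: 'system', content: `Earlier conversation summary: ${summary}` }, ...recentMessages]
```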
Compressed info sounds interesting. No idea what Phind is doing, heh.
I added a screenshot in the previous reply showing Phind's 2nd additional input box.