TheoKanning / openai-java

OpenAI API Client in Java

Guidance on GPT-4 Vision feature #495

Closed: cryptoapebot closed this issue 2 months ago

cryptoapebot commented 2 months ago

This isn't an issue so much as just a question. Can I use GPT-4-Turbo-2024-04-09 as the model in the /images endpoint?

OpenAI states that the new GPT-4 + Vision models can accept images as input: https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4

Does anyone have an example of doing that, and of handling the response?

sashirestela commented 2 months ago

Hi @cryptoapebot, here you have a working example of vision + streaming with that model. Two versions:

  1. Demo for an external image
  2. Demo for a local image

This code runs using the simple-openai library:

package io.github.sashirestela.openai.playground;

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Base64;
import java.util.List;

import io.github.sashirestela.openai.OpenAI;
import io.github.sashirestela.openai.SimpleOpenAI;
import io.github.sashirestela.openai.domain.chat.ChatRequest;
import io.github.sashirestela.openai.domain.chat.content.ContentPartImage;
import io.github.sashirestela.openai.domain.chat.content.ContentPartText;
import io.github.sashirestela.openai.domain.chat.content.ImageUrl;
import io.github.sashirestela.openai.domain.chat.message.ChatMsgUser;

public class DemoVision {

    private SimpleOpenAI openai;
    private OpenAI.ChatCompletions chatService;

    public DemoVision() {
        openai = SimpleOpenAI.builder()
                .apiKey(System.getenv("OPENAI_API_KEY"))
                .build();
        chatService = openai.chatCompletions();
    }

    public void demoCallChatWithVisionExternalImage() {
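        // The image is passed by public URL inside a user message content part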
        var chatRequest = ChatRequest.builder()
                .model("gpt-4-turbo-2024-04-09")
                .messages(List.of(
                        new ChatMsgUser(List.of(
                                new ContentPartText(
                                        "What do you see in the image? Give details in no more than 100 words."),
                                new ContentPartImage(new ImageUrl(
                                        "https://upload.wikimedia.org/wikipedia/commons/e/eb/Machu_Picchu%2C_Peru.jpg"))))))
                .temperature(0.0)
                .maxTokens(500)
                .build();
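        // Join the streaming future, then print each content delta as it arrives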
        var chatResponse = chatService.createStream(chatRequest).join();
        chatResponse.filter(chatResp -> chatResp.firstContent() != null)
                .map(chatResp -> chatResp.firstContent())
                .forEach(System.out::print);
        System.out.println();
    }

    public void demoCallChatWithVisionLocalImage() {
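        // Same flow, but the image is embedded inline as a base64 data URL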
        var chatRequest = ChatRequest.builder()
                .model("gpt-4-turbo-2024-04-09")
                .messages(List.of(
                        new ChatMsgUser(List.of(
                                new ContentPartText(
                                        "What do you see in the image? Give details in no more than 100 words."),
                                new ContentPartImage(loadImageAsBase64("src/main/resources/machupicchu.jpg"))))))
                .temperature(0.0)
                .maxTokens(500)
                .build();
        var chatResponse = chatService.createStream(chatRequest).join();
        chatResponse.filter(chatResp -> chatResp.firstContent() != null)
                .map(chatResp -> chatResp.firstContent())
                .forEach(System.out::print);
        System.out.println();
    }

    private ImageUrl loadImageAsBase64(String imagePath) {
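        // Builds a data URL of the form data:image/<extension>;base64,<payload>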
        try {
            Path path = Paths.get(imagePath);
            byte[] imageBytes = Files.readAllBytes(path);
            String base64String = Base64.getEncoder().encodeToString(imageBytes);
            var extension = imagePath.substring(imagePath.lastIndexOf('.') + 1);
            var prefix = "data:image/" + extension + ";base64,";
            return new ImageUrl(prefix + base64String);
        } catch (Exception e) {
            e.printStackTrace();
            return null;
        }
    }

    public static void main(String[] args) {
        var demoVision = new DemoVision();
        demoVision.demoCallChatWithVisionExternalImage();
        demoVision.demoCallChatWithVisionLocalImage();
    }
}
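
If streaming is not required, the blocking variant is a small change. Here is a minimal sketch of a method you could add to the class above, assuming chatService.create(...) exists alongside createStream(...) and that its result also exposes firstContent():

    public void demoCallChatWithVisionBlocking() {
        // Assumption: create(...) is the non-streaming counterpart of createStream(...)
        var chatRequest = ChatRequest.builder()
                .model("gpt-4-turbo-2024-04-09")
                .messages(List.of(
                        new ChatMsgUser(List.of(
                                new ContentPartText("Describe the image in one sentence."),
                                new ContentPartImage(new ImageUrl(
                                        "https://upload.wikimedia.org/wikipedia/commons/e/eb/Machu_Picchu%2C_Peru.jpg"))))))
                .maxTokens(100)
                .build();
        // join() blocks until the full (non-streamed) response is available
        var chatResponse = chatService.create(chatRequest).join();
        System.out.println(chatResponse.firstContent());
    }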
sashirestela commented 2 months ago

@cryptoapebot To extend my answer: to generate images you should use only the dall-e-2 and dall-e-3 models. The vision feature (reading images and describing them) belongs to the chat completions service, where you should use one of the GPT models, including gpt-4-turbo-2024-04-09. You can take a look at this OpenAI model endpoint compatibility table:

https://platform.openai.com/docs/models/model-endpoint-compatibility
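
For completeness, here is a minimal sketch of calling the /images endpoint with this repo's openai-java library; it assumes a recent library version in which CreateImageRequest exposes a model field (for dall-e-3 support):

import com.theokanning.openai.image.CreateImageRequest;
import com.theokanning.openai.service.OpenAiService;

public class DemoImageGeneration {
    public static void main(String[] args) {
        var service = new OpenAiService(System.getenv("OPENAI_API_KEY"));
        // Image generation requires a dall-e model; GPT models are not accepted here
        var request = CreateImageRequest.builder()
                .model("dall-e-3") // assumption: model field available in recent versions
                .prompt("A watercolor painting of Machu Picchu at dawn")
                .n(1)
                .build();
        // The result holds hosted URLs for the generated image(s)
        System.out.println(service.createImage(request).getData().get(0).getUrl());
    }
}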

xyifhgvnlo286 commented 2 months ago

[openai4j](https://github.com/Lambdua/openai4j) is a fork of this library that already supports GPT-4 vision:

// 'service' is assumed to be an initialized openai4j service instance
final List<ChatMessage> messages = new ArrayList<>();
final ChatMessage systemMessage = new SystemMessage("You are a helpful assistant.");
// Here, the imageMessage is intended for image recognition
final ChatMessage imageMessage = UserMessage.buildImageMessage("What's in this image?",
        "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg");
messages.add(systemMessage);
messages.add(imageMessage);
ChatCompletionRequest chatCompletionRequest = ChatCompletionRequest.builder()
        .model("gpt-4-turbo")
        .messages(messages)
        .n(1)
        .maxTokens(200)
        .build();
ChatCompletionChoice choice = service.createChatCompletion(chatCompletionRequest).getChoices().get(0);
// The assistant reply lives in the choice's message content
System.out.println(choice.getMessage().getContent());

cryptoapebot commented 2 months ago

Thank you!