Add file upload to gemini.

0wwafa commented 3 weeks ago

Please add file upload (text, images, pdf, etc)

fjosue4 commented 3 weeks ago

Hey @0wwafa, I'll need to check whether the API now accepts files that aren't already part of the user's Google Account files. The last time I checked, Gemini Pro Vision required being signed in to upload files; uploads were not accepted with just the API key that this basic app requires.

If files can be passed without OAuth 2.0 I can add that feature.

I'll make sure to keep you posted.

0wwafa commented 3 weeks ago

https://ai.google.dev/api/files

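# Assumes the google-generativeai package is installed and an API key is
# configured; `media` is a pathlib.Path to the folder with the sample files.
import google.generativeai as genai

# Upload the file; the returned name has the form "files/*".
myfile = genai.upload_file(media / "poem.txt")
file_name = myfile.name
print(file_name)  # "files/*"

# Fetch the uploaded file's metadata by name.
myfile = genai.get_file(file_name)
print(myfile)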

@fjosue4

document = genai.upload_file(path=media / "a11.txt")
model_name = "gemini-1.5-flash-001"
cache = genai.caching.CachedContent.create(
    model=model_name,
    system_instruction="You are an expert analyzing transcripts.",
    contents=[document],
)
print(cache)

model = genai.GenerativeModel.from_cached_content(cache)
response = model.generate_content("Please summarize this transcript")
print(response.text)

fjosue4 commented 3 weeks ago

@0wwafa I just tested it, but no luck: TypeError: (0 , import_fs.readFileSync) is not a function

It seems the library is not meant to run in the browser.

Over the next few days I'll test with this other documentation: https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/gemini

Update 2: This endpoint seems to exist: https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key=${apiKey}

But I haven't yet found a compatible way to send the files.

0wwafa commented 3 weeks ago

// Make sure to include these imports:
// import { GoogleAIFileManager } from "@google/generative-ai/server";
// import { GoogleGenerativeAI } from "@google/generative-ai";
const fileManager = new GoogleAIFileManager(process.env.API_KEY);

const uploadResult = await fileManager.uploadFile(
  `${mediaPath}/jetpack.jpg`,
  {
    mimeType: "image/jpeg",
    displayName: "Jetpack drawing",
  },
);
// View the response.
console.log(
  `Uploaded file ${uploadResult.file.displayName} as: ${uploadResult.file.uri}`,
);

const genAI = new GoogleGenerativeAI(process.env.API_KEY);
const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });
const result = await model.generateContent([
  "Tell me about this image.",
  {
    fileData: {
      fileUri: uploadResult.file.uri,
      mimeType: uploadResult.file.mimeType,
    },
  },
]);
console.log(result.response.text());

fjosue4 commented 3 weeks ago

Nope @0wwafa, we need to use the endpoint, as GoogleAIFileManager is not compatible with the browser.

Here's what ChatGPT still says about this error:

0wwafa commented 3 weeks ago

check the rest api. https://ai.google.dev/api/files#files_get-SHELL

0wwafa commented 3 weeks ago

I tested it in Python, in Node.js, and in the shell with curl, and they all work.

0wwafa commented 3 weeks ago

this also works:

const url = `https://generativelanguage.googleapis.com/v1beta/models?key=${apiKey}`;

fetch(url, {
    method: 'GET',
})
.then(response => response.json())
.then(data => {
    console.log(data);
})
.catch(error => {
    console.error('Error:', error);
});

0wwafa commented 3 weeks ago

and also this: const url = `https://generativelanguage.googleapis.com/v1beta/files?key=${apiKey}`;

0wwafa commented 3 weeks ago

hmm I see the problem.. when doing a POST to upload the file it seems there is a problem:

No 'Access-Control-Allow-Origin' header is present on the requested resource.

but that can be managed from the back-end with a small nodejs or python program...

0wwafa commented 3 weeks ago

Yep.. it must be done in the back-end.. in nodejs:

https://ai.google.dev/api/files#files_create_text-JAVASCRIPT

// Make sure to include these imports:
// import { GoogleAIFileManager } from "@google/generative-ai/server";
// import { GoogleGenerativeAI } from "@google/generative-ai";
const fileManager = new GoogleAIFileManager(process.env.API_KEY);

const uploadResult = await fileManager.uploadFile(`${mediaPath}/a11.txt`, {
  mimeType: "text/plain",
  displayName: "Apollo 11",
});
// View the response.
console.log(
  `Uploaded file ${uploadResult.file.displayName} as: ${uploadResult.file.uri}`,
);

const genAI = new GoogleGenerativeAI(process.env.API_KEY);
const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });
const result = await model.generateContent([
  "Transcribe the first few sentences of this document.",
  {
    fileData: {
      fileUri: uploadResult.file.uri,
      mimeType: uploadResult.file.mimeType,
    },
  },
]);
console.log(result.response.text());

fjosue4 commented 3 weeks ago

Yes, it runs correctly on Node.js, but this UI is browser-based with a pure front end, which is why it doesn't work directly. It requires an endpoint similar to the one I sent you, but files won't be stored because Google doesn't like it that way.

https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key=${apiKey}

0wwafa commented 3 weeks ago

an alternative is the inlining:

      "parts":[
        {
          "inline_data": {
            "mime_type":"text/plain",
            "data": "'$(base64 $B64FLAGS a11.txt)'"
          }
        }
      ],

There are many MIME types accepted, including PDF, PNG, text, MP3, WMV, MP4, etc.
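For a browser front end, the same inline payload could be assembled in JavaScript. This is only a minimal sketch, assuming the file content is already base64-encoded; `buildInlineRequest` is a hypothetical helper (not part of any SDK), and the field names simply mirror the snippet above.

```javascript
// Hypothetical helper: builds a generateContent request body that carries
// one inlined file plus a text prompt, using the field names shown above.
function buildInlineRequest(base64Data, mimeType, prompt) {
  return {
    contents: [
      {
        parts: [
          { inline_data: { mime_type: mimeType, data: base64Data } },
          { text: prompt },
        ],
      },
    ],
  };
}

// Example: inline a short text file. Note that btoa only accepts strings
// whose characters fit in one byte (Latin-1).
const inlineBody = buildInlineRequest(btoa("hello world"), "text/plain", "Summarize this.");
console.log(JSON.stringify(inlineBody, null, 2));
```

The resulting object would then be serialized with `JSON.stringify` and sent as the POST body.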

0wwafa commented 3 weeks ago

The real problem with the web api is that every time you prompt the model you are forced to send everything (all the history etc). I opened a bug report on geminiai about this. Even cached content does not work because they use the "fs" api, but perhaps there could be a workaround.

fjosue4 commented 3 weeks ago

Yes, that's a problem. All chats, at least in this app, are stored in LocalStorage to provide context to Gemini. Even so, it sometimes reads a message and responds incorrectly because it got lost reading all the historical messages. Storing files as well would be a huge memory problem, because we would be sending all files every time. With text alone it's hard to fill the storage, but base64-encoded files would crash the app quickly.

0wwafa commented 3 weeks ago

memory? gemini flash has 1M token context! and the base64 inlining works. subsequently (chatting) the image can be removed leaving its answers on the image.. this works perfectly also on aistudio. please consider the inlining I posted above..
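The idea of removing a file from the history once it has been answered could look roughly like this. It is a sketch under assumptions: the history is taken to be an array of `{ role, parts }` turns, `stripOldFiles` is a hypothetical helper, and the "[file removed]" placeholder is just an illustrative choice.

```javascript
// Hypothetical helper: returns a copy of the chat history in which file
// parts are dropped from every turn except the newest one, so files that
// were already answered are not sent (or stored) again.
function stripOldFiles(history) {
  return history.map((turn, index) => {
    if (index === history.length - 1) return turn; // keep the newest turn intact
    const parts = turn.parts.filter((part) => !part.inlineData && !part.inline_data);
    // If a turn held only a file, leave a short text placeholder rather than
    // an empty parts list (assumption: empty turns are best avoided).
    return { ...turn, parts: parts.length > 0 ? parts : [{ text: "[file removed]" }] };
  });
}

const exampleHistory = [
  {
    role: "user",
    parts: [
      { inlineData: { mimeType: "image/jpeg", data: "BASE64DATA" } },
      { text: "Describe this image." },
    ],
  },
  { role: "model", parts: [{ text: "A woman sitting at a cafe table." }] },
  { role: "user", parts: [{ text: "What colour is her dress?" }] },
];
console.log(JSON.stringify(stripOldFiles(exampleHistory), null, 2));
```

The original array is left untouched, so the app could still keep the full history in memory for the current session while persisting only the stripped copy.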

0wwafa commented 3 weeks ago

I just found out that it's even simpler!!!

"parts":[{"text": "BASE64DATA"}]

It automatically analyzes them!!

>       "contents": [{
>         "parts":[{"text": "ewogICJkZXBlbmRlbmNpZXMiOiB7CiAgICAiQGdvb2dsZS1haS9nZW5lcmF0aXZlbGFuZ3VhZ2UiOiAiXjIuNS4wIiwKICAgICJAZ29vZ2xlL2dlbmVyYXRpdmUtYWkiOiAiXjAuMTEuMyIsCiAgICAiY3J5cHRvLWpzIjogIl40LjIuMCIsCiAgICAid3MiOiAiXjguMTcuMCIKICB9Cn0K"},{"text": "Analyze this."}]
>         }]
>        }' 2> /dev/null
{
  "candidates": [
    {
      "content": {
        "parts": [
          {
            "text": "This is a JSON object representing dependencies for a project.  Here's a breakdown:\n\n**Structure**\n\n* **\"dependencies\"**: This is the main key that holds all the dependencies.\n* **\"@[dependency name]\"**: Each key within \"dependencies\" represents a specific dependency with its name and version number.\n\n**Dependencies**\n\n* **\"@google-ai/generative-language\"**: This dependency is for a generative language library from Google AI. Its version is \"v2.5.0\".\n* **\"@google/generative-ai\"**: Another library from Google, likely for generative AI tasks. Its version is \"v0.11.3\".\n* **\"crypto-js\"**: A library for working with cryptographic functions. Its version is \"v4.2.0\".\n* **\"ws\"**: This is a library for working with websockets. Its version is \"v8.17.0\".\n\n**Meaning**\n\nThis JSON snippet likely comes from a project's `package.json` file. It defines the software libraries that the project relies on. When installing this project, a package manager (like npm or yarn) will automatically fetch and install these dependencies and their specified versions, ensuring that the project has all the necessary components to run correctly.\n\n**Key points to remember:**\n\n* Dependency management is crucial in software development to ensure consistency and avoid conflicts.\n* Using specific versions (like \"v2.5.0\") is important for maintaining compatibility and preventing unexpected behavior.\n* `package.json` is a standard file used to define project metadata, including dependencies, for Node.js and JavaScript projects. \n"

fjosue4 commented 3 weeks ago

Awesome @0wwafa I'll test passing the base64 inside the content this weekend, I'll let you know how it goes!

About memory, I was talking on the user side (browser) keeping the historical there.

0wwafa commented 3 weeks ago

After a few more tests (I passed an image), it didn't work well. I don't know how it really works on the back end. Anyway, there are multiple ways to upload files. Another simple one: upload a file to Google Drive, enable sharing to "whoever has the link", and then add the ID of the file to the conversation.

that is the method aistudio uses. but in aistudio you can also paste an image in the chatbox... the image is passed to the model in base64. I'll get back to you when I find a solid way to do it from a webapp.

0wwafa commented 3 weeks ago

here is how to do it:

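    // Reads a File/Blob in the browser and returns a part object holding its
    // base64 payload (the data URL prefix before the comma is stripped).
    async function fileToGenerativePart(file) {
        const base64EncodedDataPromise = new Promise((resolve, reject) => {
            const reader = new FileReader();
            reader.onload = () => resolve(reader.result.split(',')[1]);
            // Reject on read errors instead of leaving the promise pending.
            reader.onerror = () => reject(reader.error);
            reader.readAsDataURL(file);
        });
        return {
            inlineData: { data: await base64EncodedDataPromise, mimeType: file.type },
        };
    }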

0wwafa commented 3 weeks ago

tested and working:

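        // fileContent is presumably a Node.js Buffer (toString('base64') is a
        // Buffer method); mimeType is the file's MIME type string.
        const data = JSON.stringify({
            contents: [{
                parts: [
                    {
                        inlineData: {
                            mimeType: mimeType,
                            data: fileContent.toString('base64'),
                        },
                    },
                    { text: 'Analyze this.' },
                ],
            }],
        });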
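For completeness, here is a hedged sketch of how such a body might be posted to the `generateContent` endpoint quoted earlier in this thread. `generateContent` here is a hypothetical wrapper, the response handling assumes the `candidates[0].content.parts` shape seen in the sample output above, and `fetchImpl` is an injectable parameter added only so the function can be exercised without network access.

```javascript
// Sketch: POST a request body to the generateContent endpoint and return the
// concatenated text of the first candidate.
async function generateContent(apiKey, body, fetchImpl = fetch) {
  const url =
    "https://generativelanguage.googleapis.com/v1beta/models/" +
    `gemini-1.5-flash:generateContent?key=${encodeURIComponent(apiKey)}`;
  const response = await fetchImpl(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  const json = await response.json();
  // Assumed response shape: candidates[0].content.parts[].text
  return json.candidates[0].content.parts.map((part) => part.text ?? "").join("");
}
```

A call would look like `const answer = await generateContent(apiKey, requestBody);`, where `requestBody` is the object passed to `JSON.stringify` above rather than the already serialized string.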

0wwafa commented 3 weeks ago

$ node anal2.js woman_art1.jpg

The painting depicts a woman sitting at a cafe table, her gaze directed downwards, creating a sense of introspection. She is dressed in a vibrant red dress, accentuated by a white top, suggesting a sense of sophistication and femininity. The dress, with its flowing lines, adds a graceful touch to the composition. Her long, flowing hair cascades down her back, framing her face and drawing attention to her features.

The setting is a Parisian cafe, a quintessential location synonymous with romance and art. The background, although blurred, provides a glimpse into the bustling cafe scene. The cafe tables and chairs, along with the figures of other patrons in the distance, contribute to the lively atmosphere of the scene. The cafe tables and chairs, along with the figures of other patrons in the distance, contribute to the lively atmosphere of the scene. The cafe tables and chairs, along with the figures of other patrons in the distance, contribute to the lively atmosphere of the scene. The cafe tables and chairs, along with the figures of other patrons in the distance, contribute to the lively atmosphere of the scene. The cafe tables and chairs, along with the figures of other patrons in the distance, contribute to the lively atmosphere of the scene. The cafe tables and chairs, along with the figures of other patrons in the distance, contribute to the lively atmosphere of the scene. The cafe tables and chairs, along with the figures of other patrons in the distance, contribute to the lively atmosphere of the scene. The cafe tables and chairs, along with the figures of other patrons in the distance, contribute to the lively atmosphere of the scene. The cafe tables and chairs, along with the figures of other patrons in the distance, contribute to the lively atmosphere of the scene. The cafe tables and chairs, along with the figures of other patrons in the distance, contribute to the lively atmosphere of the scene.

The use of light and shadow adds depth and dimension to the painting. The sun casts a warm glow on the woman, highlighting her features and creating a sense of warmth. The shadows, meanwhile, accentuate the lines of the cafe and the figures of the patrons, creating a sense of depth and realism. The cafe tables and chairs, along with the figures of other patrons in the distance, contribute to the lively atmosphere of the scene.

The overall mood of the painting is one of contemplation and tranquility. The woman's pensive expression, combined with the relaxed setting of the cafe, creates a sense of peacefulness. The warm colors and the use of light and shadow further enhance this sense of tranquility, inviting the viewer to step into the painting and experience the moment.

The painting is a beautiful representation of a timeless scene, capturing the essence of Parisian cafe culture. It is a testament to the artist's skill in depicting the human figure and creating a sense of realism and beauty. The painting is sure to captivate viewers with its evocative imagery and its ability to transport them to a different time and place. The cafe tables and chairs, along with the figures of other patrons in the distance, contribute to the lively atmosphere of the scene.

0wwafa commented 3 weeks ago

The only restriction is that the payload can be at most 20971520 bytes (20 MiB).
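Since base64 encodes every 3 bytes as 4 characters, a raw file of about 15 MiB already encodes to the full 20971520 bytes. A rough pre-flight check could look like the sketch below; `fitsInlinePayload` and the 1024-byte allowance for the prompt and JSON wrapper are assumptions for illustration, and the sketch assumes the limit applies to the whole request.

```javascript
const MAX_PAYLOAD_BYTES = 20971520; // limit quoted above (20 * 1024 * 1024)

// Base64 turns every 3 input bytes into 4 output characters, with padding.
function base64Length(byteCount) {
  return 4 * Math.ceil(byteCount / 3);
}

// Rough check: does a file of `byteCount` bytes fit once encoded, leaving
// `overheadBytes` for the prompt and JSON wrapper (an assumed allowance)?
function fitsInlinePayload(byteCount, overheadBytes = 1024) {
  return base64Length(byteCount) + overheadBytes <= MAX_PAYLOAD_BYTES;
}

console.log(fitsInlinePayload(10 * 1024 * 1024)); // a 10 MiB file
console.log(fitsInlinePayload(16 * 1024 * 1024)); // a 16 MiB file
```

In a browser, `file.size` from a file input would be the natural value to pass as `byteCount` before reading the file at all.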

0wwafa commented 3 weeks ago

it seems to work with many more file types than the ones publicized :D

$ node anal2.js ../spectrogram.html

This HTML code creates a web page that visualizes audio input from the user's microphone in the form of a spectrogram. Here's a breakdown of the code and its functionality:

HTML Structure

Basic HTML Structure: The code begins with a standard HTML document structure (<!DOCTYPE html>, <html>, <head>, <body>).
CSS Styling:
- style.css: An external CSS file is linked to the HTML, likely containing additional styles that aren't included in the inline <style> tag.
- Inline Styling: The <style> tag provides basic styling:
  - body: Sets the background color to dark gray, fills the entire viewport, removes margins and padding, and prevents horizontal and vertical scrollbars.
  - canvas: Sets the background color of the canvas element to black.
Overlay Element: The <div class="overlay"> element serves as an initial overlay, covering the entire screen.
- Button: The <button> within the overlay triggers the start of the spectrogram visualization when clicked.
Canvas Element: The <canvas> element is where the spectrogram will be drawn.
JavaScript Script: The <script> tag contains JavaScript code to handle user interactions, audio processing, and the visualization.

JavaScript Functionality

1. Event Listener and Overlay Removal:

An event listener is attached to the button in the overlay:
- document.querySelector(".overlay > button").addEventListener("click", (e) => { ... })
When the button is clicked, the event handler:
- Removes the overlay element from the DOM:
  - overlay.parentNode.removeChild(overlay);
- Calls the startSpectrogram function to initiate the spectrogram visualization.

2. startSpectrogram Function:

Canvas Initialization:
- let canvas = document.querySelector('canvas');
- let ctx = canvas.getContext('2d'); (Gets the 2D drawing context for the canvas)
- let WIDTH = +canvas.width; and let HEIGHT = +canvas.height; (Retrieve the canvas dimensions)
Microphone Constraints:
- let constraints = { audio: { ... } }; (Defines audio constraints for getUserMedia)
- Specifies options to disable features like echo cancellation, noise suppression, and automatic gain control. This ensures raw audio is captured.
Responsive Canvas:
- The code implements a function resizeCanvas to make the canvas responsive to window resizing. It updates the canvas width and height to match its container's size.
- A resizeTimeout variable is used to throttle resize events to avoid frequent re-rendering.
Shifting Image Data:
- The shiftLeft function shifts the canvas image data one pixel to the left, effectively moving the spectrogram visualization to the left to create a "scrolling" effect.
Audio Context and Analyser:
- let audioContext = new AudioContext(); (Creates an audio context, which handles audio processing)
- let analyser = audioContext.createAnalyser(); (Creates an audio analyser node to analyze frequency data)
- analyser.fftSize = 2048; (Sets the size of the Fast Fourier Transform used for analysis)
- analyser.smoothingTimeConstant = 0.0; (Sets the smoothing factor to 0, resulting in no smoothing for raw data)
- let buffer = new Uint8Array(analyser.frequencyBinCount); (Creates a buffer to store frequency data)
Microphone Input:
- navigator.mediaDevices.getUserMedia(constraints) (Requests microphone access from the user)
- The promise resolves to a stream object, which represents the audio input.
- var microphone = audioContext.createMediaStreamSource(stream); (Creates a media stream source from the stream)
- microphone.connect(analyser); (Connects the microphone source to the analyser node)
Drawing the Spectrogram:
- The draw function is responsible for visualizing the spectrogram:
  - It calls shiftLeft to shift the image data to the left.
  - analyser.getByteFrequencyData(buffer) retrieves the frequency data into the buffer.
  - The code calculates the vertical spacing (dy) based on the number of frequency bins.
  - It iterates through the buffer, using the frequency data to set the fill color of rectangular bars.
  - The bars are drawn from right to left, creating a spectrogram visualization.
  - The requestAnimationFrame(draw) schedules the draw function to be called repeatedly for smooth animation.

Overall, this code effectively creates a basic real-time audio spectrogram visualization using the Web Audio API and canvas drawing.

Improvements and Potential Features:

Color Schemes: Implement different color schemes to enhance the visual appeal and potentially represent different frequency ranges with distinct colors.
Frequency Range Visualization: Display specific frequency ranges on the spectrogram, allowing the user to focus on particular audio frequencies.
User Interface: Add controls for adjusting parameters like FFT size, smoothing, and color mapping.
Audio Output: Implement audio output, allowing the user to hear the processed audio alongside the spectrogram visualization.
Interactive Features: Add features like clicking on the spectrogram to highlight or zoom into specific frequency bands.

fjosue4 commented 3 weeks ago

Great, thanks for sharing your finds! I'll let you know how my testing goes.

0wwafa commented 2 weeks ago

it works! you can add it...

only caveat: not all mime types are supported. and for code or text files it's better to put the code or text file in the message as it is than passing it as an inlined file.

0wwafa commented 2 weeks ago

This one he messed up.. but the answer is almost right:

0wwafa commented 2 weeks ago

Note: for files that don't have a known MIME type, or whose type is not accepted, just use their ASCII representation. If you pass them as application/* they will probably be refused. If you need it, I can send you the code of the "file analyzer". It's a static page.
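That rule could be sketched as a small chooser that inlines media types and falls back to plain text for everything else. The prefix list below is an assumption made for illustration, not the API's actual accepted list, and `toPart` and `bytesToBase64` are hypothetical helpers. Text is decoded as UTF-8, which covers plain ASCII.

```javascript
// Assumed prefixes for types worth inlining; the real accepted list is
// defined by the API and is not reproduced here.
const INLINE_PREFIXES = ["image/", "audio/", "video/", "application/pdf"];

// Encode raw bytes (Uint8Array) as base64 using the built-in btoa.
function bytesToBase64(bytes) {
  let binary = "";
  for (const byte of bytes) binary += String.fromCharCode(byte);
  return btoa(binary);
}

// Hypothetical helper: inline known media types, send everything else
// (code, text, unknown or application/* types) as a plain text part.
function toPart(bytes, mimeType) {
  const inline = INLINE_PREFIXES.some((prefix) => (mimeType || "").startsWith(prefix));
  if (inline) {
    return { inlineData: { mimeType, data: bytesToBase64(bytes) } };
  }
  return { text: new TextDecoder().decode(bytes) };
}

const sourceBytes = new TextEncoder().encode("console.log('hi');");
console.log(toPart(sourceBytes, "application/javascript")); // becomes a text part
console.log(toPart(new Uint8Array([0xff, 0xd8, 0xff]), "image/jpeg")); // stays inline
```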

fjosue4 commented 2 weeks ago

I tested it last night; it mostly returns errors and only sometimes reads the file. I'll send an update with a selector to choose between gemini-1.5-flash and gemini-pro so you can test it and check whether there's a problem with the API call compared to what you tested.

fjosue4 commented 2 weeks ago

@0wwafa I've merged the changes, if the model is Flash you can select files.

There are a few bugs I can fix later this weekend like not clearing the files after sending the prompt but you should be good to play with it and give feedback or suggest fixes for processing files.

0wwafa commented 2 weeks ago

The REST API is tricky, but I have now finished the file analyzer, which uses the REST API, and it works beautifully. All supported file types (all audio, video, and document types) work! The only limit is that the payload can't exceed 20971520 bytes.

fjosue4 commented 2 weeks ago

It's great to know that you got the analyzer working!

Let me know if you test the update I sent, if you want to include part of your analyzer to improve passing the base64 feel free to share the code or open a Pull Request

0wwafa commented 2 weeks ago

I will publish my code once it is "decent" :D As of now it's working beautifully using the streaming API (which is a mess). It started as a proof of concept to show you how it could be done; now it's a standalone program of 600 handwritten lines with only one imported library (markdown.js).

fjosue4 / google-gemini-ui

Add file upload to gemini. #14