MrCsabaToth opened 2 months ago
Maybe also provide audio generation capability with AudioFX? That would be a new interaction mode as well.
We should implement this as part of the multi-modal input screen. Currently it can capture an image; soon it will be able to include more (#37) and also record videos (#43). The user will also be able to attach photos (or other modalities?). We should also allow recording audio on that screen. Then the user can decide what mixture of modalities they want to supply to the LLM.
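A minimal sketch of how the mixed attachments could be collected on that screen; the `MediaAttachment` model and `AttachmentKind` enum are hypothetical names, not existing app code:

```dart
import 'dart:typed_data';

/// Hypothetical modality tags for the multi-modal input screen.
enum AttachmentKind { image, video, audio }

/// One recorded or attached item; the user can mix several of these
/// and the LLM layer later turns each one into an inline data part.
class MediaAttachment {
  const MediaAttachment(this.kind, this.mimeType, this.bytes);

  final AttachmentKind kind;
  final String mimeType; // e.g. 'image/jpeg', 'video/mp4', 'audio/mp4'
  final Uint8List bytes;
}

/// The screen accumulates attachments until the user submits the request.
final attachments = <MediaAttachment>[];
```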
The record plugin example records AAC LC audio, and it's in an M4A container!
Furthermore, the MIME type plugin unfortunately identifies it as video:
MIME for /data/user/0/dev.csaba.inspector_gadget.dev/app_flutter/audio_1726193046177.aac: video/mp4
whereas Ogg Opus is unsurprisingly identified correctly:
MIME for /data/user/0/dev.csaba.inspector_gadget.dev/app_flutter/audio_1726191807140.ogg: audio/ogg
mediainfo audio_1726193046177.aac
General
Complete name : audio_1726193046177.aac
Format : MPEG-4
Format profile : Base Media / Version 2
Codec ID : mp42 (isom/mp42)
File size : 107 KiB
Duration : 6 s 664 ms
Overall bit rate mode : Constant
Overall bit rate : 132 kb/s
Encoded date : 2024-09-13 02:04:13 UTC
Tagged date : 2024-09-13 02:04:13 UTC
com.android.version : 14
com.android.manufacturer : motorola
com.android.model : motorola razr 2022
FileExtension_Invalid : braw mov mp4 m4v m4a m4b m4p m4r 3ga 3gpa 3gpp 3gp 3gpp2 3g2 k3g jpm jpx mqv ismv isma ismt f4a f4b f4v
Audio
ID : 1
Format : AAC LC
Format/Info : Advanced Audio Codec Low Complexity
Codec ID : mp4a-40-2
Duration : 6 s 664 ms
Source duration : 6 s 687 ms
Bit rate mode : Constant
Bit rate : 128 kb/s
Channel(s) : 1 channel
Channel layout : M
Sampling rate : 44.1 kHz
Frame rate : 43.066 FPS (1024 SPF)
Compression mode : Lossy
Stream size : 104 KiB (97%)
Source stream size : 104 KiB (97%)
Title : SoundHandle
Language : English
Encoded date : 2024-09-13 02:04:13 UTC
Tagged date : 2024-09-13 02:04:13 UTC
mdhd_Duration : 6664
Errors : Missing ID_END
Conformance errors : 1
AAC : Yes
General compliance : Bitstream parsing ran out of data to read before the end of the syntax was reached, most probably the bitstream is malformed (frame 0, time -00:00:00.023, offset 0xD35)
mediainfo audio_1726123895553.ogg
General
Complete name : audio_1726123895553.ogg
Format : Ogg
File size : 95.8 KiB
Duration : 11 s 940 ms
Overall bit rate : 65.7 kb/s
Audio
ID : 1825418627 (0x6CCDAD83)
Format : Opus
Duration : 11 s 940 ms
Channel(s) : 1 channel
Channel layout : M
Sampling rate : 24.0 kHz
Compression mode : Lossy
Writing library : libopus
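Since we know the file actually came from the audio recorder, the misleading video/mp4 detection could be overridden before attaching the file. A minimal sketch, assuming the mime package's lookupMimeType (the plugin the app currently uses may differ) and a hypothetical detectRecordingMime helper:

```dart
import 'dart:io';
import 'package:mime/mime.dart';

/// We know this file was produced by the audio recorder, so a video/*
/// detection of the MP4/M4A container is a misclassification we can correct.
Future<String> detectRecordingMime(String path) async {
  final raf = await File(path).open();
  final headerBytes = await raf.read(64); // sniff the container header
  await raf.close();
  final detected =
      lookupMimeType(path, headerBytes: headerBytes) ?? 'application/octet-stream';
  // AAC LC recorded by the record plugin lands in an M4A container,
  // which sniffs as video/mp4; treat it as audio instead.
  return detected == 'video/mp4' ? 'audio/mp4' : detected;
}
```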
For simplicity I displayed a modal alert dialog while the recording is going on: I start the recording before invoking the modal and stop it after dismissal. It seems that the separate routing level and UI loop interfere with the recording, so I'll probably have to convert the modal alert into a bottom sheet and do the work in the sheet's widget.
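A sketch of that bottom-sheet approach, assuming the record package's v5 AudioRecorder API; the widget name and fields are placeholders:

```dart
import 'package:flutter/material.dart';
import 'package:record/record.dart';

/// Placeholder bottom sheet that owns the recorder, so start/stop happen
/// inside the sheet's own widget lifecycle instead of around a dialog.
class AudioBottomSheet extends StatefulWidget {
  const AudioBottomSheet({super.key, required this.targetPath});
  final String targetPath;

  @override
  State<AudioBottomSheet> createState() => _AudioBottomSheetState();
}

class _AudioBottomSheetState extends State<AudioBottomSheet> {
  final _recorder = AudioRecorder();

  @override
  void initState() {
    super.initState();
    // Permission handling omitted for brevity; see AudioRecorder.hasPermission().
    _recorder.start(const RecordConfig(encoder: AudioEncoder.aacLc),
        path: widget.targetPath);
  }

  @override
  void dispose() {
    _recorder.dispose();
    super.dispose();
  }

  Future<void> _finish() async {
    final path = await _recorder.stop(); // returns the recorded file path
    if (mounted) Navigator.of(context).pop(path);
  }

  @override
  Widget build(BuildContext context) {
    return SafeArea(
      child: Padding(
        padding: const EdgeInsets.all(16),
        child: ElevatedButton(
          onPressed: _finish,
          child: const Text('Stop recording'),
        ),
      ),
    );
  }
}
```

It would be shown with showModalBottomSheet<String>(...), and the awaited result is the recorded file path.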
Converting the modal to a bottom sheet helped; the music now records OK. However, attaching it to the LLM request doesn't seem to work:
Now the models flat out refuse to process audio files. https://discuss.ai.google.dev/t/gemini-1-5-refuses-to-process-audio-files/39713
Since the video modality works, maybe the workaround is to augment a video container with a blank video stream (which can have an extremely low bitrate due to its nature) and attach the audio stream to it? It seems that ffmpeg can do that:
In Flutter land I wanted to use FFprobe for MIME determination before (https://github.com/CsabaConsulting/InspectorGadgetApp/issues/43#issuecomment-2339816368), and the same package might be able to perform such a trick: https://pub.dev/packages/ffmpeg_kit_flutter
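A sketch of the blank-video muxing idea with ffmpeg_kit_flutter; the exact ffmpeg arguments are an assumption and would need to be verified on-device (and the chosen codec must exist in the bundled ffmpeg_kit_flutter variant):

```dart
import 'package:ffmpeg_kit_flutter/ffmpeg_kit.dart';
import 'package:ffmpeg_kit_flutter/return_code.dart';

/// Wraps the recorded audio in an MP4 with a tiny synthetic video stream
/// (1 fps black frames), so it can be attached through the video modality.
Future<bool> wrapAudioInBlankVideo(String audioPath, String outputPath) async {
  final session = await FFmpegKit.execute(
    '-f lavfi -i color=c=black:s=16x16:r=1 '
    '-i "$audioPath" -shortest -c:v mpeg4 -b:v 10k -c:a copy "$outputPath"',
  );
  final returnCode = await session.getReturnCode();
  return ReturnCode.isSuccess(returnCode);
}
```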
Allen Firstenberg pointed out that what I am missing is probably the File Upload API: https://discuss.ai.google.dev/t/gemini-1-5-refuses-to-process-audio-files/39713/5?u=tocsa.
I'm using DataPart (inline data, https://github.com/google-gemini/generative-ai-dart/blob/ec5a820166fdb05fb5b387efab31eccce9d4072f/pkgs/google_generative_ai/lib/src/content.dart#L113); however, I would probably need to use FilePart (https://github.com/google-gemini/generative-ai-dart/blob/ec5a820166fdb05fb5b387efab31eccce9d4072f/pkgs/google_generative_ai/lib/src/content.dart#L123). The gotcha is that even though FilePart is part of the package, the Google AI File Service API is nowhere to be found: https://github.com/google-gemini/generative-ai-dart/issues/211
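For reference, the difference between the two request shapes in the google_generative_ai package (a sketch; the FilePart branch is commented out because it needs a URI returned by the File Upload API, which the Dart SDK doesn't expose yet per #211):

```dart
import 'dart:io';
import 'dart:typed_data';
import 'package:google_generative_ai/google_generative_ai.dart';

Future<void> askAboutAudio(GenerativeModel model, String audioPath) async {
  final Uint8List audioBytes = await File(audioPath).readAsBytes();

  // What the app does today: inline bytes via DataPart.
  final inlineRequest = Content.multi([
    TextPart('Describe this audio clip.'),
    DataPart('audio/mp4', audioBytes),
  ]);

  // What the File Upload path would look like: FilePart wraps a URI
  // previously returned by the (not yet exposed) file upload service.
  // final fileRequest = Content.multi([
  //   TextPart('Describe this audio clip.'),
  //   FilePart(Uri.parse('<uri returned by the File API>')),
  // ]);

  final response = await model.generateContent([inlineRequest]);
  print(response.text);
}
```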
We'll need to switch to https://pub.dev/packages/firebase_vertexai/; Firebase Flutter has had file upload support for a long time now.
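A sketch of what that switch could look like, assuming firebase_vertexai's FileData part and a Cloud Storage upload via firebase_storage (class and method names should be checked against the current package versions):

```dart
import 'dart:io';
import 'package:firebase_storage/firebase_storage.dart';
import 'package:firebase_vertexai/firebase_vertexai.dart';

/// Upload the recording to Cloud Storage for Firebase, then reference it
/// from the Vertex AI request instead of inlining the bytes.
Future<String?> askAboutUploadedAudio(String audioPath) async {
  final ref = FirebaseStorage.instance
      .ref('recordings/${DateTime.now().millisecondsSinceEpoch}.m4a');
  await ref.putFile(File(audioPath));
  final gsUri = 'gs://${ref.bucket}/${ref.fullPath}';

  final model =
      FirebaseVertexAI.instance.generativeModel(model: 'gemini-1.5-flash');
  final response = await model.generateContent([
    Content.multi([
      TextPart('Describe this audio clip.'),
      FileData('audio/mp4', gsUri),
    ]),
  ]);
  return response.text;
}
```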
This would be a fourth major interaction mode besides voice chat, image chat, and translation. The user may want to record two audio snippets: the first snippet is a sample passed along as an audio modality, followed by a second content part which is a transcribed voice instruction.
Currently the user cannot request music recognition because all recorded audio is transcribed. This mode opens up Shazam-like functionality or more.
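A sketch of that two-snippet request shape (hypothetical function and variable names), where the first recording travels as an audio part and the second one arrives already transcribed as text:

```dart
import 'dart:typed_data';
import 'package:google_generative_ai/google_generative_ai.dart';

/// First snippet: raw audio sample (e.g. the song to recognize).
/// Second snippet: the user's instruction, already transcribed to text.
Content buildAudioChatRequest(Uint8List sampleBytes, String transcribedInstruction) {
  return Content.multi([
    DataPart('audio/mp4', sampleBytes),
    TextPart(transcribedInstruction), // e.g. 'What song is this?'
  ]);
}
```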