CsabaConsulting / InspectorGadgetApp

Open Multi-Modal Personal Assistant
MIT License
4 stars 1 forks source link

Add "Shazam mode" #38

Open MrCsabaToth opened 2 months ago

MrCsabaToth commented 2 months ago

This would be a new fourth major interaction mode besides voice chat, image chat, and translation. The user may want to record two audio snippets. The first snippet is a sample and passed along as a voice modality, followed by the second content part which is a transcribed voice instruction.

Currently someone cannot request a music recognition because all recorded audio is transcribed. This mode opens up Shazam like function or more.

MrCsabaToth commented 2 months ago

Maybe also provide audio generation capability with AudioFX? That would be a new interaction mode as well.

MrCsabaToth commented 1 month ago

We should implement this as a part of the multi modal input screen. Currently it is able to record an image, soon it'll be able include more (#37) and also record videos (#43). The user will also able to attach photos (or other modalities?). We should also allow on that screen to record audio. Then the user can decide what mixture of modalities they want to supply to the LLM.

MrCsabaToth commented 1 month ago

The record plugin example records as an AAC LC, and it's in an m4a container! Furthermore, the mime type plugin unfortunately identifies that as a video! MIME for /data/user/0/dev.csaba.inspector_gadget.dev/app_flutter/audio_1726193046177.aac: video/mp4

whereas Ogg Opus is unsurprisingly identified correctly: MIME for /data/user/0/dev.csaba.inspector_gadget.dev/app_flutter/audio_1726191807140.ogg: audio/ogg

mediainfo audio_1726193046177.aac 
General
Complete name                            : audio_1726193046177.aac
Format                                   : MPEG-4
Format profile                           : Base Media / Version 2
Codec ID                                 : mp42 (isom/mp42)
File size                                : 107 KiB
Duration                                 : 6 s 664 ms
Overall bit rate mode                    : Constant
Overall bit rate                         : 132 kb/s
Encoded date                             : 2024-09-13 02:04:13 UTC
Tagged date                              : 2024-09-13 02:04:13 UTC
com.android.version                      : 14
com.android.manufacturer                 : motorola
com.android.model                        : motorola razr 2022
FileExtension_Invalid                    : braw mov mp4 m4v m4a m4b m4p m4r 3ga 3gpa 3gpp 3gp 3gpp2 3g2 k3g jpm jpx mqv ismv isma ismt f4a f4b f4v

Audio
ID                                       : 1
Format                                   : AAC LC
Format/Info                              : Advanced Audio Codec Low Complexity
Codec ID                                 : mp4a-40-2
Duration                                 : 6 s 664 ms
Source duration                          : 6 s 687 ms
Bit rate mode                            : Constant
Bit rate                                 : 128 kb/s
Channel(s)                               : 1 channel
Channel layout                           : M
Sampling rate                            : 44.1 kHz
Frame rate                               : 43.066 FPS (1024 SPF)
Compression mode                         : Lossy
Stream size                              : 104 KiB (97%)
Source stream size                       : 104 KiB (97%)
Title                                    : SoundHandle
Language                                 : English
Encoded date                             : 2024-09-13 02:04:13 UTC
Tagged date                              : 2024-09-13 02:04:13 UTC
mdhd_Duration                            : 6664
Errors                                   : Missing ID_END
Conformance errors                       : 1
 AAC                                     : Yes
  General compliance                     : Bitstream parsing ran out of data to read before the end of the syntax was reached, most probably the bitstream is malformed (frame 0, time -00:00:00.023, offset 0xD35)
mediainfo audio_1726123895553.ogg 
General
Complete name                            : audio_1726123895553.ogg
Format                                   : Ogg
File size                                : 95.8 KiB
Duration                                 : 11 s 940 ms
Overall bit rate                         : 65.7 kb/s

Audio
ID                                       : 1825418627 (0x6CCDAD83)
Format                                   : Opus
Duration                                 : 11 s 940 ms
Channel(s)                               : 1 channel
Channel layout                           : M
Sampling rate                            : 24.0 kHz
Compression mode                         : Lossy
Writing library                          : libopus
MrCsabaToth commented 1 month ago

I displayed a modal alert dialog while the recording is going on; this is for simplicity. I instantiate the recording before invoking the modal and stops it after dismissal. It seems that the separate routing level and UI loop interferes with the recording, so I probably have to convert the modal alert into a bottom sheet and do the work in the sheet's widget.

MrCsabaToth commented 1 month ago

Converting the modal to bottom sheet helped, now the music records OK. However attaching it to the LLM request doesn't seem to work:

MrCsabaToth commented 1 month ago

Now the models flat out refuse to process audio files. https://discuss.ai.google.dev/t/gemini-1-5-refuses-to-process-audio-files/39713

Since the video modality works, maybe the workaround is to augment a video container with a blank video stream (which can be extremely low bitrate due to its nature) and attach the audio stream to it? Seems like that ffmpeg can do that:

In Flutter land I wanted to use FFProbe for MIME determination before (https://github.com/CsabaConsulting/InspectorGadgetApp/issues/43#issuecomment-2339816368), but it might be able to perform such a trick: https://pub.dev/packages/ffmpeg_kit_flutter

MrCsabaToth commented 1 month ago

Allen Firstenberg pointed out that what I am missing is probably the File Upload API: https://discuss.ai.google.dev/t/gemini-1-5-refuses-to-process-audio-files/39713/5?u=tocsa.

I'm using DataPart (inline data https://github.com/google-gemini/generative-ai-dart/blob/ec5a820166fdb05fb5b387efab31eccce9d4072f/pkgs/google_generative_ai/lib/src/content.dart#L113), however I would probably need to use FilePart (https://github.com/google-gemini/generative-ai-dart/blob/ec5a820166fdb05fb5b387efab31eccce9d4072f/pkgs/google_generative_ai/lib/src/content.dart#L123). The gotcha is that even though the FilePart is part of the package, the Google AI File Service API is nowhere to be found: https://github.com/google-gemini/generative-ai-dart/issues/211

MrCsabaToth commented 1 month ago

We'll need to switch to https://pub.dev/packages/firebase_vertexai/ Firebase Flutter has file upload support for a long time now.