chenxwh / cog-whisper

MIT License

Both m4a and mp4 audio files aren't fully transcribed #2

Closed by anotherjesse 1 year ago

anotherjesse commented 2 years ago

Transcribing iPhone voice memos directly from their native m4a format didn't work.

It transcribed about half of my 25-minute memo. (If you have it output the timestamps you can see it tries to read later audio but only transcribes ...

If I convert it to an mp3 before sending it to cog-whisper (or the timestamp version), it succeeds.
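For anyone who wants to script that workaround, here is a minimal sketch that re-encodes the file with a local ffmpeg before uploading. The helper names `build_mp3_command` and `convert_to_mp3` are made up for illustration, and it assumes an `ffmpeg` binary on your PATH that reads the file correctly (the old one inside the container may not).

```python
import subprocess
from pathlib import Path
from typing import List


def build_mp3_command(src: str) -> List[str]:
    """Return an ffmpeg command that re-encodes `src` to an MP3 next to it.

    Hypothetical helper for illustration; not part of cog-whisper.
    """
    dst = Path(src).with_suffix(".mp3")
    # -y overwrites an existing output file; -vn drops any video stream
    # so that mp4 inputs produce an audio-only result.
    return ["ffmpeg", "-y", "-i", src, "-vn", str(dst)]


def convert_to_mp3(src: str) -> str:
    """Run the conversion and return the path of the new MP3 file."""
    cmd = build_mp3_command(src)
    # check=True raises CalledProcessError if ffmpeg exits with an error.
    subprocess.run(cmd, check=True)
    return cmd[-1]
```

Calling `convert_to_mp3("memo.m4a")` should leave a `memo.mp3` beside the original, which can then be passed to cog-whisper as the audio input.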

Similarly, someone showed up in Discord with an issue with mp4 files being truncated.

anotherjesse commented 2 years ago

running ffprobe from within the cjwbw/whisper container shows:

ffprobe version 4.2.7-0ubuntu0.1 Copyright (c) 2007-2022 the FFmpeg developers
  built with gcc 9 (Ubuntu 9.4.0-1ubuntu1~20.04.1)

(snip)

Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'whisper.mp4':
  Metadata:
    major_brand     : iso5
    minor_version   : 1
    compatible_brands: isomiso5hlsf
    creation_time   : 2022-09-25T17:48:04.000000Z
  Duration: 00:00:00.98, start: 0.000000, bitrate: 1096 kb/s
    Stream #0:0(und): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, mono, fltp, 1081 kb/s (default)
    Metadata:
      creation_time   : 2022-09-25T17:48:04.000000Z
      handler_name    : Core Media Audio

anotherjesse commented 2 years ago

Whereas a more recent release installed via Homebrew reports the full duration (5.76 s instead of 0.98 s):

$ ffprobe whisper.mp4
ffprobe version 5.1.1 Copyright (c) 2007-2022 the FFmpeg developers
  built with Apple clang version 13.1.6 (clang-1316.0.21.2.5)

(snip)

Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'whisper.mp4':
  Metadata:
    major_brand     : iso5
    minor_version   : 1
    compatible_brands: isomiso5hlsf
    creation_time   : 2022-09-25T17:48:04.000000Z
  Duration: 00:00:05.76, start: 0.000000, bitrate: 186 kb/s
  Stream #0:0[0x1](und): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, mono, fltp, 184 kb/s (default)
    Metadata:
      creation_time   : 2022-09-25T17:48:04.000000Z
      handler_name    : Core Media Audio
      vendor_id       : [0][0][0][0]
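To compare the two outputs without eyeballing them, a small sketch like the following pulls the `Duration:` value out of pasted ffprobe text. The function name `parse_duration_seconds` is hypothetical, and the regex only assumes the `HH:MM:SS.ff` layout seen in the two outputs above.

```python
import re
from typing import Optional

# Matches the "Duration: HH:MM:SS.ff" field printed by ffprobe/ffmpeg.
_DURATION_RE = re.compile(r"Duration:\s*(\d+):(\d{2}):(\d{2}(?:\.\d+)?)")


def parse_duration_seconds(ffprobe_output: str) -> Optional[float]:
    """Return the reported duration in seconds, or None if none is found.

    Hypothetical helper for illustration; it parses text only and does
    not invoke ffprobe itself.
    """
    match = _DURATION_RE.search(ffprobe_output)
    if match is None:
        # Covers outputs such as "Duration: N/A".
        return None
    hours, minutes, seconds = match.groups()
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


old = parse_duration_seconds("  Duration: 00:00:00.98, start: 0.000000, bitrate: 1096 kb/s")
new = parse_duration_seconds("  Duration: 00:00:05.76, start: 0.000000, bitrate: 186 kb/s")
print(old, new)
```

If the container's value comes out much smaller than the one from a newer build for the same file, that points at the container's ffmpeg misreading the file rather than at whisper itself.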

anotherjesse commented 2 years ago

Unfortunately, for the Ubuntu release that cog containers use (focal, 20.04), the installed ffmpeg 4.2.7 is already the most recent packaged version: https://launchpad.net/ubuntu/+source/ffmpeg
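A quick way to check which ffmpeg a given image ships is to parse the banner line it prints. This is only a sketch: `parse_ffmpeg_version` is a made-up helper, and it assumes a release-style banner like the two shown in this thread (development builds with non-numeric version strings would return None).

```python
import re
from typing import Optional, Tuple


def parse_ffmpeg_version(banner: str) -> Optional[Tuple[int, ...]]:
    """Extract the numeric version from an ffmpeg/ffprobe banner line.

    Hypothetical helper for illustration. Distro suffixes such as
    "-0ubuntu0.1" are ignored because matching stops at the first
    character that is neither a digit nor a dot-separated number.
    """
    match = re.search(r"version\s+(\d+(?:\.\d+)*)", banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


container = parse_ffmpeg_version(
    "ffprobe version 4.2.7-0ubuntu0.1 Copyright (c) 2007-2022 the FFmpeg developers"
)
homebrew = parse_ffmpeg_version(
    "ffprobe version 5.1.1 Copyright (c) 2007-2022 the FFmpeg developers"
)
# Tuples compare element by element, so this is a plain version comparison.
print(container, homebrew, container < homebrew)
```

The thread only establishes that 4.2.7 misreports this file and 5.1.1 does not, so treat any cutoff between those two versions as unknown.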

anotherjesse commented 2 years ago

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html

It looks like NVIDIA has Docker containers for Ubuntu 22.04

and https://github.com/replicate/cog/blob/main/docs/yaml.md references "the nvidia-docker base image"

It might be more valuable to test letting cog build on a more recent Ubuntu release than to explore backporting ffmpeg.

anotherjesse commented 2 years ago

https://hub.docker.com/r/nvidia/cuda/tags?page=1&name=ubuntu22.04 seems to have support for a newer version of CUDA (11.7 series vs 11.3 series)

As a local test, I added to cuda_base_image_tags.json:

+  "11.7.1-cudnn8-devel-ubuntu22.04",

That in turn means you have to update the Dockerfile generator, because the package name changed:

-       python-openssl \
+       python3-openssl \

and then tell the whisper cog.yaml to use cuda: 11.7.1

Started a cog build to try out later

anotherjesse commented 2 years ago

$ cog predict cog-cog-whisper -i audio=@whisper.mp4

Starting Docker image cog-cog-whisper and running setup()...
Running prediction...
Transcribe with base model
{
  "detected_language": "english",
  "transcription": " I am just trying the recorder out with my mobile phone because why not?"
}
Stopping container...

Yay, at least for a small 5-second sample, it worked with the larger model.

I'm pushing a build to https://replicate.com/anotherjesse/whisper-updated to allow others to test if they wish.

anotherjesse commented 1 year ago

This seems to be fixed by the latest changes from @chenxwh.