ManimCommunity / manim-voiceover

Manim plugin for all things voiceover
https://voiceover.manim.community/en/stable
MIT License

Numpy ValueError while running RecorderService example #33

Open psmlbhor opened 1 year ago

psmlbhor commented 1 year ago

Description of bug / unexpected behavior

I am trying to run the basic usage example of manim-voiceover given at https://docs.manim.community/en/stable/guides/add_voiceovers.html . When I try to run it, I get the following error:

$ manim -pql voice_over.py --disable_caching
Manim Community v0.17.2

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/pranjal/PycharmProjects/CipherCompute/venv/lib/python3.10/site-packages/manim/cli/render/c │
│ ommands.py:115 in render                                                                         │
│                                                                                                  │
│   112 │   │   │   try:                                                                           │
│   113 │   │   │   │   with tempconfig({}):                                                       │
│   114 │   │   │   │   │   scene = SceneClass()                                                   │
│ ❱ 115 │   │   │   │   │   scene.render()                                                         │
│   116 │   │   │   except Exception:                                                              │
│   117 │   │   │   │   error_console.print_exception()                                            │
│   118 │   │   │   │   sys.exit(1)                                                                │
│                                                                                                  │
│ /home/pranjal/PycharmProjects/CipherCompute/venv/lib/python3.10/site-packages/manim/scene/scene. │
│ py:223 in render                                                                                 │
│                                                                                                  │
│    220 │   │   """                                                                               │
│    221 │   │   self.setup()                                                                      │
│    222 │   │   try:                                                                              │
│ ❱  223 │   │   │   self.construct()                                                              │
│    224 │   │   except EndSceneEarlyException:                                                    │
│    225 │   │   │   pass                                                                          │
│    226 │   │   except RerunSceneException as e:                                                  │
│                                                                                                  │
│ /home/pranjal/PycharmProjects/CipherCompute/circle_test/voice_over.py:17 in construct            │
│                                                                                                  │
│   14 │   │   circle = Circle()                                                                   │
│   15 │   │                                                                                       │
│   16 │   │   # Surround animation sections with with-statements:                                 │
│ ❱ 17 │   │   with self.voiceover(text="This circle is drawn as I speak.") as tracker:            │
│   18 │   │   │   self.play(Create(circle), run_time=tracker.duration)                            │
│   19 │   │   │   # The duration of the animation is received from the audio file                 │
│   20 │   │   │   # and passed to the tracker automatically.                                      │
│                                                                                                  │
│ /usr/lib/python3.10/contextlib.py:135 in __enter__                                               │
│                                                                                                  │
│   132 │   │   # they are only needed for recreation, which is not possible anymore               │
│   133 │   │   del self.args, self.kwds, self.func                                                │
│   134 │   │   try:                                                                               │
│ ❱ 135 │   │   │   return next(self.gen)                                                          │
│   136 │   │   except StopIteration:                                                              │
│   137 │   │   │   raise RuntimeError("generator didn't yield") from None                         │
│   138                                                                                            │
│                                                                                                  │
│ /home/pranjal/PycharmProjects/CipherCompute/venv/lib/python3.10/site-packages/manim_voiceover/vo │
│ iceover_scene.py:180 in voiceover                                                                │
│                                                                                                  │
│   177 │   │                                                                                      │
│   178 │   │   try:                                                                               │
│   179 │   │   │   if text is not None:                                                           │
│ ❱ 180 │   │   │   │   yield self.add_voiceover_text(text, **kwargs)                              │
│   181 │   │   │   elif ssml is not None:                                                         │
│   182 │   │   │   │   yield self.add_voiceover_ssml(ssml, **kwargs)                              │
│   183 │   │   finally:                                                                           │
│                                                                                                  │
│ /home/pranjal/PycharmProjects/CipherCompute/venv/lib/python3.10/site-packages/manim_voiceover/vo │
│ iceover_scene.py:64 in add_voiceover_text                                                        │
│                                                                                                  │
│    61 │   │   │   )                                                                              │
│    62 │   │                                                                                      │
│    63 │   │   dict_ = self.speech_service._wrap_generate_from_text(text, **kwargs)               │
│ ❱  64 │   │   tracker = VoiceoverTracker(self, dict_, self.speech_service.cache_dir)             │
│    65 │   │   self.add_sound(str(Path(self.speech_service.cache_dir) / dict_["final_audio"]))    │
│    66 │   │   self.current_tracker = tracker                                                     │
│    67                                                                                            │
│                                                                                                  │
│ /home/pranjal/PycharmProjects/CipherCompute/venv/lib/python3.10/site-packages/manim_voiceover/tr │
│ acker.py:58 in __init__                                                                          │
│                                                                                                  │
│    55 │   │   self.end_t = last_t + self.duration                                                │
│    56 │   │                                                                                      │
│    57 │   │   if "word_boundaries" in self.data:                                                 │
│ ❱  58 │   │   │   self._process_bookmarks()                                                      │
│    59 │                                                                                          │
│    60 │   def _process_bookmarks(self) -> None:                                                  │
│    61 │   │   self.bookmark_times = {}                                                           │
│                                                                                                  │
│ /home/pranjal/PycharmProjects/CipherCompute/venv/lib/python3.10/site-packages/manim_voiceover/tr │
│ acker.py:63 in _process_bookmarks                                                                │
│                                                                                                  │
│    60 │   def _process_bookmarks(self) -> None:                                                  │
│    61 │   │   self.bookmark_times = {}                                                           │
│    62 │   │   self.bookmark_distances = {}                                                       │
│ ❱  63 │   │   self.time_interpolator = TimeInterpolator(self.data["word_boundaries"])            │
│    64 │   │   net_text_len = len(remove_bookmarks(self.data["input_text"]))                      │
│    65 │   │   if "transcribed_text" in self.data:                                                │
│    66 │   │   │   transcribed_text_len = len(self.data["transcribed_text"].strip())              │
│                                                                                                  │
│ /home/pranjal/PycharmProjects/CipherCompute/venv/lib/python3.10/site-packages/manim_voiceover/tr │
│ acker.py:24 in __init__                                                                          │
│                                                                                                  │
│    21 │   │   │   self.x.append(wb["text_offset"])                                               │
│    22 │   │   │   self.y.append(wb["audio_offset"] / AUDIO_OFFSET_RESOLUTION)                    │
│    23 │   │                                                                                      │
│ ❱  24 │   │   self.f = interp1d(self.x, self.y)                                                  │
│    25 │                                                                                          │
│    26 │   def interpolate(self, distance: int) -> np.ndarray:                                    │
│    27 │   │   try:                                                                               │
│                                                                                                  │
│ /home/pranjal/PycharmProjects/CipherCompute/venv/lib/python3.10/site-packages/scipy/interpolate/ │
│ _interpolate.py:484 in __init__                                                                  │
│                                                                                                  │
│    481 │   │                                                                                     │
│    482 │   │   # Interpolation goes internally along the first axis                              │
│    483 │   │   self.y = y                                                                        │
│ ❱  484 │   │   self._y = self._reshape_yi(self.y)                                                │
│    485 │   │   self.x = x                                                                        │
│    486 │   │   del y, x  # clean up namespace to prevent misuse; use attributes                  │
│    487 │   │   self._kind = kind                                                                 │
│                                                                                                  │
│ /home/pranjal/PycharmProjects/CipherCompute/venv/lib/python3.10/site-packages/scipy/interpolate/ │
│ _polyint.py:110 in _reshape_yi                                                                   │
│                                                                                                  │
│   107 │   │   │   ok_shape = "%r + (N,) + %r" % (self._y_extra_shape[-self._y_axis:],            │
│   108 │   │   │   │   │   │   │   │   │   │      self._y_extra_shape[:-self._y_axis])            │
│   109 │   │   │   raise ValueError("Data must be of shape %s" % ok_shape)                        │
│ ❱ 110 │   │   return yi.reshape((yi.shape[0], -1))                                               │
│   111 │                                                                                          │
│   112 │   def _set_yi(self, yi, xi=None, axis=None):                                             │
│   113 │   │   if axis is None:                                                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: cannot reshape array of size 0 into shape (0,newaxis)
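
The last frames suggest that `TimeInterpolator` was handed an empty `word_boundaries` list, so `interp1d` received empty `x` and `y` arrays. Below is a minimal sketch of a guard that would fail earlier with a clearer message. `collect_offsets` is a hypothetical helper, not part of manim-voiceover, and the value of `AUDIO_OFFSET_RESOLUTION` is assumed purely for illustration; only the `text_offset` and `audio_offset` keys are taken from the traceback.

```python
# Hypothetical guard, NOT part of manim-voiceover.
# The constant's value is an assumption made for this example.
AUDIO_OFFSET_RESOLUTION = 10_000_000


def collect_offsets(word_boundaries):
    """Return (text_offsets, audio_offsets) or raise a descriptive error."""
    if not word_boundaries:
        # An empty list is what ends in the reshape error shown above.
        raise ValueError(
            "word_boundaries is empty; the transcription probably returned no text"
        )
    xs = [wb["text_offset"] for wb in word_boundaries]
    ys = [wb["audio_offset"] / AUDIO_OFFSET_RESOLUTION for wb in word_boundaries]
    return xs, ys
```

With a check like this, an empty transcription would surface as a readable message instead of a reshape error raised deep inside SciPy.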

Expected behavior

The example should run correctly, prompt me to select a recording device, and then record the audio.

How to reproduce the issue

Code for reproducing the problem

```python
from manim import *
from manim_voiceover import VoiceoverScene
from manim_voiceover.services.recorder import RecorderService


# Simply inherit from VoiceoverScene instead of Scene to get all the
# voiceover functionality.
class RecorderExample(VoiceoverScene):
    def construct(self):
        # You can choose from a multitude of TTS services,
        # or in this example, record your own voice:
        self.set_speech_service(RecorderService())

        circle = Circle()

        # Surround animation sections with with-statements:
        with self.voiceover(text="This circle is drawn as I speak.") as tracker:
            self.play(Create(circle), run_time=tracker.duration)
            # The duration of the animation is received from the audio file
            # and passed to the tracker automatically.

        # This part will not start playing until the previous voiceover is finished.
        with self.voiceover(text="Let's shift it to the left 2 units.") as tracker:
            self.play(circle.animate.shift(2 * LEFT), run_time=tracker.duration)
```

Additional media files

Images/GIFs

Logs

Terminal output

```
$ manim -v DEBUG -pql voice_over.py --disable_caching
Manim Community v0.17.2

[Traceback identical to the one shown in the description above]
ValueError: cannot reshape array of size 0 into shape (0,newaxis)
```

System specifications

System Details

- OS (with version, e.g., Windows 10 v2004 or macOS 10.15 (Catalina)): Linux Ubuntu 22.04.1
- RAM: 16GB
- Python version (`python/py/python3 --version`): 3.10.6
- Installed modules (provide output from `pip list`):

```
Package                  Version
------------------------ -----------
certifi                  2022.12.7
charset-normalizer       2.1.1
click                    8.1.3
click-default-group      1.2.2
cloup                    0.13.1
colour                   0.1.5
commonmark               0.9.1
decorator                5.1.1
evdev                    1.6.0
ffmpeg-python            0.2.0
filelock                 3.9.0
future                   0.18.2
glcontext                2.3.7
huggingface-hub          0.11.1
humanhash3               0.0.6
idna                     3.4
isosurfaces              0.1.0
manim                    0.17.2
manim-voiceover          0.2.1.post1
ManimPango               0.4.3
mapbox-earcut            1.0.1
moderngl                 5.7.4
moderngl-window          2.4.2
more-itertools           9.0.0
multipledispatch         0.6.0
mutagen                  1.46.0
networkx                 2.8.8
numpy                    1.24.1
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
packaging                22.0
Pillow                   9.3.0
pip                      22.3.1
playsound                1.3.0
PyAudio                  0.2.13
pycairo                  1.23.0
pydub                    0.25.1
pyglet                   2.0.2.1
Pygments                 2.13.0
PyGObject                3.42.2
pynput                   1.7.6
pyrr                     0.10.3
python-dotenv            0.21.0
python-xlib              0.33
PyYAML                   6.0
regex                    2022.10.31
requests                 2.28.1
rich                     12.6.0
scipy                    1.9.3
screeninfo               0.8.1
setuptools               60.2.0
six                      1.16.0
skia-pathops             0.7.4
sox                      1.4.1
srt                      3.5.2
stable-ts                1.0.1
svgelements              1.9.0
tokenizers               0.13.2
torch                    1.13.1
tqdm                     4.64.1
transformers             4.25.1
typing_extensions        4.4.0
urllib3                  1.26.13
watchdog                 2.2.0
wheel                    0.37.1
whisper                  1.0
```

LaTeX details

+ LaTeX distribution (e.g. TeX Live 2020):
+ Installed LaTeX packages:

FFMPEG

Output of `ffmpeg -version`:

```
ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
libavutil      56. 70.100 / 56. 70.100
libavcodec     58.134.100 / 58.134.100
libavformat    58. 76.100 / 58. 76.100
libavdevice    58. 13.100 / 58. 13.100
libavfilter     7.110.100 /  7.110.100
libswscale      5.  9.100 /  5.  9.100
libswresample   3.  9.100 /  3.  9.100
libpostproc    55.  9.100 / 55.  9.100
```

Additional comments

osolmaz commented 1 year ago

You recorded something using the recording CLI, right?

Can you also post the file media/voiceovers/cache.json under your project directory?

psmlbhor commented 1 year ago

@osolmaz I recorded it once or twice; I don't remember exactly. I pressed the key to listen to my recording and started getting the above error. When I retried, it stopped asking me to record and started throwing the error immediately. Here is the output from cache.json:

[
  {
    "input_text": "This circle is drawn as I speak.",
    "input_data": {
      "input_text": "This circle is drawn as I speak.",
      "service": "gtts"
    },
    "original_audio": "carolina-blue-pasta-bravo.mp3",
    "final_audio": "carolina-blue-pasta-bravo.mp3"
  },
  {
    "input_text": "Let's shift it to the left 2 units.",
    "input_data": {
      "input_text": "Let's shift it to the left 2 units.",
      "service": "gtts"
    },
    "original_audio": "sweet-november-mike-delta.mp3",
    "final_audio": "sweet-november-mike-delta.mp3"
  },
  {
    "input_text": "Now, let's transform it into a square.",
    "input_data": {
      "input_text": "Now, let's transform it into a square.",
      "service": "gtts"
    },
    "original_audio": "hawaii-gee-potato-july.mp3",
    "final_audio": "hawaii-gee-potato-july.mp3"
  },
  {
    "input_text": "Thank you for watching.",
    "input_data": {
      "input_text": "Thank you for watching.",
      "service": "gtts"
    },
    "original_audio": "twelve-moon-ceiling-nitrogen.mp3",
    "final_audio": "twelve-moon-ceiling-nitrogen.mp3"
  },
  {
    "input_text": "This circle is drawn as I speak.",
    "input_data": {
      "input_text": "This circle is drawn as I speak.",
      "config": {
        "format": 8,
        "channels": 1,
        "rate": 44100,
        "chunk": 512
      },
      "service": "recorder"
    },
    "original_audio": "stairway-ink-quebec-cold.mp3",
    "word_boundaries": [],
    "transcribed_text": "",
    "final_audio": "stairway-ink-quebec-cold.mp3"
  },
  {
    "input_text": "This circle is drawn as I speak.",
    "input_data": {
      "input_text": "This circle is drawn as I speak.",
      "config": {
        "format": 8,
        "channels": 1,
        "rate": 44100,
        "chunk": 512
      },
      "service": "recorder"
    },
    "original_audio": "stairway-ink-quebec-cold.mp3",
    "word_boundaries": [],
    "transcribed_text": "",
    "final_audio": "stairway-ink-quebec-cold.mp3"
  },
  {
    "input_text": "This circle is drawn as I speak.",
    "input_data": {
      "input_text": "This circle is drawn as I speak.",
      "config": {
        "format": 8,
        "channels": 1,
        "rate": 44100,
        "chunk": 512
      },
      "service": "recorder"
    },
    "original_audio": "stairway-ink-quebec-cold.mp3",
    "word_boundaries": [],
    "transcribed_text": "",
    "final_audio": "stairway-ink-quebec-cold.mp3"
  },
  {
    "input_text": "This circle is drawn as I speak.",
    "input_data": {
      "input_text": "This circle is drawn as I speak.",
      "config": {
        "format": 8,
        "channels": 1,
        "rate": 44100,
        "chunk": 512
      },
      "service": "recorder"
    },
    "original_audio": "stairway-ink-quebec-cold.mp3",
    "word_boundaries": [],
    "transcribed_text": "",
    "final_audio": "stairway-ink-quebec-cold.mp3"
  },
  {
    "input_text": "This circle is drawn as I speak.",
    "input_data": {
      "input_text": "This circle is drawn as I speak.",
      "config": {
        "format": 8,
        "channels": 1,
        "rate": 44100,
        "chunk": 512
      },
      "service": "recorder"
    },
    "original_audio": "stairway-ink-quebec-cold.mp3",
    "word_boundaries": [],
    "transcribed_text": "",
    "final_audio": "stairway-ink-quebec-cold.mp3"
  },
  {
    "input_text": "This circle is drawn as I speak.",
    "input_data": {
      "input_text": "This circle is drawn as I speak.",
      "config": {
        "format": 8,
        "channels": 1,
        "rate": 44100,
        "chunk": 512
      },
      "service": "recorder"
    },
    "original_audio": "stairway-ink-quebec-cold.mp3",
    "word_boundaries": [],
    "transcribed_text": "",
    "final_audio": "stairway-ink-quebec-cold.mp3"
  },
  {
    "input_text": "This circle is drawn as I speak.",
    "input_data": {
      "input_text": "This circle is drawn as I speak.",
      "config": {
        "format": 8,
        "channels": 1,
        "rate": 44100,
        "chunk": 512
      },
      "service": "recorder"
    },
    "original_audio": "stairway-ink-quebec-cold.mp3",
    "word_boundaries": [],
    "transcribed_text": "",
    "final_audio": "stairway-ink-quebec-cold.mp3"
  },
  {
    "input_text": "This circle is drawn as I speak.",
    "input_data": {
      "input_text": "This circle is drawn as I speak.",
      "config": {
        "format": 8,
        "channels": 1,
        "rate": 44100,
        "chunk": 512
      },
      "service": "recorder"
    },
    "original_audio": "stairway-ink-quebec-cold.mp3",
    "word_boundaries": [],
    "transcribed_text": "",
    "final_audio": "stairway-ink-quebec-cold.mp3"
  },
  {
    "input_text": "This circle is drawn as I speak.",
    "input_data": {
      "input_text": "This circle is drawn as I speak.",
      "config": {
        "format": 8,
        "channels": 1,
        "rate": 44100,
        "chunk": 512
      },
      "service": "recorder"
    },
    "original_audio": "stairway-ink-quebec-cold.mp3",
    "word_boundaries": [],
    "transcribed_text": "",
    "final_audio": "stairway-ink-quebec-cold.mp3"
  },
  {
    "input_text": "This circle is drawn as I speak.",
    "input_data": {
      "input_text": "This circle is drawn as I speak.",
      "config": {
        "format": 8,
        "channels": 1,
        "rate": 44100,
        "chunk": 512
      },
      "service": "recorder"
    },
    "original_audio": "stairway-ink-quebec-cold.mp3",
    "word_boundaries": [],
    "transcribed_text": "",
    "final_audio": "stairway-ink-quebec-cold.mp3"
  },
  {
    "input_text": "This circle is drawn as I speak.",
    "input_data": {
      "input_text": "This circle is drawn as I speak.",
      "config": {
        "format": 8,
        "channels": 1,
        "rate": 44100,
        "chunk": 512
      },
      "service": "recorder"
    },
    "original_audio": "stairway-ink-quebec-cold.mp3",
    "word_boundaries": [],
    "transcribed_text": "",
    "final_audio": "stairway-ink-quebec-cold.mp3"
  },
  {
    "input_text": "This circle is drawn as I speak.",
    "input_data": {
      "input_text": "This circle is drawn as I speak.",
      "config": {
        "format": 8,
        "channels": 1,
        "rate": 44100,
        "chunk": 512
      },
      "service": "recorder"
    },
    "original_audio": "stairway-ink-quebec-cold.mp3",
    "word_boundaries": [],
    "transcribed_text": "",
    "final_audio": "stairway-ink-quebec-cold.mp3"
  }
]

osolmaz commented 1 year ago

@psmlbhor I have yet to implement sanity checks, but you can always remove any problematic recordings from cache.json manually.

Did you listen to the mp3 files? Maybe there is an issue with the encoding.
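
The manual cleanup could also be scripted. The following is a rough sketch, not a feature of manim-voiceover: `drop_empty_recordings` and `clean_cache` are hypothetical helpers, and the sketch assumes that cache.json is a JSON list of entries like the ones posted above and that an empty `word_boundaries` list marks a failed transcription.

```python
import json
from pathlib import Path


def drop_empty_recordings(entries):
    """Keep every entry except those whose "word_boundaries" list is empty."""
    # Entries without the key (e.g. the gtts ones above) are kept as they are.
    return [entry for entry in entries if entry.get("word_boundaries") != []]


def clean_cache(path="media/voiceovers/cache.json"):
    """Rewrite the cache file without empty recordings; return how many were dropped."""
    cache_file = Path(path)
    entries = json.loads(cache_file.read_text())
    kept = drop_empty_recordings(entries)
    cache_file.write_text(json.dumps(kept, indent=2))
    return len(entries) - len(kept)
```

Back up cache.json before running something like this, since it rewrites the file in place. It does not delete the mp3 files themselves.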

psmlbhor commented 1 year ago

@osolmaz I removed the entries in cache.json. I also deleted the previous audio files manually and retried. This time it was able to record successfully again, but still gave the initial error:

$ manim -pql voice_over.py --disable_caching
Manim Community v0.17.2

ALSA lib pcm.c:2664:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2664:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2664:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
ALSA lib pcm_route.c:877:(find_matching_chmap) Found no matching channel map
ALSA lib pcm_route.c:877:(find_matching_chmap) Found no matching channel map
ALSA lib pcm_route.c:877:(find_matching_chmap) Found no matching channel map
ALSA lib pcm_route.c:877:(find_matching_chmap) Found no matching channel map
ALSA lib pcm_oss.c:397:(_snd_pcm_oss_open) Cannot open device /dev/dsp
ALSA lib pcm_oss.c:397:(_snd_pcm_oss_open) Cannot open device /dev/dsp
ALSA lib confmisc.c:160:(snd_config_get_card) Invalid field card
ALSA lib pcm_usb_stream.c:482:(_snd_pcm_usb_stream_open) Invalid card 'card'
ALSA lib confmisc.c:160:(snd_config_get_card) Invalid field card
ALSA lib pcm_usb_stream.c:482:(_snd_pcm_usb_stream_open) Invalid card 'card'
-------------------------device list-------------------------
Input Device id  0  -  HDA Intel PCH: ALC294 Analog (hw:0,0)
Input Device id  12  -  sysdefault
Input Device id  18  -  samplerate
Input Device id  19  -  speexrate
Input Device id  20  -  pulse
Input Device id  21  -  upmix
Input Device id  22  -  vdownmix
Input Device id  24  -  default
-------------------------------------------------------------
Please select an input device id to record from:
0
Selected device: HDA Intel PCH: ALC294 Analog (hw:0,0)
╔══════════════════════════════════╗
║ Voiceover:                       ║
║                                  ║
║ This circle is drawn as I speak. ║
╚══════════════════════════════════╝
Press and hold the 'r' key to begin recording
Wait for 1 second, then start speaking.
Wait for at least 1 second after you finish speaking.
This is to eliminate any sounds that may come from your keyboard.
The silence at the beginning and end will be trimmed automatically.
You can adjust this setting using the `trim_silence_threshold` argument.
These instructions are only shown once.
Release the 'r' key to end recording
rStream active: True
start Stream
rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrFinished recording, saving to media/voiceovers/stairway-ink-quebec-cold.mp3
[12/30/22 16:12:32] INFO     Saved media/voiceovers/stairway-ink-quebec-cold.mp3                                                                                              helper.py:36
Press...
 l to [l]isten to the recording
 r to [r]e-record
 a to [a]ccept the recording

l
Press...
 l to [l]isten to the recording
 r to [r]e-record
 a to [a]ccept the recording

a
/home/pranjal/PycharmProjects/CipherCompute/venv/lib/python3.10/site-packages/stable_whisper/whisper_word_level.py:169: UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detected language: hindi
/home/pranjal/PycharmProjects/CipherCompute/venv/lib/python3.10/site-packages/stable_whisper/stabilization.py:371: UserWarning: No Segments Found
  warnings.warn('No Segments Found')
/home/pranjal/PycharmProjects/CipherCompute/venv/lib/python3.10/site-packages/stable_whisper/whisper_word_level.py:466: UserWarning: No segments found, whole-word timestamps cannot be added.
  add_whole_word_ts(tokenizer, all_segments,
[12/30/22 16:13:25] INFO     Transcription:                                                                                                                                     base.py:86
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/pranjal/PycharmProjects/CipherCompute/venv/lib/python3.10/site-packages/manim/cli/render/c │
│ ommands.py:115 in render                                                                         │
│                                                                                                  │
│   112 │   │   │   try:                                                                           │
│   113 │   │   │   │   with tempconfig({}):                                                       │
│   114 │   │   │   │   │   scene = SceneClass()                                                   │
│ ❱ 115 │   │   │   │   │   scene.render()                                                         │
│   116 │   │   │   except Exception:                                                              │
│   117 │   │   │   │   error_console.print_exception()                                            │
│   118 │   │   │   │   sys.exit(1)                                                                │
│                                                                                                  │
│ /home/pranjal/PycharmProjects/CipherCompute/venv/lib/python3.10/site-packages/manim/scene/scene. │
│ py:223 in render                                                                                 │
│                                                                                                  │
│    220 │   │   """                                                                               │
│    221 │   │   self.setup()                                                                      │
│    222 │   │   try:                                                                              │
│ ❱  223 │   │   │   self.construct()                                                              │
│    224 │   │   except EndSceneEarlyException:                                                    │
│    225 │   │   │   pass                                                                          │
│    226 │   │   except RerunSceneException as e:                                                  │
│                                                                                                  │
│ /home/pranjal/PycharmProjects/CipherCompute/circle_test/voice_over.py:17 in construct            │
│                                                                                                  │
│   14 │   │   circle = Circle()                                                                   │
│   15 │   │                                                                                       │
│   16 │   │   # Surround animation sections with with-statements:                                 │
│ ❱ 17 │   │   with self.voiceover(text="This circle is drawn as I speak.") as tracker:            │
│   18 │   │   │   self.play(Create(circle), run_time=tracker.duration)                            │
│   19 │   │   │   # The duration of the animation is received from the audio file                 │
│   20 │   │   │   # and passed to the tracker automatically.                                      │
│                                                                                                  │
│ /usr/lib/python3.10/contextlib.py:135 in __enter__                                               │
│                                                                                                  │
│   132 │   │   # they are only needed for recreation, which is not possible anymore               │
│   133 │   │   del self.args, self.kwds, self.func                                                │
│   134 │   │   try:                                                                               │
│ ❱ 135 │   │   │   return next(self.gen)                                                          │
│   136 │   │   except StopIteration:                                                              │
│   137 │   │   │   raise RuntimeError("generator didn't yield") from None                         │
│   138                                                                                            │
│                                                                                                  │
│ /home/pranjal/PycharmProjects/CipherCompute/venv/lib/python3.10/site-packages/manim_voiceover/vo │
│ iceover_scene.py:180 in voiceover                                                                │
│                                                                                                  │
│   177 │   │                                                                                      │
│   178 │   │   try:                                                                               │
│   179 │   │   │   if text is not None:                                                           │
│ ❱ 180 │   │   │   │   yield self.add_voiceover_text(text, **kwargs)                              │
│   181 │   │   │   elif ssml is not None:                                                         │
│   182 │   │   │   │   yield self.add_voiceover_ssml(ssml, **kwargs)                              │
│   183 │   │   finally:                                                                           │
│                                                                                                  │
│ /home/pranjal/PycharmProjects/CipherCompute/venv/lib/python3.10/site-packages/manim_voiceover/vo │
│ iceover_scene.py:64 in add_voiceover_text                                                        │
│                                                                                                  │
│    61 │   │   │   )                                                                              │
│    62 │   │                                                                                      │
│    63 │   │   dict_ = self.speech_service._wrap_generate_from_text(text, **kwargs)               │
│ ❱  64 │   │   tracker = VoiceoverTracker(self, dict_, self.speech_service.cache_dir)             │
│    65 │   │   self.add_sound(str(Path(self.speech_service.cache_dir) / dict_["final_audio"]))    │
│    66 │   │   self.current_tracker = tracker                                                     │
│    67                                                                                            │
│                                                                                                  │
│ /home/pranjal/PycharmProjects/CipherCompute/venv/lib/python3.10/site-packages/manim_voiceover/tr │
│ acker.py:58 in __init__                                                                          │
│                                                                                                  │
│    55 │   │   self.end_t = last_t + self.duration                                                │
│    56 │   │                                                                                      │
│    57 │   │   if "word_boundaries" in self.data:                                                 │
│ ❱  58 │   │   │   self._process_bookmarks()                                                      │
│    59 │                                                                                          │
│    60 │   def _process_bookmarks(self) -> None:                                                  │
│    61 │   │   self.bookmark_times = {}                                                           │
│                                                                                                  │
│ /home/pranjal/PycharmProjects/CipherCompute/venv/lib/python3.10/site-packages/manim_voiceover/tr │
│ acker.py:63 in _process_bookmarks                                                                │
│                                                                                                  │
│    60 │   def _process_bookmarks(self) -> None:                                                  │
│    61 │   │   self.bookmark_times = {}                                                           │
│    62 │   │   self.bookmark_distances = {}                                                       │
│ ❱  63 │   │   self.time_interpolator = TimeInterpolator(self.data["word_boundaries"])            │
│    64 │   │   net_text_len = len(remove_bookmarks(self.data["input_text"]))                      │
│    65 │   │   if "transcribed_text" in self.data:                                                │
│    66 │   │   │   transcribed_text_len = len(self.data["transcribed_text"].strip())              │
│                                                                                                  │
│ /home/pranjal/PycharmProjects/CipherCompute/venv/lib/python3.10/site-packages/manim_voiceover/tr │
│ acker.py:24 in __init__                                                                          │
│                                                                                                  │
│    21 │   │   │   self.x.append(wb["text_offset"])                                               │
│    22 │   │   │   self.y.append(wb["audio_offset"] / AUDIO_OFFSET_RESOLUTION)                    │
│    23 │   │                                                                                      │
│ ❱  24 │   │   self.f = interp1d(self.x, self.y)                                                  │
│    25 │                                                                                          │
│    26 │   def interpolate(self, distance: int) -> np.ndarray:                                    │
│    27 │   │   try:                                                                               │
│                                                                                                  │
│ /home/pranjal/PycharmProjects/CipherCompute/venv/lib/python3.10/site-packages/scipy/interpolate/ │
│ _interpolate.py:484 in __init__                                                                  │
│                                                                                                  │
│    481 │   │                                                                                     │
│    482 │   │   # Interpolation goes internally along the first axis                              │
│    483 │   │   self.y = y                                                                        │
│ ❱  484 │   │   self._y = self._reshape_yi(self.y)                                                │
│    485 │   │   self.x = x                                                                        │
│    486 │   │   del y, x  # clean up namespace to prevent misuse; use attributes                  │
│    487 │   │   self._kind = kind                                                                 │
│                                                                                                  │
│ /home/pranjal/PycharmProjects/CipherCompute/venv/lib/python3.10/site-packages/scipy/interpolate/ │
│ _polyint.py:110 in _reshape_yi                                                                   │
│                                                                                                  │
│   107 │   │   │   ok_shape = "%r + (N,) + %r" % (self._y_extra_shape[-self._y_axis:],            │
│   108 │   │   │   │   │   │   │   │   │   │      self._y_extra_shape[:-self._y_axis])            │
│   109 │   │   │   raise ValueError("Data must be of shape %s" % ok_shape)                        │
│ ❱ 110 │   │   return yi.reshape((yi.shape[0], -1))                                               │
│   111 │                                                                                          │
│   112 │   def _set_yi(self, yi, xi=None, axis=None):                                             │
│   113 │   │   if axis is None:                                                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: cannot reshape array of size 0 into shape (0,newaxis)

After this, I am not able to rerun it: it throws the error immediately, without even prompting me to record.

psmlbhor commented 1 year ago

I forgot to mention that manim didn't create a video when it was able to record.

osolmaz commented 1 year ago

@psmlbhor The main issue is that transcription fails: you can see in the traceback that word_boundaries is empty.

There are multiple things going slightly wrong here. The example you're trying to render doesn't have bookmarks, so in principle it should never invoke anything that requires word boundary calculations. I'll try to reproduce this locally and fix it in the next version.
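A guard against the empty case could look roughly like the sketch below. This is not the plugin's actual fix; it only illustrates skipping interpolation when no word boundaries were transcribed. The `text_offset` and `audio_offset` keys are taken from the traceback above, while `build_time_points` and its `resolution` parameter are hypothetical names made up for this example.

```python
from typing import List, Optional, Tuple


def build_time_points(
    word_boundaries: List[dict], resolution: float = 1.0
) -> Optional[Tuple[List[int], List[float]]]:
    """Return (text_offsets, times), or None when there is nothing to interpolate.

    Hypothetical helper: `resolution` stands in for whatever scaling the
    caller applies to audio offsets.
    """
    # An empty list is what a failed transcription produces; passing it on
    # to an interpolator is what triggers the reshape error.
    if not word_boundaries:
        return None
    xs = [wb["text_offset"] for wb in word_boundaries]
    ys = [wb["audio_offset"] / resolution for wb in word_boundaries]
    return xs, ys
```

A caller would then build the interpolator only when the result is not `None`, and otherwise fall back to having no bookmark timing at all.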

On the other hand, it seems like Whisper failed to install. Can you check whether you can call whisper in the terminal? You could try to record an mp3 and transcribe it in the terminal like

whisper your_file.mp3

See the Whisper repo for more details:

https://github.com/openai/whisper
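If it is unclear whether the `whisper` command is visible from the environment you run manim in, a quick check from Python might look like the sketch below. It relies only on the standard library's `shutil.which`; the helper name `find_cli` is made up for this example.

```python
import shutil
from typing import Optional


def find_cli(name: str = "whisper") -> Optional[str]:
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_cli()
    print(path if path else "whisper was not found on PATH")
```

Run it with the same virtual environment activated that you use for manim, since the command may be installed in one environment but not another.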

psmlbhor commented 1 year ago

@osolmaz Sure, do let me know if you are able to repro it or what fix I can do on my side.

I tried your example for whisper and it looks like it works:

$ whisper stairway-ink-quebec-cold.mp3 --language Hindi --model medium
100%|█████████████████████████████████████| 1.42G/1.42G [03:31<00:00, 7.21MiB/s]
/home/pranjal/PycharmProjects/CipherCompute/venv/lib/python3.10/site-packages/whisper/transcribe.py:78: UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
$ ls -lrt
total 740
-rw-rw-r-- 1 pranjal pranjal 748191 Dec 30 16:12 stairway-ink-quebec-cold.mp3
-rw-rw-r-- 1 pranjal pranjal    898 Dec 30 16:49 cache.json
-rw-rw-r-- 1 pranjal pranjal      8 Jan  2 12:37 stairway-ink-quebec-cold.mp3.vtt
-rw-rw-r-- 1 pranjal pranjal      0 Jan  2 12:37 stairway-ink-quebec-cold.mp3.txt
-rw-rw-r-- 1 pranjal pranjal      0 Jan  2 12:37 stairway-ink-quebec-cold.mp3.srt

The generated files, however, don't contain anything.

osolmaz commented 1 year ago

That shouldn't happen. Either there is something wrong with the recorded mp3s, or with your Whisper installation.

Can you also try out whisper with this sample from LJSpeech? LJ037-0171.wav.zip

psmlbhor commented 1 year ago

@osolmaz

$ whisper LJ037-0171.wav 
/home/pranjal/PycharmProjects/CipherCompute/venv/lib/python3.10/site-packages/whisper/transcribe.py:78: UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:05.760]  The examination and testimony of the experts enabled the commission to conclude that five
[00:05.760 --> 00:29.760]  shots may have been fired.

$ ls -lrt LJ037-0171.wav*
-rw-r--r-- 1 pranjal pranjal 334496 Jan  2 16:12 LJ037-0171.wav
-rw-rw-r-- 1 pranjal pranjal    175 Jan  2 18:49 LJ037-0171.wav.vtt
-rw-rw-r-- 1 pranjal pranjal    117 Jan  2 18:49 LJ037-0171.wav.txt
-rw-rw-r-- 1 pranjal pranjal    183 Jan  2 18:49 LJ037-0171.wav.srt

$ cat LJ037-0171.wav.txt 
The examination and testimony of the experts enabled the commission to conclude that five
shots may have been fired.

$ cat LJ037-0171.wav.srt 
1
00:00:00,000 --> 00:00:05,760
The examination and testimony of the experts enabled the commission to conclude that five

2
00:00:05,760 --> 00:00:29,760
shots may have been fired.

osolmaz commented 1 year ago

@psmlbhor it seems like Whisper is working. You were just rendering the recorder example, right? I can't reproduce this on my side.

Can you edit the corresponding file in your manim_voiceover installation and either add a breakpoint at the lines linked below or print what transcribe() returns? In VS Code, I can jump to the source of any installed package with Cmd+Click on anything imported from it. You can debug by changing the installed files under your environment's site-packages folder.

https://github.com/ManimCommunity/manim-voiceover/blob/e2a88262ea5dd5b8433180f23e7bac8e604b0407/manim_voiceover/services/base.py#L90-L92
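If you are not using VS Code, one way to find the installed file is to ask Python where the module lives. The sketch below uses only the standard library; `installed_module_path` is a hypothetical helper name, and the module name `manim_voiceover.services.base` is inferred from the path in the link above.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def installed_module_path(module_name: str) -> Optional[Path]:
    """Return the source file of an importable module, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        # Raised when a parent package of a dotted name is not installed.
        return None
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin)


if __name__ == "__main__":
    # Assumed module name, based on the linked file manim_voiceover/services/base.py
    print(installed_module_path("manim_voiceover.services.base"))
```

The printed path is the file to open and edit when adding the breakpoint or print statement.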

I've also made a minor update to manim-voiceover in the meantime. Can you update to 0.2.2?