NavodPeiris / speechlib

speechlib is a library that can do speaker diarization, transcription and speaker recognition on an audio file to create transcripts with actual speaker names
MIT License
138 stars · 12 forks

transcription in logs file is empty #18

Closed PiotrEsse closed 1 month ago

PiotrEsse commented 8 months ago

Hi, thank you for your work, but I am having an issue. There's no error, but after running your example I get an almost empty file in logs. The file contains only the following string:
zach (206.8 : 206.8) :

In the terminal there are no errors:

(speechlib39) piotr@Legion7:~/speechlib/examples$ python3 transcribe.py
/home/piotr/anaconda3/envs/speechlib39/lib/python3.9/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
/home/piotr/anaconda3/envs/speechlib39/lib/python3.9/site-packages/torch_audiomentations/utils/io.py:27: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
torchvision is not available - cannot save figures
obama_zach.wav is already in WAV format.
obama_zach.wav is already a mono audio file.
The file already has 16-bit samples.
config.yaml: 100%|██████████| 500/500 [00:00<00:00, 292kB/s]
pytorch_model.bin: 100%|██████████| 17.7M/17.7M [00:00<00:00, 19.4MB/s]
config.yaml: 100%|██████████| 318/318 [00:00<00:00, 36.2kB/s]
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.1.3. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../.cache/torch/pyannote/models--pyannote--segmentation/snapshots/c4c8ceafcbb3a7a280c2d357aee9fbc9b0be7f9b/pytorch_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.2.0+cu121. Bad things might happen unless you revert torch to 1.x.
running diarization...
diarization done. Time taken: 17 seconds.
running speaker recognition...
speaker recognition done. Time taken: 4 seconds.
running transcription...
config.json: 100%|██████████| 2.26k/2.26k [00:00<00:00, 660kB/s]
vocabulary.txt: 100%|██████████| 460k/460k [00:00<00:00, 1.02MB/s]
tokenizer.json: 100%|██████████| 2.20M/2.20M [00:00<00:00, 3.03MB/s]
model.bin: 100%|██████████| 1.53G/1.53G [00:58<00:00, 26.0MB/s]
Cannot check for SPDIF
transcription done. Time taken: 140 seconds.
(speechlib39) piotr@Legion7:~/speechlib/examples$ ls
README.md  audio_cache  logs  obama1.mp3  obama1.wav  obama_zach.wav  preprocess.py  pretrained_models  segments  temp  transcribe.py  voices
(speechlib39) piotr@Legion7:~/speechlib/examples$ python3 transcribe.py
/home/piotr/anaconda3/envs/speechlib39/lib/python3.9/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
/home/piotr/anaconda3/envs/speechlib39/lib/python3.9/site-packages/torch_audiomentations/utils/io.py:27: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
torchvision is not available - cannot save figures
obama_zach.wav is already in WAV format.
obama_zach.wav is already a mono audio file.
The file already has 16-bit samples.
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.1.3. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../.cache/torch/pyannote/models--pyannote--segmentation/snapshots/c4c8ceafcbb3a7a280c2d357aee9fbc9b0be7f9b/pytorch_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.2.0+cu121. Bad things might happen unless you revert torch to 1.x.
running diarization...
diarization done. Time taken: 14 seconds.
running speaker recognition...
speaker recognition done. Time taken: 4 seconds.
running transcription...
Cannot check for SPDIF
transcription done. Time taken: 82 seconds.

Content of the file:

(screenshot of the log file contents)

I have Python 3.9 in a clean conda environment. Whisper itself works flawlessly.

NavodPeiris commented 8 months ago
  1. Did you run the same example from this repo? If not, please post your code.
  2. What model size did you use?
  3. Did you input the path to the obama_zach file correctly?
  4. Can you run this in a normal Python environment instead of conda and tell me if the error persists?
PiotrEsse commented 8 months ago

Ad 1. Yes, I've run the same example without any changes: `~/speechlib/examples$ python3 transcribe.py`

obama_zach_143156_en.txt

Ad 2. I use medium.
Ad 3. Yes, it processes the file. It takes time: 79 seconds to be precise.
Ad 4. Sure, I'll have to prepare a clean WSL VM.

elia-morrison commented 5 months ago

This can happen for a number of reasons because of an overly broad try/except block in this function.

It literally says:

```python
try:
    trans = transcribe(file, language, modelSize, quantization)

    # return -> [[start time, end time, transcript], [start time, end time, transcript], ..]
    texts.append([segment[0], segment[1], trans])
except:
    pass
```
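As an illustration (this is not speechlib's actual code, and `transcribe_fn` and the argument names are placeholders), a safer pattern would record the exception instead of silently discarding it, so failures like the one below surface in the logs:

```python
import logging

logger = logging.getLogger(__name__)

def safe_transcribe(transcribe_fn, file, language, model_size, quantization):
    """Run a transcription callable, logging failures instead of hiding them.

    transcribe_fn stands in for speechlib's internal transcribe();
    the parameter names here are illustrative, not the library's API.
    Returns the transcript on success, or None on failure.
    """
    try:
        return transcribe_fn(file, language, model_size, quantization)
    except Exception:
        # logger.exception records the full traceback at ERROR level
        logger.exception("transcription failed for %s", file)
        return None
```

With this pattern, an empty output file would be accompanied by a traceback in the logs instead of appearing to succeed.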

I removed this via a monkeypatch and it revealed the actual issue:

ValueError: Requested float16 compute type, but the target device or backend do not support efficient float16 computation.

This is a common issue with faster-whisper and is discussed here: https://github.com/SYSTRAN/faster-whisper/issues/42. There may be a different error in your case.
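As a sketch of the usual workaround discussed in that issue (assuming faster-whisper's `WhisperModel` constructor, whose `compute_type` parameter accepts values such as "float16" and "int8"), you can select a compute type the target device actually supports instead of defaulting to float16:

```python
def pick_compute_type(device: str) -> str:
    """Choose a faster-whisper compute type the device can handle.

    float16 is only efficient on CUDA GPUs; on CPU, int8 is the
    fallback commonly recommended in the faster-whisper issue above.
    """
    return "float16" if device == "cuda" else "int8"

# Hypothetical usage (requires the faster-whisper package):
# from faster_whisper import WhisperModel
# model = WhisperModel("medium", device="cpu",
#                      compute_type=pick_compute_type("cpu"))
```

Whether speechlib exposes this setting depends on its `quantization` argument; passing int8 quantization may achieve the same effect.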

tomich commented 4 months ago

I'm having the same problem, and it could be partially solved with

https://github.com/NavodPeiris/speechlib/issues/37

In the meantime, I'll try to create a branch in my fork that doesn't use faster-whisper.

Abhishek-cmd13 commented 2 months ago

I am getting an empty file at the end when I use the Sinhala language. I know the codebase provides a different model for Sinhala than the normal Whisper model. Can you please help me with this?