[Question] What is range times?

maepopi commented 3 months ago

Hey there!

I'm starting my investigation of your tool as promised in our discussion on the Metavoice repo, and I have a few questions for you! First of all, I was wondering what the "range-times" were referring to? I thought it was the approximate duration of each segmented clip but it doesn't look like it. Also, I have passed a 2:50 audio through the tool, and it came out not segmented.

Maybe I'm doing something wrong?

Thanks :)

davidmartinrius commented 3 months ago

Hi @maepopi, please read this section first: https://github.com/davidmartinrius/speech-dataset-generator?tab=readme-ov-file#the-audio-is-not-always-100-splitted-into-sub-files

maepopi commented 3 months ago

Hey!

Oops, sorry, I had indeed read this section but had forgotten it. So I tried with one, then two enhancers (deepfilter and resembleai), and the audio was still reported as discarded.

When I tried with mayavoz in addition to the two others, I had a first error saying that the model was gated. I had seen in your readme that you needed to go to the model page on HF to accept the conditions, and that's what I did. I also exported the HF_TOKEN as an environment variable.

In spite of this, I still have an error, here it is in full:

Could not download 'pyannote/segmentation' model.
It might be because the model is private or gated so make
sure to authenticate. Visit https://hf.co/settings/tokens to
create your access token and retry with:

   >>> Model.from_pretrained('pyannote/segmentation',
   ...                       use_auth_token=YOUR_AUTH_TOKEN)

If this still does not work, it might be because the model is gated:
visit https://hf.co/pyannote/segmentation to accept the user conditions.
Traceback (most recent call last):
  File "/home/maelys/AI_PROJECTS/SOUND/TOOLS/speech-dataset-generator/speech_dataset_generator/main.py", line 59, in <module>
    process_audio_files(audio_files, output_directory, start, end, enhancers, datasets)
  File "/home/maelys/AI_PROJECTS/SOUND/TOOLS/speech-dataset-generator/speech_dataset_generator/audio_processor/audio_processor.py", line 144, in process_audio_files
    dataset_generator.process(audio_file, output_directory, start, end, enhancers, collection, datasets)
  File "/home/maelys/AI_PROJECTS/SOUND/TOOLS/speech-dataset-generator/speech_dataset_generator/dataset_generator/dataset_generator.py", line 437, in process
    transcription, language = self.get_transcription(enhanced_audio_file_path)
  File "/home/maelys/AI_PROJECTS/SOUND/TOOLS/speech-dataset-generator/speech_dataset_generator/dataset_generator/dataset_generator.py", line 188, in get_transcription
    diarize_model = whisperx.DiarizationPipeline(model_name='pyannote/speaker-diarization@2.1', use_auth_token=HF_TOKEN, device=device)
  File "/home/maelys/AI_PROJECTS/SOUND/TOOLS/speech-dataset-generator/venv/lib/python3.10/site-packages/whisperx/diarize.py", line 19, in __init__
    self.model = Pipeline.from_pretrained(model_name, use_auth_token=use_auth_token).to(device)
  File "/home/maelys/AI_PROJECTS/SOUND/TOOLS/speech-dataset-generator/venv/lib/python3.10/site-packages/pyannote/audio/core/pipeline.py", line 136, in from_pretrained
    pipeline = Klass(**params)
  File "/home/maelys/AI_PROJECTS/SOUND/TOOLS/speech-dataset-generator/venv/lib/python3.10/site-packages/pyannote/audio/pipelines/speaker_diarization.py", line 130, in __init__
    model: Model = get_model(segmentation, use_auth_token=use_auth_token)
  File "/home/maelys/AI_PROJECTS/SOUND/TOOLS/speech-dataset-generator/venv/lib/python3.10/site-packages/pyannote/audio/pipelines/utils/getter.py", line 89, in get_model
    model.eval()
AttributeError: 'NoneType' object has no attribute 'eval'

I tried changing the HF_TOKEN and reloading the code, and I checked the scripts to see whether I had to define HF_TOKEN myself, but it's already assigned to a variable. I am pretty sure my HF_TOKEN variable is set correctly, because it shows up when I run echo $HF_TOKEN.

That said, I am more used to using conda, and much less using venv so maybe I'm doing something wrong here.

Thanks!

davidmartinrius commented 3 months ago

Please read this section first: https://github.com/davidmartinrius/speech-dataset-generator?tab=readme-ov-file#needed-agreement-to-run-the-code

maepopi commented 3 months ago

Yep, I read that and followed the instructions... I "logged in" on both models, embedding and speaker diarization, and both pages say "you've been granted access to this model". And yet the code still crashes :/

davidmartinrius commented 3 months ago

Try hardcoding your hf token here https://github.com/davidmartinrius/speech-dataset-generator/blob/113b16774c9a6f45c771fc75e283a98a692e602e/speech_dataset_generator/dataset_generator/dataset_generator.py#L40

And tell me if it worked. If it does not, you could try creating a new token on HF. Maybe your token is restricted?

maepopi commented 3 months ago

Ok nothing worked :/

Here's what I tried:

- Generating new tokens on HF, in both write and read mode
- Writing the token in a .env file at the root of the project (in desperation, I even tried putting quotes around it)
- Hardcoding it as you said, both with and without quotes

At this point I think I'll nuke everything and try again from scratch. I'm doing this remotely right now, from my office, and for some reason I cannot see what I'm typing in the console once I'm in the venv. I wonder whether that's due to using Reemo, but anyway: best to start from scratch in good conditions and see how it goes.

I'll keep you posted when I get home

davidmartinrius commented 3 months ago

You did this for both? Important: Make sure to agree to share your contact information to access the pyannote embedding model. Similarly, access to the pyannote speaker diarization model may require similar agreement.

maepopi commented 3 months ago

Yep, I did it for both...

davidmartinrius commented 3 months ago

Also check that the HF account you use locally is the same one you use in the browser. Maybe you have multiple HF accounts, and maybe you are logged in to a different account in the terminal, or not logged in at all.

maepopi commented 3 months ago

Is there a way to see in the terminal which HF account you're logged in to? Although I only ever created one account, of that I'm pretty sure.
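One way to answer this from Python, as a minimal sketch: the `whoami` function in `huggingface_hub` returns the account a token belongs to. The helper names `mask_token` and `current_hf_user` are made up for this illustration, and reading the token from `HF_TOKEN` simply mirrors the variable exported earlier in the thread. From the shell, `huggingface-cli whoami` reports the account of the locally saved login.

```python
import os


def mask_token(token):
    """Return a token with its middle hidden so it is safe to print."""
    if not token:
        return "<not set>"
    if len(token) <= 8:
        return "*" * len(token)
    return token[:4] + "*" * (len(token) - 8) + token[-4:]


def current_hf_user(token=None):
    """Ask the Hugging Face Hub which account the token belongs to.

    Needs network access and the huggingface_hub package. With token=None
    the library falls back to the locally saved login, if there is one.
    """
    from huggingface_hub import whoami  # imported lazily on purpose

    return whoami(token=token)["name"]


# Usage (needs network access):
#   token = os.environ.get("HF_TOKEN")
#   print("HF_TOKEN seen by Python:", mask_token(token))
#   print("Logged in as:", current_hf_user(token))
```

Printing the masked token first shows whether the Python process sees the exported variable at all, which is a separate question from which account it belongs to.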

maepopi commented 3 months ago

OH wait... I dug further into the instructions of the HF models in question, and it seems there are more sub-models to agree to. This one, for instance, which I still don't have access to: https://huggingface.co/pyannote/segmentation-3.0. I'll try agreeing to everything and tell you how it goes.

maepopi commented 3 months ago

Arf... No, still nothing. While digging through the models you have to agree to, I noticed a dead link that is supposed to lead to pyannote/speaker-diarization-3.1: https://hf.co/pyannote-speaker-diarization-3.1. But I don't think that's related: the error message says it's the segmentation model it cannot load.

So...I'm out of ideas for now ^^'

EDIT : I think I finally succeeded. There was a detail I had overlooked : buried in the error message was this link https://hf.co/pyannote/segmentation => I think you need to give your credentials to THIS model (god what a maze!).

The computation is running right now, I'll see if it gets through!
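Since the missing agreement turned out to be on a repository that only the error message named, it may help to probe every gated repository in one go. The following is a hedged sketch, not part of the tool: `check_gated_access` and `hub_fetch` are hypothetical helpers, the repository ids are the ones appearing in the error message and traceback above, and using `config.yaml` as the probe file is an assumption. `hf_hub_download` is the real `huggingface_hub` function; a repository whose conditions have not been accepted typically surfaces as a `GatedRepoError`.

```python
def check_gated_access(repo_ids, fetch):
    """Try to fetch one small file from each repo and record the outcome.

    `fetch` is any callable that takes a repo id and raises on failure,
    so the check can also be exercised without network access.
    """
    results = {}
    for repo_id in repo_ids:
        try:
            fetch(repo_id)
            results[repo_id] = "ok"
        except Exception as exc:  # report the failure instead of crashing
            results[repo_id] = f"failed: {type(exc).__name__}"
    return results


def hub_fetch(token, filename="config.yaml"):
    """Build a fetch callable backed by huggingface_hub (needs network)."""
    from huggingface_hub import hf_hub_download  # imported lazily on purpose

    def fetch(repo_id):
        hf_hub_download(repo_id=repo_id, filename=filename, token=token)

    return fetch


# Usage (needs network access and a token):
#   import os
#   repos = ["pyannote/segmentation", "pyannote/speaker-diarization"]
#   print(check_gated_access(repos, hub_fetch(os.environ["HF_TOKEN"])))
```

Any repository reported as failed is a candidate for a conditions page that still has to be accepted with the same account that owns the token.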

maepopi commented 3 months ago

Okay, new problem: the computation aborted, and I can't figure out why. Here is the full log since the start of the mayavoz enhancing process. Do tell me if you'd rather I create another thread for this.

Lightning automatically upgraded your loaded checkpoint from v1.7.7 to v2.2.1. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../../../.cache/torch/mayavoz/f41e190ef46b968b37e0f8954f670fd2f101d9e9e36dbfd38167f29c47b2cd90.749856c8692645d6eb2cc81ba2e75a7e06108600cec9898a47223265bf61dba6`
removing silences
2024-04-05 16:52:09.906997: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-05 16:52:09.908286: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-05 16:52:09.908430: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-05 16:52:09.909293: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-05 16:52:09.909429: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-05 16:52:09.909551: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-05 16:52:09.909716: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-05 16:52:09.909841: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-05 16:52:09.909968: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-05 16:52:09.910074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 8145 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3060, pci bus id: 0000:05:00.0, compute capability: 8.6
1 Physical GPUs, 1 Logical GPUs
Loaded  speechmetrics.absolute.mosnet
Loaded  speechmetrics.absolute.srmr
2024-04-05 16:52:31.767786: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:465] Loaded cuDNN version 8902
CUDA is available. Using GPU.
No language specified, language will be first be detected for each audio file (increases inference time).
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.2.1. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../../../.cache/torch/whisperx-vad-segmentation.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.1.1+cu121. Bad things might happen unless you revert torch to 1.x.
Detected language: en (1.00) in first 30s of audio...
pytorch_model.bin: 100%|███████████████████████████████████████████████████████████████████████████████████████| 17.7M/17.7M [00:00<00:00, 100MB/s]
config.yaml: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 318/318 [00:00<00:00, 3.02MB/s]
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.2.1. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../../../.cache/torch/pyannote/models--pyannote--segmentation/snapshots/c4c8ceafcbb3a7a280c2d357aee9fbc9b0be7f9b/pytorch_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.1.1+cu121. Bad things might happen unless you revert torch to 1.x.
hyperparams.yaml: 100%|███████████████████████████████████████████████████████████████████████████████████████| 1.92k/1.92k [00:00<00:00, 14.4MB/s]
embedding_model.ckpt: 100%|████████████████████████████████████████████████████████████████████████████████████| 83.3M/83.3M [00:00<00:00, 106MB/s]
mean_var_norm_emb.ckpt: 100%|█████████████████████████████████████████████████████████████████████████████████| 1.92k/1.92k [00:00<00:00, 18.9MB/s]
classifier.ckpt: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 5.53M/5.53M [00:00<00:00, 100MB/s]
label_encoder.txt: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 129k/129k [00:00<00:00, 787kB/s]
Downloading data from https://github.com/ina-foss/inaSpeechSegmenter/releases/download/models/keras_speech_music_noise_cnn.hdf5
3244808/3244808 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
WARNING:absl:No training configuration found in the save file, so the model was *not* compiled. Compile it manually.
Downloading data from https://github.com/ina-foss/inaSpeechSegmenter/releases/download/models/keras_male_female_cnn.hdf5
6040200/6040200 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
WARNING:absl:No training configuration found in the save file, so the model was *not* compiled. Compile it manually.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1712328777.268597  125725 service.cc:145] XLA service 0x6194921610d0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1712328777.268630  125725 service.cc:153]   StreamExecutor device (0): NVIDIA GeForce RTX 3060, Compute Capability 8.6
2024-04-05 16:52:57.275008: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-04-05 16:52:57.366706: F external/local_xla/xla/service/gpu/triton_autotuner.cc:634] Non-OK-status: has_executable.status() status: INTERNAL: XLA requires ptxas version 11.8 or higherFailure occured when compiling fusion triton_gemm_dot.188 with config '{block_m:32,block_n:32,block_k:32,split_k:8,num_stages:1,num_warps:4}'
Fused HLO computation:
%triton_gemm_dot.188_computation (parameter_0: f32[32,256], parameter_1: f32[256,256]) -> f32[32,256] {
  %parameter_0 = f32[32,256]{1,0} parameter(0)
  %constant.4 = f32[] constant(0), metadata={op_type="Relu" op_name="sequential_3_1/activation_15_1/Relu" source_file="/home/maelys/AI_PROJECTS/SOUND/TOOLS/speech-dataset-generator/venv/lib/python3.10/site-packages/tensorflow/python/framework/ops.py" source_line=1177}
  %broadcast.38 = f32[32,256]{1,0} broadcast(f32[] %constant.4), dimensions={}, metadata={op_type="Relu" op_name="sequential_3_1/activation_19_1/Relu"}
  %maximum.9 = f32[32,256]{1,0} maximum(f32[32,256]{1,0} %parameter_0, f32[32,256]{1,0} %broadcast.38), metadata={op_type="Relu" op_name="sequential_3_1/activation_19_1/Relu"}
  %parameter_1 = f32[256,256]{1,0} parameter(1)
  ROOT %dot.0 = f32[32,256]{1,0} dot(f32[32,256]{1,0} %maximum.9, f32[256,256]{1,0} %parameter_1), lhs_contracting_dims={1}, rhs_contracting_dims={0}, frontend_attributes={grad_x="false",grad_y="false"}, metadata={op_type="MatMul" op_name="sequential_3_1/dense_7_1/MatMul" source_file="/home/maelys/AI_PROJECTS/SOUND/TOOLS/speech-dataset-generator/venv/lib/python3.10/site-packages/tensorflow/python/framework/ops.py" source_line=1177}
}
Aborted (core dumped)

Might it be because my input is 22050 Hz and not 16000 Hz, as you write in your readme?

If that's the case, none of the enhancers are enough to make my input splittable.

Thank you for your patience!
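To confirm the sample rate of an input before drawing conclusions from it, the header of a PCM WAV file can be read with Python's standard `wave` module. This is only a small sketch with a made-up helper name, `wav_sample_rate`. It reads the header and does not resample anything, and note that the fatal line in the log above complains about the ptxas version rather than about the audio.

```python
import wave


def wav_sample_rate(path):
    """Return the sample rate in Hz stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


# Usage:
#   print(wav_sample_rate("Centurions1_Book1_Chapter0_Preface.wav"))
```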

davidmartinrius commented 3 months ago

Can you tell me your OS, CPU, RAM, GPU, and CUDA toolkit version?

Also the complete command you are using.

maepopi commented 3 months ago

Yes

OS = Ubuntu 22.04.4 LTS
CPU = AMD Ryzen 7 5800x 8-core processor × 16
GPU = RTX 3060 12 GB VRAM
RAM = 64 GB

When typing nvcc --version in the venv, I have this

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

The command

python speech_dataset_generator/main.py --input_file_path /home/maelys/AI_PROJECTS/SOUND/DATA_CENTER/data_inputs/Mark_Noble/Audiobooks/CENTURION_SERIES/RIPPED/Betrayal/clips/Centurions1_Book1_Chapter0_Preface.wav --output_directory /home/maelys/AI_PROJECTS/SOUND/DATA_CENTER/data_inputs/Mark_Noble/Audiobooks/CENTURION_SERIES/RIPPED/Betrayal/clips/speech_dataset_generator_tests --range_times 6-11 --enhancers deepfilternet resembleai mayavoz

For good measure, I'll nuke everything and try from scratch
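The nvcc output above reports release 11.5, while the fatal line in the log says XLA requires ptxas version 11.8 or higher. A minimal sketch for comparing the two numbers follows. The helper names are hypothetical, and it assumes that `ptxas --version` prints the same "release X.Y" line as `nvcc --version`.

```python
import re


def parse_cuda_release(version_output):
    """Extract (major, minor) from `nvcc --version` style output."""
    match = re.search(r"release (\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in version output")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(release, minimum=(11, 8)):
    """True if the release tuple is at least the minimum.

    The default of 11.8 is taken from the XLA error message in the log.
    """
    return release >= minimum


# Usage on a machine with the toolkit on PATH:
#   import subprocess
#   out = subprocess.run(["ptxas", "--version"],
#                        capture_output=True, text=True).stdout
#   release = parse_cuda_release(out)
#   print(release, meets_minimum(release))
```

Run against the output pasted above, this gives (11, 5), which is below the 11.8 that the error message asks for.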

davidmartinrius commented 3 months ago

The hardware requirements and the OS are more than fine.

I have tested this project only with CUDA 12.1. I actually don't know whether version 11.5 is compatible or not.

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

maepopi commented 3 months ago

Ah! So somehow pip install -e . installed the wrong stuff. Let me try the installation again.

davidmartinrius commented 3 months ago

Well, it does not depend on setup.py or requirements. The cuda version actually depends on your nvidia cuda toolkit. There is a way that you can have multiple cuda versions in the OS and change the cuda version depending on your needs, if you would like to keep cuda 11.5

sudo update-alternatives --list cuda
sudo update-alternatives --config cuda

maepopi commented 3 months ago

Yeah, but isn't the nvidia cuda toolkit specific to each virtual environment? Because if I type nvidia-smi outside any virtual environment, I'm on 12.2, whereas when I call nvcc within the environment, I'm on 11.5. So I suppose it was indeed the venv that installed this?

Actually each conda environment I have for different projects has different cuda toolkits

davidmartinrius commented 3 months ago

Ah, you are using conda! Sorry, forget what I said.

maepopi commented 3 months ago

You know what, I'm going to try using conda instead of venv, see if I can fare better. I'm much more used to conda. Hang on. We'll get there eventually 😂

maepopi commented 3 months ago

Okay...Well there's just no way I can make this work...I've tried a lot of things during the past hour:

- Conda installation is a nightmare. Broken dependencies everywhere.
- Back to venv: I reinstalled everything and ran the same command, and I still get the Aborted (core dumped) message. However, the console shows this when running the different enhancers: 2024-04-05 21:01:39 | INFO | DF | Running on torch 2.1.1+cu121. I'm kind of confused, because I read cu121 as CUDA toolkit 12.1, so I don't get why nvcc --version gives me 11.5.
- I tried removing the mayavoz enhancer, then resembleai, then all enhancers. Same result.
- I tried another audio file as input. Same result, for every combination of enhancers.

I'm going to give up for now. If you have other ideas or roads I haven't taken, I'm all ears!

EDIT : If I might suggest, I would add to the readme that you need to accept conditions for HF segmentation model as well. I can try and make a pull request if you want, that will teach me how to do it

davidmartinrius commented 3 months ago

Well, it does not depend on setup.py or requirements. The cuda version actually depends on your nvidia cuda toolkit. There is a way that you can have multiple cuda versions in the OS and change the cuda version depending on your needs, if you would like to keep cuda 11.5

sudo update-alternatives --list cuda
sudo update-alternatives --config cuda

Have you tried this?

The CUDA version of the OS needs to match the CUDA version torch was built for. If you have 11.5 in your OS but you are using a torch build for CUDA 12.1, it won't work. You need to change the CUDA version of the OS.
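The two numbers being compared here can be read off mechanically. As a sketch with hypothetical helper names: the torch version string printed earlier in the thread, 2.1.1+cu121, carries the CUDA version of the build in its suffix, and inside Python `torch.version.cuda` reports the same information directly. Also worth knowing: the CUDA version printed by nvidia-smi is the highest version the driver supports, not the installed toolkit, which is why it can differ from what nvcc reports.

```python
import re


def torch_cuda_from_version(torch_version):
    """Read the CUDA version from a string such as '2.1.1+cu121'.

    Returns (major, minor), or None when there is no '+cuXYZ' suffix.
    Assumes a single-digit minor version, as in cu118 or cu121.
    """
    match = re.search(r"\+cu(\d+)(\d)$", torch_version)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def matches_toolkit(torch_version, toolkit_release):
    """True if the torch build's CUDA version equals the toolkit release."""
    return torch_cuda_from_version(torch_version) == tuple(toolkit_release)


# Usage with the values reported in this thread:
#   matches_toolkit("2.1.1+cu121", (11, 5))
```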

maepopi commented 3 months ago

Well, I'm in somewhat uncharted territory here, but what I know is that my OS CUDA is 12.2 (according to nvidia-smi).

When, inside the venv, I write sudo update-alternatives --list cuda, I have: /usr/local/cuda-12.3

When, inside the venv, I write sudo update-alternatives --config cuda, I have: /usr/local/cuda-12.3 Nothing to configure.

....Which again seems strange.

davidmartinrius commented 1 month ago

Closed due to inactivity
