NLP / GSLM: Did anyone succeed in making K-means clustering model? (S2U)

nonmetal commented 1 year ago

❓ Questions and Help

What is your question?

Hello, I'm having trouble training a well-converged K-means clustering model for S2U. I have been trying to train the K-means model on various corpora. Since my previous attempts failed, I decided to go back to the original setup and train as described in the paper.

Quantization. We use k-means to convert continuous frame representations into discrete representation by training on LibriSpeech clean-100h (Panayotov et al., 2015). We experiment with codebooks that have 50, 100, and 200 units.

In the published GSLM paper, the authors state that they obtained quantized units by training on LibriSpeech clean-100h. I downloaded the LibriSpeech train-clean-100 corpus, trained a K-means clustering model, and ran re-synthesis.

However, the result was very different: the model I trained produced only babbling and could not re-synthesize anything resembling the input speech. During training, I saw that the EWA inertia decreased only very slightly as the mini-batches proceeded. Training eventually reported that the model had converged due to lack of improvement in inertia.
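
For context on the slowly decreasing EWA inertia: scikit-learn's `MiniBatchKMeans` smooths the per-batch inertia with an exponentially weighted average before testing for improvement. The sketch below is a simplified reading of that rule, not the library's own code, so check the source of the installed version. The batch size of 10,000 and the rate of 50 frames per second are assumptions, and the inertia values are made up.

```python
def ewa_alpha(batch_size, n_samples):
    """Smoothing factor: close to 0 when one batch is a tiny part of the data."""
    return min(1.0, 2.0 * batch_size / (n_samples + 1))


def track_ewa(batch_inertias, alpha, max_no_improvement):
    """Return (smoothed values, step at which early stopping fires or None).

    Stopping fires once the smoothed inertia has failed to reach a new
    minimum for `max_no_improvement` consecutive batches.
    """
    ewa, best, stale = None, None, 0
    trace = []
    for step, inertia in enumerate(batch_inertias, start=1):
        ewa = inertia if ewa is None else ewa * (1 - alpha) + inertia * alpha
        trace.append(ewa)
        if best is None or ewa < best:
            best, stale = ewa, 0
        else:
            stale += 1
        if stale >= max_no_improvement:
            return trace, step
    return trace, None


# Assumed numbers: 100 h of audio at 50 frames/s, batches of 10,000 frames.
n_frames = 100 * 3600 * 50
alpha = ewa_alpha(batch_size=10_000, n_samples=n_frames)

# Made-up inertias: the true per-batch inertia drops from 1.0 to 0.9 at once.
inertias = [1.0] * 5 + [0.9] * 500
trace, stop = track_ewa(inertias, alpha, max_no_improvement=100)
print(f"alpha={alpha:.5f}  smoothed inertia after {len(trace)} batches={trace[-1]:.3f}  stop={stop}")
```

With a smoothing factor this small, the printed average moves only a fraction of the way toward the new level even after hundreds of batches, so a very slow decrease in the log is not by itself evidence of a bad model.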

The input sample, the corrupted output sample (from the model trained as described above), and the clean output sample (from the pre-trained model) are uploaded below.

Input Sample / Output Sample (clean, pre-trained) / Output Sample (corrupted, mine)

Do you have any idea why this is happening? If a GSLM developer could answer my question, it would be extremely helpful. Thanks a lot!

Code

First, I created a manifest file for the speech corpus with wav2vec_manifest.py.

python examples/wav2vec/wav2vec_manifest.py ./examples/wav2vec/LibriSpeech/train-clean-100 --dest ./fairseq/examples/wav2vec/manifest/libri100 --ext flac --valid-percent 0.01
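
Before training, it may be worth confirming that the manifest covers roughly the expected amount of audio. The sketch below assumes the layout that `wav2vec_manifest.py` normally writes (first line is the root directory, each following line is `relative_path<TAB>num_samples`) and a 16 kHz sample rate. `summarize_manifest` is a hypothetical helper, not part of fairseq.

```python
from pathlib import Path


def summarize_manifest(tsv_path, sample_rate=16_000):
    """Return (root_dir, number_of_files, total_hours) for a manifest file."""
    lines = Path(tsv_path).read_text(encoding="utf-8").splitlines()
    root = lines[0]
    entries = [ln for ln in lines[1:] if ln.strip()]
    total_samples = 0
    for entry in entries:
        # Each entry is "<relative path>\t<number of audio samples>".
        _rel_path, n_samples = entry.split("\t")
        total_samples += int(n_samples)
    return root, len(entries), total_samples / sample_rate / 3600
```

Calling `summarize_manifest("./examples/wav2vec/manifest/libri100/train.tsv")` should then report close to 100 hours for train-clean-100; a much smaller number would point to a path or extension problem in the manifest step.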

Then I trained the K-means clustering model. The code below is taken directly from the README in the gslm/speech2unit directory.

N_CLUSTERS=50  # assumed from the output file name; the post does not show this value
TYPE='hubert'
CKPT_PATH=./examples/textless_nlp/gslm/speech2unit/checkpoints/hubert_base_ls960.pt  # name must match $CKPT_PATH used below
LAYER=6
MANIFEST=./examples/wav2vec/manifest/libri100/train.tsv
KM_MODEL_PATH=./examples/textless_nlp/gslm/speech2unit/kmeans_saved/hubert50_new.bin

PYTHONPATH=. python examples/textless_nlp/gslm/speech2unit/clustering/cluster_kmeans.py \
    --num_clusters $N_CLUSTERS \
    --feature_type $TYPE \
    --checkpoint_path $CKPT_PATH \
    --layer $LAYER \
    --manifest_path $MANIFEST \
    --out_kmeans_model_path $KM_MODEL_PATH

Finally, I ran re-synthesis with resynthesize_speech.py. The only difference between the clean and corrupted outputs was $KM_MODEL_PATH.
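
One possibility that nobody in this thread confirms, so treat it as an assumption: K-means cluster indices are arbitrary, and a unit-to-speech model trained on units from one codebook will not necessarily interpret units from a separately trained codebook, even if both codebooks are equally good. The toy sketch below uses made-up 2-D points in place of real acoustic features and a hypothetical `nearest_unit` helper to show how the same partition can yield different unit IDs.

```python
def nearest_unit(frame, centroids):
    """Index of the centroid closest to `frame` (squared Euclidean distance)."""
    dists = [sum((f - c) ** 2 for f, c in zip(frame, cen)) for cen in centroids]
    return dists.index(min(dists))


# Toy 2-D "features"; real HuBERT features are much higher-dimensional.
frames = [(0.1, 0.0), (4.9, 5.2), (9.8, 0.1), (5.1, 4.8), (0.0, 0.2)]

# Two codebooks with the SAME centroids stored in a different order, as can
# happen when K-means is re-run with a different initialization.
codebook_a = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
codebook_b = [(10.0, 0.0), (0.0, 0.0), (5.0, 5.0)]

units_a = [nearest_unit(f, codebook_a) for f in frames]
units_b = [nearest_unit(f, codebook_b) for f in frames]
print(units_a)  # [0, 1, 2, 1, 0]
print(units_b)  # [1, 2, 0, 2, 1]
```

Both sequences group the frames identically, yet a synthesizer that learned what units 0, 1, and 2 sound like from the first codebook would read the second sequence as different content. If that applies here, the unit-to-speech model would need to be trained on units from the new K-means model.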

What have you tried?

I first thought the corpus was too small, so I downloaded the LibriSpeech 500h set to enlarge the dataset. However, the resulting model also produced babbling.

I also tried changing the number of K-means clusters (codebook size) from 50 to 200, but the result was the same.

What's your environment?

(main environment)

fairseq Version (e.g., 1.0 or main): main(0.12.2)
PyTorch Version (e.g., 1.0): 1.12.1
OS (e.g., Linux): Ubuntu 20.04.5 LTS (GNU/Linux 5.4.0-125-generic x86_64)
How you installed fairseq (pip, source): git clone https://github.com/pytorch/fairseq
Build command you used (if compiling from source): pip install --editable ./
Python version: 3.9.13
CUDA/cuDNN version: Build cuda_11.3.r11.3/compiler.29745058_0
GPU models and configuration: GeForce RTX 3090 (NVIDIA Corporation Device 2204 (rev a1))
Any other relevant information: -

lzl1456 commented 1 year ago

@nonmetal
Hello, have you solved this problem? The K-means model I trained on 1000 hours of data is not working either. The labels it produces for the data are basically all the same. Is it related to the default training values? In addition to the following parameters
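
A quick way to quantify "the labels are basically the same" is to count how many distinct units appear and how evenly they are used. This is a minimal sketch with made-up unit sequences; `unit_usage` is a hypothetical helper, and the real input would be the unit IDs predicted for a held-out set of utterances.

```python
import math
from collections import Counter


def unit_usage(units, n_clusters):
    """Return (number of distinct units used, normalized entropy in [0, 1])."""
    counts = Counter(units)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    max_entropy = math.log(n_clusters) if n_clusters > 1 else 1.0
    return len(counts), entropy / max_entropy


collapsed = [7] * 98 + [3, 12]          # nearly every frame gets unit 7
spread = [i % 50 for i in range(1000)]  # all 50 units used equally often

print(unit_usage(collapsed, 50))
print(unit_usage(spread, 50))
```

A normalized entropy near 0 means the codebook has effectively collapsed onto a few units, while a value near 1 means the units are used evenly.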

PrabhjotKaurGosal commented 1 year ago

When I execute the following

N_CLUSTERS=50  # assumed from the output file name; the post does not show this value
TYPE='hubert'
CKPT_PATH=./examples/textless_nlp/gslm/speech2unit/checkpoints/hubert_base_ls960.pt  # name must match $CKPT_PATH used below
LAYER=6
MANIFEST=./examples/wav2vec/manifest/libri100/train.tsv
KM_MODEL_PATH=./examples/textless_nlp/gslm/speech2unit/kmeans_saved/hubert50_new.bin

PYTHONPATH=. python examples/textless_nlp/gslm/speech2unit/clustering/cluster_kmeans.py \
    --num_clusters $N_CLUSTERS \
    --feature_type $TYPE \
    --checkpoint_path $CKPT_PATH \
    --layer $LAYER \
    --manifest_path $MANIFEST \
    --out_kmeans_model_path $KM_MODEL_PATH

I get the following error about a missing argument. Looking at the code in https://github.com/facebookresearch/fairseq/blob/main/examples/textless_nlp/gslm/speech2unit/pretrained/utils.py, the error is legitimate.

Error:

2023-06-08 18:21:04 | INFO | main | Extracting hubert acoustic features...
Traceback (most recent call last):
  File "examples/textless_nlp/gslm/speech2unit/clustering/cluster_kmeans.py", line 212, in <module>
    main(args, logger)
  File "examples/textless_nlp/gslm/speech2unit/clustering/cluster_kmeans.py", line 157, in main
    get_features(
TypeError: get_features() missing 1 required positional argument: 'channel_id'

What is channel_id? Is there a reason it is not among the arguments passed when running examples/textless_nlp/gslm/speech2unit/clustering/cluster_kmeans.py?

Thanks!

nangying1112 commented 10 months ago

@nonmetal @lzl1456 Hi, have you solved this problem? I ran into the same problem.

Haoheya commented 9 months ago

@PrabhjotKaurGosal I ran into the same problem. It happens because "cluster_kmeans.py" calls the "get_features()" function in "speech2unit/pretrained/utils.py", which requires the positional parameter "channel_id". I gave it a default value of None by changing the line at https://github.com/facebookresearch/fairseq/blob/7409af7f9a7b6ddac4cbfe7cafccc715b3c1b21e/examples/textless_nlp/gslm/speech2unit/pretrained/utils.py#L71 to

    feature_type, checkpoint_path, layer, manifest_path, sample_pct, flatten, channel_id=None

This let me train the model, but the model I got did not work well.
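
The workaround above relies on ordinary Python argument rules. The sketch below uses hypothetical stand-in functions with a shortened signature (not the real fairseq `get_features`) to show why the call fails and two ways to make it succeed.

```python
def get_features_strict(manifest_path, sample_pct, flatten, channel_id):
    """Hypothetical stand-in: `channel_id` is a required positional argument."""
    return {"manifest": manifest_path, "channel_id": channel_id}


def get_features_patched(manifest_path, sample_pct, flatten, channel_id=None):
    """Same stand-in, but `channel_id` now defaults to None."""
    return {"manifest": manifest_path, "channel_id": channel_id}


# A caller written before `channel_id` existed passes only three arguments.
try:
    get_features_strict("train.tsv", 0.1, True)
except TypeError as err:
    print("strict:", err)

# Option 1: give the parameter a default in the function definition.
print("patched:", get_features_patched("train.tsv", 0.1, True))

# Option 2: leave the function untouched and pass the value at the call site.
print("explicit:", get_features_strict("train.tsv", 0.1, True, channel_id=None))
```

Either option only removes the `TypeError`; neither changes how the features are computed, which is consistent with the model quality being unaffected by this fix.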

tarudesu commented 4 months ago

Does anyone know of a solution as of now?

jojonki commented 4 months ago

I also faced this problem. Passing channel_id=None is fine whether you use mono or stereo audio. https://github.com/facebookresearch/fairseq/blob/34973a94d09ecc12092a5ecc8afece5e536b7692/examples/textless_nlp/gslm/speech2unit/pretrained/hubert_feature_reader.py#L34C46-L34C56

mahendraphd commented 1 week ago

Did anyone succeed in training the K-means clustering model? I am facing the same issue: the K-means model I trained does not give good results.
