Open nonmetal opened 2 years ago
@nonmetal
Hello, have you solved this problem? The k-means model I trained on 1000 hours of speech is not working either; the labels it produces are basically all the same. Could this be related to a default training value? Is there anything that needs to be set besides the parameters below? This is what I execute:
TYPE=hubert
N_CLUSTERS=50  # inferred from the km-model filename below
CKPT_PATH=./examples/textless_nlp/gslm/speech2unit/checkpoints/hubert_base_ls960.pt
LAYER=6
MANIFEST=./examples/wav2vec/manifest/libri100/train.tsv
KM_MODEL_PATH=./examples/textless_nlp/gslm/speech2unit/kmeans_saved/hubert50_new.bin
PYTHONPATH=. python examples/textless_nlp/gslm/speech2unit/clustering/cluster_kmeans.py \
--num_clusters $N_CLUSTERS \
--feature_type $TYPE \
--checkpoint_path $CKPT_PATH \
--layer $LAYER \
--manifest_path $MANIFEST \
--out_kmeans_model_path $KM_MODEL_PATH
I get the following error about a missing argument. Looking into the code at https://github.com/facebookresearch/fairseq/blob/main/examples/textless_nlp/gslm/speech2unit/pretrained/utils.py, the error is legitimate.
Error:
2023-06-08 18:21:04 | INFO | __main__ | Extracting hubert acoustic features...
Traceback (most recent call last):
File "examples/textless_nlp/gslm/speech2unit/clustering/cluster_kmeans.py", line 212, in
...
TypeError: get_features() missing 1 required positional argument: 'channel_id'
What is channel_id, and why is it not included among the arguments when running examples/textless_nlp/gslm/speech2unit/clustering/cluster_kmeans.py?
Thanks!
@nonmetal @lzl1456 Hi, have you solved this problem? I met the same problem.
@PrabhjotKaurGosal I met the same problem. It happens because cluster_kmeans.py calls the get_features() function in speech2unit/pretrained/utils.py, which requires the positional argument channel_id. As a workaround I gave it a default value of None, i.e. I changed https://github.com/facebookresearch/fairseq/blob/7409af7f9a7b6ddac4cbfe7cafccc715b3c1b21e/examples/textless_nlp/gslm/speech2unit/pretrained/utils.py#L71 to
feature_type, checkpoint_path, layer, manifest_path, sample_pct, flatten, channel_id=None
and trained the model. However, the model I got did not work well.
Does anyone know of an up-to-date solution?
I also faced this problem. Just passing channel_id=None is fine if you use mono or stereo audio:
https://github.com/facebookresearch/fairseq/blob/34973a94d09ecc12092a5ecc8afece5e536b7692/examples/textless_nlp/gslm/speech2unit/pretrained/hubert_feature_reader.py#L34C46-L34C56
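To make the role of channel_id concrete, here is a hypothetical helper (the name read_channel and the downmix behaviour are my own illustration, not fairseq's actual code): with multi-channel input, channel_id selects one channel; None leaves mono audio untouched and, in this sketch, downmixes multi-channel input by averaging.

```python
import numpy as np

def read_channel(wav: np.ndarray, channel_id=None) -> np.ndarray:
    """Return a mono waveform from a loaded audio array.

    wav: shape (num_samples,) for mono, or (num_channels, num_samples).
    channel_id: index of the channel to keep; None leaves mono input
    as-is and (in this sketch) downmixes multi-channel by averaging.
    """
    if wav.ndim == 1:        # already mono: channel_id is irrelevant
        return wav
    if channel_id is None:   # no channel chosen: average the channels
        return wav.mean(axis=0)
    return wav[channel_id]   # keep only the requested channel
```

This is why channel_id=None is harmless for mono corpora such as LibriSpeech: the branch that uses it is never reached.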
Did anyone succeed in training the K-means clustering model? I am facing the same issue: the model I build does not give good results.
Same issue: TypeError: get_features() missing 1 required positional argument: 'channel_id'. I am using the latest fairseq codebase and am unable to proceed.
❓ Questions and Help
What is your question?
Hello, I'm having trouble training a well-converged K-means clustering model for S2U (speech2unit). I have tried training it on various corpus types. As my previous attempts failed, I decided to go back to basics and train exactly as described in the paper.
In the published GSLM paper, the authors mention that they obtained quantized units by training on LibriSpeech clean-100h. I downloaded the 100h LibriSpeech-clean corpus, trained the K-means clustering model, and ran a re-synthesis.
However, the result was very different: my model only produced babbling and could not re-synthesize anything resembling the input speech. During training I could see that, as the minibatches proceeded, the EWA inertia decreased only a tiny bit. Training eventually stops, reporting that the model has converged due to the lack of improvement in inertia.
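For anyone debugging the same symptom, the convergence behaviour can be reproduced with a toy mini-batch k-means that tracks an exponentially weighted average (EWA) of per-batch inertia, similar to what the training log prints. This is a from-scratch numpy sketch with names of my own choosing, not the fairseq/scikit-learn implementation:

```python
import numpy as np

def minibatch_kmeans(feats, k, batch=256, steps=200, alpha=0.1, init=None, seed=0):
    """Toy mini-batch k-means that tracks an EWA of per-batch mean inertia."""
    rng = np.random.default_rng(seed)
    centers = (init.copy() if init is not None
               else feats[rng.choice(len(feats), k, replace=False)].copy())
    counts = np.zeros(k)
    ewa = None
    for _ in range(steps):
        xb = feats[rng.choice(len(feats), batch, replace=False)]
        # squared distance of every batch point to every center
        d = ((xb[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        inertia = d[np.arange(len(xb)), assign].mean()
        ewa = inertia if ewa is None else alpha * inertia + (1 - alpha) * ewa
        for j in np.unique(assign):
            members = xb[assign == j]
            counts[j] += len(members)
            lr = len(members) / counts[j]        # decaying per-center rate
            centers[j] += lr * (members.mean(axis=0) - centers[j])
    return centers, ewa
```

If the EWA inertia stays close to its starting value on your extracted features, the features themselves are effectively unclusterable at that k, which would be consistent with near-uniform unit labels and babbling output.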
The input sample, the corrupted output sample (from the model trained as above), and the clean output sample (from the pre-trained model) are uploaded below.
Input Sample / Output Sample (clean, pre-trained) / Output Sample (corrupted, mine)
Do you have any idea why this is happening? If a GSLM developer could answer my question, it would be extremely helpful. Thanks a lot!
Code
First, I made a manifest file for the speech corpus with wav2vec_manifest.py:
python examples/wav2vec/wav2vec_manifest.py ./examples/wav2vec/LibriSpeech/train-clean-100 --dest ./fairseq/examples/wav2vec/manifest/libri100 --ext flac --valid-percent 0.01
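As a sanity check on this step, wav2vec_manifest.py writes a TSV whose first line is the corpus root and whose remaining lines are "<relative path>\t<number of samples>". A quick sketch of that layout, with made-up LibriSpeech-style paths and sample counts:

```python
# Illustrative only: the root, paths, and sample counts below are made up.
root = "/data/LibriSpeech/train-clean-100"
rows = [
    ("103/1240/103-1240-0000.flac", 225360),
    ("103/1240/103-1240-0001.flac", 255120),
]
manifest = root + "\n" + "\n".join(f"{path}\t{n}" for path, n in rows) + "\n"
print(manifest)
```

Confirming that train.tsv follows this layout (root line present, tab-separated rows) is a cheap check before blaming the k-means training itself.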
Then I trained the K-means clustering model. The command I used is taken directly from the README file in the gslm/speech2unit directory.
Finally, I did the re-synthesis using resynthesize_speech.py. The only difference between the clean and corrupted outputs was $KM_MODEL_PATH.
What have you tried?
I first thought the amount of data was not enough, so I downloaded LibriSpeech 500h to enlarge the dataset. However, the result was still babbling.
I also tried changing the number of K-means clusters (the codebook size) from 50 to 200, but the result was the same.
What's your environment?
(main environment)
git clone https://github.com/pytorch/fairseq
pip install --editable ./