facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Pretrain HuBERT on English and Chinese speech datasets #5526

Closed. shihuai closed this issue 2 days ago.

shihuai commented 4 months ago

Hi! I'm trying to pretrain HuBERT from scratch on an English and Chinese speech dataset. During pretraining, the first-iteration loss dropped from 6.7 to 3.3, and the second-iteration loss dropped from 11.2 to 4.0. The losses in both iterations still seem quite large. Is this a normal phenomenon?
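
For context, the standard HuBERT base recipe uses different targets in the two pretraining iterations: iteration 1 predicts k-means clusters of MFCC features (100 clusters in the paper), while iteration 2 predicts k-means clusters of features taken from an intermediate transformer layer of the iteration-1 model (layer 6 of the base model, 500 clusters). Because the label inventories differ, the absolute loss values of the two iterations are not directly comparable. Below is a minimal scikit-learn sketch of how such frame-level targets can be built, assuming the features have already been dumped to a NumPy file; the file names and cluster count are placeholders, and the actual recipe uses the scripts under examples/hubert/simple_kmeans.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Features dumped beforehand: MFCCs for iteration 1, or an intermediate
# transformer layer of the iteration-1 checkpoint for iteration 2.
feats = np.load("train_feats.npy")          # (num_frames, feat_dim), placeholder path

# Fit k-means on the frame-level features (500 clusters for iteration 2 in the paper).
km = MiniBatchKMeans(n_clusters=500, batch_size=10000, n_init=20)
km.fit(feats)

# Frame-level pseudo-labels used as pretraining targets.
labels = km.predict(feats)
np.save("train_km_labels.npy", labels)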

zw76859420 commented 4 months ago

Can you show the config of your training?

shihuai commented 4 months ago

I used hubert_base_librispeech.yaml for pretraining and only changed ddp_backend and max_sample_size:

common:
  fp16: true
  log_format: json
  log_interval: 200
  seed: 1337
  tensorboard_logdir: tblog

checkpoint:
  save_interval_updates: 25000
  keep_interval_updates: 1
  no_epoch_checkpoints: true

distributed_training:
  ddp_backend: c10d
  distributed_backend: 'nccl'
  distributed_world_size: 4
  distributed_port: 29671
  nprocs_per_node: 4
  find_unused_parameters: true

task:
  _name: hubert_pretraining
  data: ${task.data}
  label_dir: ${task.label_dir}
  labels: ${task.labels}
  label_rate: ${model.label_rate}
  sample_rate: 16000
  max_sample_size: 320000 #250000
  min_sample_size: 32000
  pad_audio: false
  random_crop: true
  normalize: false # must be consistent with extractor

dataset:
  num_workers: 6
  max_tokens: 1400000
  skip_invalid_size_inputs_valid_test: true
  validate_interval: 5
  validate_interval_updates: 10000

criterion:
  _name: hubert
  pred_masked_weight: 1.0
  pred_nomask_weight: 0.0
  loss_weights: [10,]

optimization:
  max_update: 400000
  lr: [0.00025]
  clip_norm: 10.0

optimizer:
  _name: adam
  adam_betas: (0.9,0.98)
  adam_eps: 1e-06
  weight_decay: 0.01

lr_scheduler:
  _name: polynomial_decay
  warmup_updates: 32000

model:
  _name: hubert
  label_rate: 100
  skip_masked: false
  skip_nomask: false
  mask_prob: 0.80
  extractor_mode: default
  conv_feature_layers: '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2'
  final_dim: 256
  encoder_layerdrop: 0.05
  dropout_input: 0.1
  dropout_features: 0.1
  dropout: 0.1
  attention_dropout: 0.1
  feature_grad_mult: 0.1
  untie_final_proj: true
  activation_dropout: 0.0

hydra:
  job:
    config:
      override_dirname:
        kv_sep: '-'
        item_sep: '__'
        exclude_keys:
          - run
          - task.data
          - task.label_dir
  run:
    dir: ???
  sweep:
    dir: ???
    subdir: ${hydra.job.config_name}__${hydra.job.override_dirname}
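
A quick sanity check on the numbers in this config (my own sketch; the values are copied from the YAML above): the conv_feature_layers extractor downsamples 16 kHz audio by a factor of 320, i.e. 50 feature frames per second, so max_sample_size: 320000 corresponds to 20-second crops. label_rate: 100 matches 100 Hz labels such as MFCC-based clusters; labels extracted from 50 Hz HuBERT features for a second iteration would normally use label_rate: 50 instead.

# Sanity-check arithmetic for the config above.
conv_feature_layers = [(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512, 2, 2)] * 2

downsample = 1
for _dim, _kernel, stride in conv_feature_layers:
    downsample *= stride

sample_rate = 16000
max_sample_size = 320000

print(downsample)                     # 320 audio samples per feature frame
print(sample_rate / downsample)       # 50.0 feature frames per second
print(max_sample_size / sample_rate)  # 20.0 seconds per crop
print(max_sample_size / downsample)   # 1000.0 feature frames per crop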

zw76859420 commented 3 months ago

On my side, the HuBERT training loss eventually converges to around 2.5. I used the WenetSpeech dataset for pretraining, which contains 10,000 hours of pure Chinese speech.

zw76859420 commented 3 months ago

We believe the key to evaluating a pretrained HuBERT base model is its performance on the main downstream tasks. You can finetune the model pretrained with your recipe and then test its accuracy on your tasks.

shihuai commented 3 months ago

OK, thank you for your reply! We have tried training a SpeechTokenizer on features extracted from our HuBERT model, and the reconstructed speech is also good. We will try more experiments on downstream tasks.
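
For reference, here is a minimal sketch of extracting frame-level HuBERT features with the fairseq API for this kind of tokenizer training; the checkpoint path, audio path, and layer index are placeholders, and the dump_hubert_feature.py script under examples/hubert/simple_kmeans does the same thing at scale.

import soundfile as sf
import torch
from fairseq import checkpoint_utils

# Load a pretrained HuBERT checkpoint (placeholder path).
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ["/path/to/hubert_checkpoint.pt"]
)
model = models[0].eval()

# Load a 16 kHz mono waveform; whether to normalize it must match the
# pretraining config (normalize: false here).
wav, sr = sf.read("/path/to/audio.wav", dtype="float32")
assert sr == 16000
source = torch.from_numpy(wav).unsqueeze(0)  # (1, num_samples)

with torch.no_grad():
    # Frame-level features from an intermediate transformer layer (e.g. layer 6).
    feats, _ = model.extract_features(
        source=source, padding_mask=None, mask=False, output_layer=6
    )

print(feats.shape)  # (1, num_frames, 768) for the base model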

dyyoungg commented 2 months ago

Will you open-source the checkpoint? I think it would be very helpful to the community.

GUOhm230 commented 2 weeks ago

I'm doing similar work now. Could you send me your configuration for reference?

shihuai commented 2 days ago

Yes, we plan to open-source the checkpoint. We are writing a paper now and will release our work after it is finished.