NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Model Training #5179

Closed. Project-fil closed this issue 1 year ago.

Project-fil commented 2 years ago

I am training a Russian (ru) model, and I'm not sure whether training is going well.

My code:

```python
import nemo
import nemo.collections.asr as nemo_asr
import pytorch_lightning as pl
from nemo.utils.exp_manager import exp_manager
from omegaconf import DictConfig, OmegaConf

# model_conf_path and the *_manifest_path variables are defined earlier (not shown)
params = OmegaConf.load(model_conf_path)

min_duration = 0.5
epochs = 150
gpus = 1

params.trainer.devices = gpus
params.trainer.max_epochs = epochs

params.model.train_ds.manifest_filepath = train_manifest_path
params.model.train_ds.min_duration = min_duration
params.model.validation_ds.manifest_filepath = validation_manifest_path
params.model.test_ds.manifest_filepath = test_manifest_path

params.exp_manager.exp_dir = "/path/to/save_tr_nemo/"

trainer = pl.Trainer(**params.trainer)
model = nemo_asr.models.EncDecCTCModel(cfg=params.model, trainer=trainer)

# exp_manager sets up logging and checkpointing for the trainer before fit()
exp_manager(trainer=trainer, cfg=params.exp_manager)

trainer.fit(model)
trainer.test(model)
```

Here is the YAML config: https://github.com/Project-fil/Project-fil/blob/ee79825609cca9e8b01a9c4eb43b564bb5c2b35d/quartznet_ru.yaml

The train manifest file looks like this:

```
{"audio_filepath": "/path/to/wav", "duration": 4.8, "text": "у вас просроченных дней две тысячи сто шестьдесят один дней"}
{"audio_filepath": "/path/to/wav", "duration": 9.1, "text": "выключай выключай давайте следующую свечу других нет сейчас расскажу один анекдот когда"}
```

Total files: 253,390. Total duration: 5050.82 hours.
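One way to double-check the totals above and to catch malformed manifest lines (for example a stray extra quote) is to scan the manifest before training. The sketch below is a hypothetical helper, not part of NeMo. It assumes one JSON object per line with `duration` and `text` keys, treats `min_duration`/`max_duration` as simple bounds (NeMo's own filtering may differ in detail), and accepts an optional character set `labels`, which you would copy from the `labels` list in your own model config if it has one.

```python
import json
from collections import Counter


def manifest_stats(lines, min_duration=0.5, max_duration=None, labels=None):
    """Summarise a JSON-lines ASR manifest (hypothetical helper, not a NeMo API).

    lines:        iterable of manifest lines, one JSON object per line
    min_duration: files shorter than this (seconds) are counted as filtered out
    max_duration: files longer than this (seconds) are counted as filtered out
    labels:       optional set of allowed characters, e.g. copied from the
                  `labels` list of your model config (assumption)
    """
    total_files, total_seconds = 0, 0.0
    kept_files, kept_seconds = 0, 0.0
    bad_lines = []             # 1-based numbers of lines that are not valid JSON
    unknown_chars = Counter()  # transcript characters missing from `labels`

    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            bad_lines.append(lineno)
            continue

        duration = float(entry["duration"])
        total_files += 1
        total_seconds += duration

        too_short = duration < min_duration
        too_long = max_duration is not None and duration > max_duration
        if not (too_short or too_long):
            kept_files += 1
            kept_seconds += duration

        if labels is not None:
            unknown_chars.update(ch for ch in entry["text"] if ch not in labels)

    return {
        "total_files": total_files,
        "total_hours": total_seconds / 3600,
        "kept_files": kept_files,
        "kept_hours": kept_seconds / 3600,
        "bad_lines": bad_lines,
        "unknown_chars": dict(unknown_chars),
    }
```

For example, `manifest_stats(open(train_manifest_path, encoding="utf-8"), min_duration=0.5)` reports how many files and hours remain after the duration filter; a non-empty `unknown_chars` would show transcript characters that the configured vocabulary does not contain.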

These are the files in the "checkpoints" directory; after 10 epochs, the validation WER never dropped below 0.9:

'QuartzNet15x5_ru--val_wer=0.9031-epoch=6.ckpt'
'QuartzNet15x5_ru--val_wer=0.9067-epoch=8.ckpt'
'QuartzNet15x5_ru--val_wer=0.9094-epoch=1.ckpt'
'QuartzNet15x5_ru--val_wer=0.9798-epoch=9-last.ckpt'

And now:

'QuartzNet15x5_ru--val_wer=0.9031-epoch=6.ckpt'
'QuartzNet15x5_ru--val_wer=0.9067-epoch=8.ckpt'
'QuartzNet15x5_ru--val_wer=0.9078-epoch=11.ckpt'
'QuartzNet15x5_ru--val_wer=0.9798-epoch=19-last.ckpt'
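To follow the validation WER across epochs, the `val_wer` and `epoch` values can be pulled out of the checkpoint filenames. This is a small sketch that only assumes the naming pattern visible in the listings above (`...--val_wer=<wer>-epoch=<n>[-last].ckpt`); the regular expression and the `parse_checkpoints`/`best_checkpoint` helpers are illustrative, not a NeMo API.

```python
import re

# Matches names such as
#   QuartzNet15x5_ru--val_wer=0.9031-epoch=6.ckpt
#   QuartzNet15x5_ru--val_wer=0.9798-epoch=19-last.ckpt
CKPT_RE = re.compile(
    r"val_wer=(?P<wer>\d+(?:\.\d+)?)-epoch=(?P<epoch>\d+)(?P<last>-last)?\.ckpt$"
)


def parse_checkpoints(names):
    """Return (epoch, val_wer, is_last) tuples sorted by epoch; skip unknown names."""
    rows = []
    for name in names:
        match = CKPT_RE.search(name)
        if match is None:
            continue
        rows.append(
            (int(match["epoch"]), float(match["wer"]), match["last"] is not None)
        )
    return sorted(rows)


def best_checkpoint(rows):
    """Return the row with the lowest validation WER."""
    return min(rows, key=lambda row: row[1])
```

Applied to the second listing, it orders the checkpoints by epoch and shows the lowest `val_wer` (0.9031) at epoch 6, while the `-last` checkpoint at epoch 19 reports 0.9798.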

Is this a normal training process, or am I doing something wrong?
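For reference, WER counts the word substitutions, deletions and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. Below is a minimal pure-Python sketch of that standard definition; it is not NeMo's implementation, which may also normalize text or aggregate differently. Under this definition an empty hypothesis scores exactly 1.0, so a WER near 0.9 is not far from that trivial baseline.

```python
def word_edit_distance(ref_words, hyp_words):
    """Minimum substitutions + deletions + insertions turning hyp into ref."""
    # prev[j] is the distance between the reference words processed so far
    # and the first j hypothesis words; the initial row is an empty reference.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        cur = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hyp
            insertion = cur[j - 1] + 1  # extra word in hyp
            cur[j] = min(substitution, deletion, insertion)
        prev = cur
    return prev[-1]


def corpus_wer(pairs):
    """Total word errors divided by total reference words over (ref, hyp) pairs."""
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref = reference.split()
        errors += word_edit_distance(ref, hypothesis.split())
        ref_words += len(ref)
    return errors / ref_words if ref_words else 0.0
```

Errors are summed over all utterances before dividing, the usual way a corpus-level WER is reported; insertions can push the value above 1.0.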

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 7 days since being marked as stale.