[Closed] jonaskratochvil closed this issue 3 years ago
I have exactly the same issue. Have you managed to resolve it?
What about training loss? Is it decreasing?
Can you please double-check that NeMo restores only the encoder and decoder checkpoints and not the TRAINER checkpoint (the optimizer's state)? For example, if you are using NeMo v0.10 with the quartznet.py script, please make sure that load_dir doesn't contain any TRAINER*.pt files.
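To act on the suggestion above, you can move any trainer-state files out of the checkpoint directory before resuming. A minimal sketch; `"checkpoints"` is a placeholder for your actual load_dir:

```python
import glob
import os
import shutil

load_dir = "checkpoints"  # placeholder: replace with your actual load_dir
backup_dir = os.path.join(load_dir, "trainer_state_backup")
os.makedirs(backup_dir, exist_ok=True)

# Move TRAINER*.pt (optimizer/trainer state) aside so that only the
# encoder/decoder weight checkpoints remain to be restored.
for path in glob.glob(os.path.join(load_dir, "TRAINER*.pt")):
    shutil.move(path, os.path.join(backup_dir, os.path.basename(path)))
```

Keeping a backup instead of deleting means you can still resume the original run later if needed.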
I've used both v0.10 (jasper.py) and v0.11 (speech2text.py), through the latest NGC container, and both behave the same. In each case I started training from the latest v2 pre-trained multidataset QuartzNet. Training and validation loss both go down (lr=1.5e-4) up to a certain point; then validation WER shoots up to 100%, while training loss continues to decrease normally and predictions on training samples remain good. Using the saved checkpoints with speech2text_infer.py returns an empty string for every dev_data sample.
Fine-tuning with a lower lr (1e-5) seems to fix the problem.
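The two fixes discussed in this thread can be combined: load only the pretrained weights (not the optimizer state) and start a fresh optimizer at a low fine-tuning lr. This is a minimal plain-PyTorch sketch, not NeMo's actual API; the tiny `Linear` model stands in for the QuartzNet encoder/decoder:

```python
import torch

# Stand-in for the pretrained QuartzNet encoder/decoder.
model = torch.nn.Linear(64, 32)
torch.save(model.state_dict(), "pretrained_weights.pt")  # pretend pretrained checkpoint

# Restore WEIGHTS ONLY -- no optimizer/trainer state comes along.
model.load_state_dict(torch.load("pretrained_weights.pt"))

# Fresh optimizer with a low fine-tuning learning rate (1e-5, as suggested
# above), instead of resuming the old optimizer state and lr schedule.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```

The point is that a resumed optimizer state (momentum buffers, lr schedule position) from the first run can destabilize the second fine-tuning run, which matches the nan-loss / 100% WER symptom reported here.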
@Jovianan What code are you using to re-train (fine-tune)? I have a problem with Hydra:
raise ValueError(f'Invalid Datatype for loaders: {type(self.loaders).__name__}')
ValueError: Invalid Datatype for loaders: NoneType
Hello,
I have used the QuartzNet pretrained checkpoint to fine-tune the ASR model on my custom data. This fine-tuning works fine, but when I use the newly obtained checkpoint to fine-tune the model on yet another dataset, I get 100% WER and a validation loss of nan from the first evaluation onward, throughout the whole training. Is there any specific reason why this should be the case? I left the training script untouched between the two fine-tuning runs. When I use the original QuartzNet checkpoint and fine-tune directly on my second dataset, both the WER and the loss decrease as expected. Any help would be appreciated.
Jonas