NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Load the model once for repeated inference #411

Closed: LironHarazi closed this issue 4 years ago

LironHarazi commented 4 years ago

Hi, I'm using the TTS example for inference (tts_infer.py). According to the code given in the example, the checkpoints have to be loaded in order to run inference, so for each new inference we have to reload them. How can I load the model once for repeated inference (each time with different text)?

blisc commented 4 years ago

I'm not sure what you mean. tts_infer.py loads the checkpoints once for tacotron2 and once for waveglow. Can you elaborate on your issue?

LironHarazi commented 4 years ago

Look at this abstract example as pseudo-code:

1. Load checkpoints for tacotron2.
2. Load checkpoints for waveglow.
3. Input text and infer.
4. Input text and infer.
5. ...and so on.

In tts_infer.py you have to reload the model with checkpoints for each new inference, as in the "infer" function:

evaluated_tensors = neural_factory.infer(
    tensors=infer_tensors, checkpoint_dir=args.spec_model_load_dir, cache=use_cache, offload_to_cpu=False,
)

Here you pass checkpoint_dir for each new inference.

For example, in PyTorch you load a model checkpoint once with model = torch.load(PATH), and from then on you can run inference without reloading the model. Hope I'm clear.
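As a minimal sketch of that load-once pattern in plain PyTorch (PATH, texts, and preprocess are placeholders for illustration, not names from tts_infer.py):

import torch

model = torch.load(PATH)  # load the full model checkpoint once
model.eval()

with torch.no_grad():  # repeated inference, no reloading
    for text in texts:
        output = model(preprocess(text))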

blisc commented 4 years ago

In neural_factory.infer(), checkpoint_dir is an optional parameter. I'm fairly certain you only have to pass checkpoint_dir the first time you run your model. For example, in the tts_infer script, you can do the following:

# First call: pass checkpoint_dir so the checkpoints are loaded from disk.
evaluated_tensors = neural_factory.infer(
    tensors=infer_tensors, checkpoint_dir=args.spec_model_load_dir, cache=use_cache, offload_to_cpu=False,
)
# Later calls: omit checkpoint_dir; the already-loaded weights are reused.
evaluated_tensors = neural_factory.infer(
    tensors=infer_tensors, cache=use_cache, offload_to_cpu=False,
)

And the second time you call infer, there will be no loading. Unfortunately, we do not support interactive data layers, so you would have to create your own. Or you could just run the modules with native PyTorch by calling waveglow(*inputs, force_pt=True), etc.
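As an illustrative sketch of that force_pt route (only the force_pt=True flag comes from the description above; spec and its origin are assumptions, not the actual tts_infer.py variables):

import torch

# Run the already-instantiated waveglow module as a plain PyTorch module.
# `spec` stands in for a mel spectrogram produced by the tacotron2 modules.
with torch.no_grad():
    audio = waveglow(spec, force_pt=True)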

LironHarazi commented 4 years ago

Thanks, that solved it. But I still have a memory leak: when I re-run inference with the loaded model, after 8 to 10 calls the GPU RAM fills up and the application terminates (some part of the code is not freeing memory, or it accumulates data). Do you know what I can do about it?
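An illustrative way to confirm the accumulation between calls, using standard PyTorch CUDA memory APIs (infer_once is a placeholder for one neural_factory.infer(...) call):

import torch

for i in range(10):
    infer_once()  # placeholder for a single inference call
    # memory_allocated() reports bytes held by live CUDA tensors;
    # steady growth across identical calls indicates accumulation.
    print(i, torch.cuda.memory_allocated() / 1024**2, "MiB")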

blisc commented 4 years ago

Can you try again on the latest master branch? I believe this was an issue in v0.9.0 but it should be fixed in master. You can install master by cloning the repo, cd-ing into NeMo, running pip install nemo_toolkit[all], and then running the latest tts_infer.py. Note that the configs have changed slightly.

Alternatively, in v0.9.0, can you set cache and use_cache to False when running the infer() calls? Unfortunately, doing this will cause tacotron 2 to run on every call in the current setup. You can probably remove the first tacotron2 infer call if you are using waveglow.
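An illustrative version of that v0.9.0 workaround, assuming the same tts_infer.py variables as the snippet above:

# With caching disabled, intermediate tensors are not kept between calls,
# so the full pipeline runs each time.
evaluated_tensors = neural_factory.infer(
    tensors=infer_tensors, checkpoint_dir=args.spec_model_load_dir, cache=False, offload_to_cpu=False,
)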

okuchaiev commented 4 years ago

Closing. @LironHarazi, please re-open if you still experience the issue.