NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

creating a NeMo model #8601

Closed ShabnamRA closed 7 months ago

ShabnamRA commented 7 months ago

I am trying to learn NeMo from "tutorials/01_NeMo_Models.ipynb".

At the end of the notebook, after creating the NeMoGPTv2 class, I try to create a model: model = NeMoGPTv2(cfg=cfg.model)

This fails with the following error:

  File "/home/shabs/anaconda3/envs/NeMo/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-67-1b7caab869c2>", line 1, in <module>
    model = NeMoGPTv2(cfg=cfg.model)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "<ipython-input-31-f04b7157a9ba>", line 3, in __init__
    super().__init__(cfg=cfg, trainer=trainer)
  File "/home/shabs/anaconda3/envs/NeMo/lib/python3.11/site-packages/nemo/core/classes/modelPT.py", line 154, in __init__
    self.setup_multiple_validation_data(val_data_config=cfg.validation_ds)
  File "/home/shabs/anaconda3/envs/NeMo/lib/python3.11/site-packages/nemo/core/classes/modelPT.py", line 539, in setup_multiple_validation_data
    model_utils.resolve_validation_dataloaders(model=self)
  File "/home/shabs/anaconda3/envs/NeMo/lib/python3.11/site-packages/nemo/utils/model_utils.py", line 293, in resolve_validation_dataloaders
    model.setup_validation_data(cfg.validation_ds)
  File "<ipython-input-66-0c8f18429ac6>", line 23, in setup_validation_data
    vocab = f.read().split('')[:-1]  # the -1 here is for the dangling  token in the file
            ^^^^^^^^^^^^^^^^^^
ValueError: empty separator
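
The last frame can be reproduced without NeMo. The sketch below uses a made-up string (not the tutorial's vocab file) to show that str.split raises this exact error when given an empty string as the separator, while calling it with no argument splits on runs of whitespace instead.

```python
# Made-up sample text, used only to illustrate str.split behavior.
text = "a b\nc"

# An empty string is not a valid separator for str.split.
try:
    text.split('')
    message = None
except ValueError as err:
    message = str(err)

print(message)  # empty separator

# With no argument, split() breaks the string on runs of whitespace.
parts = text.split()
print(parts)  # ['a', 'b', 'c']
```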

ShabnamRA commented 7 months ago

In the modified version below, split() is called without a separator, so it splits on runs of whitespace (spaces, tabs, or newlines). This resolved the ValueError caused by the empty separator. The tutorial needs to be modified as follows:

class NeMoGPTv2(NeMoGPT):

    def setup_training_data(self, train_data_config: OmegaConf):
        self.vocab = None
        self._train_dl = self._setup_data_loader(train_data_config)

        # Save the vocab into a text file for now
        with open('vocab.txt', 'w') as f:
            for token in self.vocab:
                f.write(f"{token}")

        # This is going to register the file into .nemo!
        # When you later use .save_to(), it will copy this file into the tar file.
        self.register_artifact('vocab_file', 'vocab.txt')

    def setup_validation_data(self, val_data_config: OmegaConf):
        vocab_file = self.register_artifact('vocab_file', 'vocab.txt')

        with open(vocab_file, 'r') as f:
            vocab = f.read().split()[:-1]  # Split based on whitespace characters
            self.vocab = vocab

        self._validation_dl = self._setup_data_loader(val_data_config)

    def setup_test_data(self, test_data_config: OmegaConf):
        # This is going to try to find the same file, and if it fails,
        # it will use the copy in .nemo
        vocab_file = self.register_artifact('vocab_file', 'vocab.txt')

        with open(vocab_file, 'r') as f:
            # Split on whitespace; the [:-1] is carried over from the tutorial line and drops the last item
            vocab = f.read().split()[:-1]
            self.vocab = vocab

        self._test_dl = self._setup_data_loader(test_data_config)
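
One caveat with the snippet above: it writes the tokens to vocab.txt with nothing between them and then reads them back with a whitespace split, so the ValueError goes away but the vocabulary may not survive the round trip. The sketch below is an illustration under stated assumptions, not the tutorial's method: the character-level vocab is hypothetical, and the JSON-based helpers (save_vocab_json, load_vocab_json) are names introduced here as one possible alternative that needs no separator at all.

```python
import json
import os
import tempfile


def save_vocab_concatenated(vocab, path):
    # Mirrors the snippet above: tokens are written back to back, no separator.
    with open(path, 'w') as f:
        for token in vocab:
            f.write(f"{token}")


def load_vocab_whitespace(path):
    # Mirrors the snippet above: whitespace split, then drop the last item.
    with open(path, 'r') as f:
        return f.read().split()[:-1]


def save_vocab_json(vocab, path):
    # Hypothetical alternative: JSON keeps token boundaries and whitespace tokens.
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(list(vocab), f)


def load_vocab_json(path):
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)


# Hypothetical character-level vocab that includes whitespace tokens.
vocab = ['\n', ' ', 'a', 'b', 'c']

with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, 'vocab.txt')
    json_path = os.path.join(tmp, 'vocab.json')

    save_vocab_concatenated(vocab, txt_path)
    lossy = load_vocab_whitespace(txt_path)

    save_vocab_json(vocab, json_path)
    restored = load_vocab_json(json_path)

print(lossy)              # []
print(restored == vocab)  # True
```

In this example the file contains a newline, a space, and then "abc", so the whitespace split yields the single item "abc" and the [:-1] slice discards it. If the real vocab has no whitespace tokens the result differs, but the tokens are still merged into one string. The empty separator in the traceback and the double space in the "dangling  token" comment suggest that an explicit separator string was lost before the code ran; that is a guess and is not confirmed anywhere in this thread.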