NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

[BUG] Unittests rely on data from internal CI machines #7929

Closed okuchaiev closed 10 months ago

okuchaiev commented 11 months ago

Describe the bug Many unit tests in the NLP collection (over 10) require the correct version of the /home/TestData folder (from internal CI machines) to be present in order to run successfully.

This makes it impossible to run the unit tests successfully anywhere but on internal NVIDIA CI machines.

To Reproduce Clone NeMo on a new machine in a clean environment and run pytest tests/collections/nlp. Make sure the machine does not have a /home/TestData folder.

Expected behavior The unit tests invoked by pytest should run successfully everywhere, not only on CI machines; e.g., an external developer or contributor should be able to run them.

Stack trace/logs

library = 'sentencepiece', model_name = None, tokenizer_model = '/home/TestData/nlp/megatron_sft/tokenizer.model', vocab_file = None, merges_file = None, special_tokens = None, use_fast = False, bpe_dropout = 0.0, r2l = False, legacy = False, delimiter = None

    def get_nmt_tokenizer(
        library: str = 'yttm',
        model_name: Optional[str] = None,
        tokenizer_model: Optional[str] = None,
        vocab_file: Optional[str] = None,
        merges_file: Optional[str] = None,
        special_tokens: Optional[Dict[str, str]] = None,
        use_fast: Optional[bool] = False,
        bpe_dropout: Optional[float] = 0.0,
        r2l: Optional[bool] = False,
        legacy: Optional[bool] = False,
        delimiter: Optional[str] = None,
    ):
        """
        Args:
            model_name: if using a pretrained model from NeMo, HuggingFace, or Megatron
            tokenizer_model: tokenizer model file of sentencepiece or youtokentome
            special_tokens: dict of special tokens
            vocab_file: path to vocab file
            use_fast: (only for HuggingFace AutoTokenizer) set to True to use fast HuggingFace tokenizer
            bpe_dropout: (only supported by YTTM tokenizer) BPE dropout tries to corrupt the standard segmentation procedure
                of BPE to help model better learn word compositionality and become robust to segmentation errors.
                It has empirically been shown to improve inference time BLEU scores.
            r2l: Whether to return subword IDs from right to left
        """
        if special_tokens is None:
            special_tokens_dict = {}
        else:
            special_tokens_dict = special_tokens

        if (library != 'byte-level') and (
            model_name is None and (tokenizer_model is None or not os.path.isfile(tokenizer_model))
        ):
>           raise ValueError("No Tokenizer path provided or file does not exist!")
E           ValueError: No Tokenizer path provided or file does not exist!

nemo/collections/nlp/modules/common/tokenizer_utils.py:176: ValueError
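To make the failure mode concrete, here is a minimal stand-alone sketch of the guard that raises above (simplified and renamed for illustration; `check_tokenizer_inputs` is not NeMo's API): for any library other than `'byte-level'`, the caller must supply either a pretrained model name or a tokenizer model file that actually exists on disk, so a path under the internal-only /home/TestData folder trips the ValueError on any other machine.

```python
import os
from typing import Optional


def check_tokenizer_inputs(
    library: str,
    model_name: Optional[str],
    tokenizer_model: Optional[str],
) -> None:
    """Mimic the validation in get_nmt_tokenizer (illustrative only)."""
    # A non-byte-level tokenizer needs either a pretrained model name
    # or an existing tokenizer model file on the local filesystem.
    if library != 'byte-level' and (
        model_name is None
        and (tokenizer_model is None or not os.path.isfile(tokenizer_model))
    ):
        raise ValueError("No Tokenizer path provided or file does not exist!")
```

Calling this with `library='sentencepiece'` and a tokenizer path that only exists on the internal CI machines raises exactly the error shown in the trace.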

Environment (please complete the following information):

PyTorch version 2.* CUDA version NCCL version

Proposed fix

I propose that these tests either be skipped when the /home/TestData folder isn't found, or be rewritten.
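One way to implement the skip option, sketched here with the stdlib's `unittest.skipUnless` (pytest's `pytest.mark.skipif` works the same way; the test class and test name below are hypothetical, not existing NeMo tests):

```python
import os
import unittest

# Internal CI data folder referenced by the affected tests
TEST_DATA_DIR = "/home/TestData"


def skip_without_test_data(test_item):
    """Skip a test (or test class) when the internal CI data folder is absent."""
    return unittest.skipUnless(
        os.path.isdir(TEST_DATA_DIR),
        f"{TEST_DATA_DIR} not found; requires internal NVIDIA CI data",
    )(test_item)


@skip_without_test_data
class TokenizerDataTest(unittest.TestCase):
    def test_sft_tokenizer_model_exists(self):
        path = os.path.join(TEST_DATA_DIR, "nlp", "megatron_sft", "tokenizer.model")
        self.assertTrue(os.path.isfile(path))
```

On a machine without /home/TestData the test is reported as skipped rather than erroring out, which is what an external contributor running `pytest tests/collections/nlp` would expect.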


ericharper commented 11 months ago

Fix for this is #7943

github-actions[bot] commented 10 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 10 months ago

This issue was closed because it has been inactive for 7 days since being marked as stale.

BlueCloudDev commented 5 months ago
Traceback (most recent call last):
  File "/opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py", line 46, in <module>
    main()
  File "/opt/NeMo/nemo/core/config/hydra_runner.py", line 129, in wrapper
    _run_hydra(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 216, in run_and_report
    raise ex
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 453, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py", line 40, in main
    model = MegatronGPTModel(cfg.model, trainer)
  File "/opt/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 273, in __init__
    super().__init__(cfg, trainer=trainer, no_lm_init=True)
  File "/opt/NeMo/nemo/collections/nlp/models/language_modeling/megatron_base_model.py", line 221, in __init__
    self._build_tokenizer()
  File "/opt/NeMo/nemo/collections/nlp/models/language_modeling/megatron_base_model.py", line 421, in _build_tokenizer
    self.tokenizer = get_nmt_tokenizer(
  File "/opt/NeMo/nemo/collections/nlp/modules/common/tokenizer_utils.py", line 175, in get_nmt_tokenizer
    raise ValueError("No Tokenizer path provided or file does not exist!")
ValueError: No Tokenizer path provided or file does not exist!

I'm getting this error when trying to run Nemotron. Has this issue been resolved? How can I ensure the tokenizer is loaded?
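In case it helps others who land here: in the traceback, `_build_tokenizer` passes the tokenizer settings from the model config to `get_nmt_tokenizer`, so this error usually means the configured tokenizer model path does not exist on the local machine. A hedged example of overriding it on the command line (the exact config keys depend on your NeMo version and config file, and the paths below are placeholders):

```shell
# Point the pretraining config at a tokenizer model that exists locally.
# model.tokenizer.* keys mirror the fields read in the trace above.
python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
    model.tokenizer.library=sentencepiece \
    model.tokenizer.model=/path/to/your/tokenizer.model
```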