NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[BUG] Unittests for NLP require data on internal CI machines to not fail. #600

Closed okuchaiev closed 11 months ago

okuchaiev commented 11 months ago

Describe the bug
A lot of unit tests in the NLP collection (over 10) require the correct version of the /home/TestData folder (from internal CI machines) to be present in order to run successfully.

This makes it impossible to run the unit tests successfully anywhere other than on internal NVIDIA CI machines.

To Reproduce
Clone NeMo on a new machine in a clean environment and try running pytest tests/collections/nlp. Make sure the machine does not have a /home/TestData folder.

Expected behavior
Unit tests run with the pytest command should pass, not only on CI machines; e.g. an external developer/contributor should be able to run the unit tests.

Stack trace/logs

library = 'sentencepiece', model_name = None, tokenizer_model = '/home/TestData/nlp/megatron_sft/tokenizer.model', vocab_file = None, merges_file = None, special_tokens = None, use_fast = False, bpe_dropout = 0.0, r2l = False, legacy = False, delimiter = None

    def get_nmt_tokenizer(
        library: str = 'yttm',
        model_name: Optional[str] = None,
        tokenizer_model: Optional[str] = None,
        vocab_file: Optional[str] = None,
        merges_file: Optional[str] = None,
        special_tokens: Optional[Dict[str, str]] = None,
        use_fast: Optional[bool] = False,
        bpe_dropout: Optional[float] = 0.0,
        r2l: Optional[bool] = False,
        legacy: Optional[bool] = False,
        delimiter: Optional[str] = None,
    ):
        """
        Args:
            model_name: if using a pretrained model from NeMo, HuggingFace, or Megatron
            tokenizer_model: tokenizer model file of sentencepiece or youtokentome
            special_tokens: dict of special tokens
            vocab_file: path to vocab file
            use_fast: (only for HuggingFace AutoTokenizer) set to True to use fast HuggingFace tokenizer
            bpe_dropout: (only supported by YTTM tokenizer) BPE dropout tries to corrupt the standard segmentation procedure
                of BPE to help model better learn word compositionality and become robust to segmentation errors.
                It has empirically been shown to improve inference time BLEU scores.
            r2l: Whether to return subword IDs from right to left
        """
        if special_tokens is None:
            special_tokens_dict = {}
        else:
            special_tokens_dict = special_tokens

        if (library != 'byte-level') and (
            model_name is None and (tokenizer_model is None or not os.path.isfile(tokenizer_model))
        ):
>           raise ValueError("No Tokenizer path provided or file does not exist!")
E           ValueError: No Tokenizer path provided or file does not exist!

nemo/collections/nlp/modules/common/tokenizer_utils.py:176: ValueError
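
As the trace shows, get_nmt_tokenizer raises as soon as the hard-coded test-data path is missing. A minimal sketch of the trigger (assuming NeMo is installed; the path is the internal CI path taken from the trace, which does not exist on an external machine):

    from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer

    # Hard-coded CI path from the failing test; on an external machine the file
    # does not exist, so the library/model_name/tokenizer_model guard raises.
    tokenizer = get_nmt_tokenizer(
        library='sentencepiece',
        tokenizer_model='/home/TestData/nlp/megatron_sft/tokenizer.model',
    )
    # ValueError: No Tokenizer path provided or file does not exist!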

Environment (please complete the following information):

Proposed fix

I propose that these tests either be skipped when the /home/TestData folder isn't found, or be re-written. A sketch of the skip option follows.
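
A minimal sketch of the skip option using a pytest.mark.skipif guard; TEST_DATA_DIR, requires_test_data, and the test name are illustrative, not existing NeMo fixtures:

    import os
    import pytest

    # Internal CI data root referenced by the failing tests.
    TEST_DATA_DIR = '/home/TestData'

    # Skip (rather than fail) any TestData-dependent test when the folder is absent.
    requires_test_data = pytest.mark.skipif(
        not os.path.isdir(TEST_DATA_DIR),
        reason='requires /home/TestData from internal NVIDIA CI machines',
    )

    @requires_test_data
    def test_megatron_sft_tokenizer():
        ...

This keeps CI behavior unchanged while letting external contributors run pytest tests/collections/nlp without spurious failures.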


okuchaiev commented 11 months ago

oops. wrong repo