Describe the bug
A number of unit tests in the NLP collection (over 10) require the correct version of the /home/TestData folder (from internal CI machines) to be present in order to run successfully.
This makes it impossible to run the unit tests successfully anywhere other than on internal NVIDIA CI machines.
To Reproduce
Clone NeMo on a new machine in a clean environment and try running pytest tests/collections/nlp. Make sure you do not have a /home/TestData folder on the machine.
Expected behavior
Unit tests run via the pytest command should pass successfully, not only on CI machines.
E.g. an external developer/contributor should be able to run the unit tests.
Stack trace/logs
library = 'sentencepiece', model_name = None, tokenizer_model = '/home/TestData/nlp/megatron_sft/tokenizer.model', vocab_file = None, merges_file = None, special_tokens = None, use_fast = False, bpe_dropout = 0.0, r2l = False, legacy = False, delimiter = None
    def get_nmt_tokenizer(
        library: str = 'yttm',
        model_name: Optional[str] = None,
        tokenizer_model: Optional[str] = None,
        vocab_file: Optional[str] = None,
        merges_file: Optional[str] = None,
        special_tokens: Optional[Dict[str, str]] = None,
        use_fast: Optional[bool] = False,
        bpe_dropout: Optional[float] = 0.0,
        r2l: Optional[bool] = False,
        legacy: Optional[bool] = False,
        delimiter: Optional[str] = None,
    ):
        """
        Args:
            model_name: if using a pretrained model from NeMo, HuggingFace, or Megatron
            tokenizer_model: tokenizer model file of sentencepiece or youtokentome
            special_tokens: dict of special tokens
            vocab_file: path to vocab file
            use_fast: (only for HuggingFace AutoTokenizer) set to True to use fast HuggingFace tokenizer
            bpe_dropout: (only supported by YTTM tokenizer) BPE dropout tries to corrupt the standard segmentation procedure
                of BPE to help model better learn word compositionality and become robust to segmentation errors.
                It has empirically been shown to improve inference time BLEU scores.
            r2l: Whether to return subword IDs from right to left
        """
        if special_tokens is None:
            special_tokens_dict = {}
        else:
            special_tokens_dict = special_tokens
        if (library != 'byte-level') and (
            model_name is None and (tokenizer_model is None or not os.path.isfile(tokenizer_model))
        ):
>           raise ValueError("No Tokenizer path provided or file does not exist!")
E           ValueError: No Tokenizer path provided or file does not exist!

nemo/collections/nlp/modules/common/tokenizer_utils.py:176: ValueError
Environment (please complete the following information):
Megatron-LM commit ID
PyTorch version 2.*
CUDA version
NCCL version
Proposed fix
I propose that the affected tests either be skipped when the /home/TestData folder isn't found, or be rewritten so that they do not depend on it.
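The skip option could be sketched with a standard pytest skipif marker. This is a minimal illustration, not NeMo's actual test code: the test name and the tokenizer-model path inside the data folder are taken from the stack trace above, and any such guard (or a shared fixture/conftest equivalent) would need to be applied to each affected test.

```python
import os

import pytest

# Internal CI data location hard-coded by the affected tests (from the bug report).
TEST_DATA_DIR = "/home/TestData"


# Hypothetical example test: skip (rather than fail) when the internal
# CI data folder is not present on the machine running pytest.
@pytest.mark.skipif(
    not os.path.isdir(TEST_DATA_DIR),
    reason="Requires /home/TestData from internal NVIDIA CI machines",
)
def test_megatron_sft_tokenizer():
    # Path taken from the stack trace in this report.
    tokenizer_model = os.path.join(TEST_DATA_DIR, "nlp/megatron_sft/tokenizer.model")
    assert os.path.isfile(tokenizer_model)
```

On machines without /home/TestData, pytest would then report these tests as skipped instead of erroring out with the ValueError shown above.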
Additional context
Add any other context about the problem here.