Closed: ndronen closed this issue 11 months ago.
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.
Describe the bug
The instructions in the HuggingFace model card for launching the SteerLM eval server are incorrect.
Steps/Code to reproduce bug
Using LLAMA2-13B-SteerLM.nemo:

```shell
python megatron_gpt_eval.py \
    gpt_model_file=LLAMA2-13B-SteerLM.nemo \
    trainer.precision=16 \
    server=True \
    tensor_model_parallel_size=4 \
    trainer.devices=1 \
    pipeline_model_parallel_split_rank=0
```
Expected behavior
The expected behavior is that the eval server starts or, if the system resources are insufficient, an error occurs.
Ideally, the model card would explain how to run the eval server depending on the available GPU memory and number of GPUs. I'd like to be able to run this on a 4xV100 machine.
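For what it's worth, the command above requests tensor_model_parallel_size=4 while launching only trainer.devices=1, and my understanding is that the tensor-parallel degree must be covered by the launched devices. A 4xV100 invocation might therefore look like the sketch below; this is an assumption on my part, not a verified correction from the model card:

```shell
# Hypothetical 4-GPU invocation (assumption, not from the model card):
# trainer.devices is raised to 4 to match tensor_model_parallel_size=4,
# so each of the four tensor-parallel shards gets its own V100.
python megatron_gpt_eval.py \
    gpt_model_file=LLAMA2-13B-SteerLM.nemo \
    trainer.precision=16 \
    server=True \
    tensor_model_parallel_size=4 \
    trainer.devices=4 \
    pipeline_model_parallel_split_rank=0
```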
Instead, I see the following:
Environment overview
- Method of NeMo install: from source (`pip install -e .`), HEAD detached at v1.17.0
- If method of install is Docker, provide `docker pull` & `docker run` commands used

Environment details
Additional context
Example: 4xV100