NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech).
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0
12.3k stars 2.55k forks

Model card for SteerLM has incorrect command-line arguments #7877

Closed ndronen closed 11 months ago

ndronen commented 1 year ago

Describe the bug

The instructions for launching the SteerLM eval server in the HuggingFace model card are incorrect.

Steps/Code to reproduce bug

  1. Follow instructions to install prerequisites.
  2. Download LLAMA2-13B-SteerLM.nemo.
  3. Run the documented command: python megatron_gpt_eval.py gpt_model_file=LLAMA2-13B-SteerLM.nemo trainer.precision=16 server=True tensor_model_parallel_size=4 trainer.devices=1 pipeline_model_parallel_split_rank=0

Expected behavior

The expected behavior is that the eval server starts or, if system resources are insufficient, a clear error is reported.

Ideally, the model card would also explain how to run the eval server for a given amount of GPU memory and number of GPUs. I'd like to be able to run this on a 4xV100 machine.

Instead, I see the following:

python megatron_gpt_eval.py gpt_model_file=LLAMA2-13B-SteerLM.nemo trainer.precision=16 server=True tensor_model_parallel_size=4 trainer.devices=1 pipeline_model_parallel_split_rank=0
[NeMo W 2023-11-11 02:15:43 experimental:27] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-11-11 02:15:43 experimental:27] Module <class 'nemo.collections.nlp.models.text_normalization_as_tagging.thutmose_tagger.ThutmoseTaggerModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-11-11 02:15:48 nemo_logging:349] /home/ubuntu/venv/lib/python3.10/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
    See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(

Using 16bit None Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Error executing job with overrides: ['gpt_model_file=LLAMA2-13B-SteerLM.nemo', 'trainer.precision=16', 'server=True', 'tensor_model_parallel_size=4', 'trainer.devices=1', 'pipeline_model_parallel_split_rank=0']
Traceback (most recent call last):
  File "/home/ubuntu/NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py", line 161, in main
    cfg.trainer.devices * cfg.trainer.num_nodes
AssertionError: devices * num_nodes should equal tensor_model_parallel_size * pipeline_model_parallel_size
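For reference, the assertion enforces that the number of launched processes covers the model-parallel grid. A minimal sketch of that arithmetic (a hypothetical helper mirroring the check, not NeMo's actual code) shows why the documented command fails with `trainer.devices=1`:

```python
def check_parallel_config(devices, num_nodes, tp_size, pp_size):
    """Mirror of the consistency check in megatron_gpt_eval.py: the number
    of launched processes (devices * num_nodes) must equal the size of the
    model-parallel grid (tp_size * pp_size)."""
    assert devices * num_nodes == tp_size * pp_size, (
        "devices * num_nodes should equal "
        "tensor_model_parallel_size * pipeline_model_parallel_size"
    )

# The documented command: devices=1, num_nodes=1, TP=4, PP=1 -> 1 != 4, fails.
try:
    check_parallel_config(devices=1, num_nodes=1, tp_size=4, pp_size=1)
except AssertionError as e:
    print("fails:", e)

# Launching with trainer.devices=4 instead satisfies the check: 4 * 1 == 4 * 1.
check_parallel_config(devices=4, num_nodes=1, tp_size=4, pp_size=1)
```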

Environment overview

Environment details

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:    22.04
Codename:   jammy
>>> torch.__version__
'2.1.0+cu121'
python -V
Python 3.10.12

Additional context

Example: 4xV100
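Based on the assertion message, a plausible corrected invocation for a single 4-GPU node would set `trainer.devices` to match the model-parallel grid. This is a guess from the error, not a verified model-card fix, and assumes the checkpoint fits in memory when sharded four ways:

```shell
# Hypothetical corrected command for one node with 4 GPUs (e.g. 4xV100):
# devices * num_nodes (4 * 1) equals
# tensor_model_parallel_size * pipeline_model_parallel_size (4 * 1).
python megatron_gpt_eval.py \
    gpt_model_file=LLAMA2-13B-SteerLM.nemo \
    trainer.precision=16 \
    server=True \
    tensor_model_parallel_size=4 \
    pipeline_model_parallel_size=1 \
    trainer.devices=4 \
    trainer.num_nodes=1
```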

github-actions[bot] commented 11 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 11 months ago

This issue was closed because it has been inactive for 7 days since being marked as stale.