Haiyang-W / TokenFormer

Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
https://haiyang-w.github.io/tokenformer.github.io/
Apache License 2.0
384 stars 23 forks

Lacking instructions for inference #7

Open kroggen opened 2 weeks ago

kroggen commented 2 weeks ago

I am using this command to try inference:

python generate.py --load pytorch_model.bin --tokenizer-type HFTokenizer --vocab-file tokenizer.json --text_gen_type interactive --temperature 0.0 --maximum_tokens 200 configs/tokenformer/1-5B_eval.yml

But it fails with:

[2024-11-07 01:10:21,644] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/opt/conda/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/opt/conda/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/opt/conda/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
Unable to import Mamba kernels. Install them from our requirements/requirements-mamba.txt, or directly from https://github.com/state-spaces/mamba
For s3 checkpointing, please install boto3 either using requirements/requirements-s3.txt or https://github.com/boto/boto3
For s3 checkpointing, please install hf_transfer either using requirements/requirements-s3.txt or https://github.com/huggingface/hf_transfer

Traceback (most recent call last):
  File "/root/TokenFormer/generate.py", line 93, in <module>
    main()
  File "/root/TokenFormer/generate.py", line 33, in main
    model, neox_args = setup_for_inference_or_eval(
  File "/root/TokenFormer/megatron/utils.py", line 425, in setup_for_inference_or_eval
    from megatron.training import setup_model_and_optimizer, setup_model_for_eval
  File "/root/TokenFormer/megatron/training.py", line 50, in <module>
    from megatron.data.data_utils import build_train_valid_test_data_iterators
ModuleNotFoundError: No module named 'megatron.data'

There is no `data` folder inside the `megatron` folder. And I suspect that `build_train_valid_test_data_iterators` is not required for inference anyway...

Please share working instructions for inference.
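In case it helps anyone hitting the same error before the repo is fixed: a temporary stop-gap is to stub out the missing `megatron.data` package so the import resolves. This is a hypothetical sketch (untested against the real code paths); the stubbed function names are taken from the traceback above, and the stubs are only safe for inference, not training.

```python
# Create stub modules for the missing megatron/data package so that
# "from megatron.data.data_utils import build_train_valid_test_data_iterators"
# resolves at import time. Inference-only workaround.
from pathlib import Path

data_dir = Path("megatron/data")
data_dir.mkdir(parents=True, exist_ok=True)

# Make both directories importable as packages (no-op if files already exist).
Path("megatron/__init__.py").touch()
(data_dir / "__init__.py").touch()

(data_dir / "data_utils.py").write_text('''\
# Stubs for symbols imported by megatron/training.py.
def build_train_valid_test_data_iterators(*args, **kwargs):
    raise NotImplementedError("data iterators are stubbed out; inference only")

def compile_helper():
    pass  # dataset helper compilation is not needed for pure inference
''')
```

With these stubs in place the import in `megatron/training.py` succeeds, and the code only fails if something actually calls the data-iterator builder.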

Haiyang-W commented 2 weeks ago

I'm very sorry; I haven't tried inference with this code, I've only ensured that eval works without issues. You're welcome to implement this interface on your end and integrate it into our codebase via a PR. :)

SteadySurfdom commented 2 weeks ago

@kroggen I got the same error while trying eval. I found a stand-in for that specific function in NVIDIA's official Megatron-LM repo, here. But after substituting it I ran into a similar error, and this time I wasn't able to find a replacement. Traceback:

Traceback (most recent call last):
  File "eval.py", line 77, in <module>
    main()
  File "eval.py", line 35, in main
    model, neox_args = setup_for_inference_or_eval(
  File "/home/steadysurfdom/Research/TokenFormer/megatron/utils.py", line 455, in setup_for_inference_or_eval
    initialize_megatron(neox_args)
  File "/home/steadysurfdom/Research/TokenFormer/megatron/initialize.py", line 107, in initialize_megatron
    from megatron.data.data_utils import compile_helper
ModuleNotFoundError: No module named 'megatron.data'
[2024-11-10 12:28:49,719] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 85925

Not very useful, but worth adding to the conversation, I guess.

kroggen commented 2 weeks ago

Maybe @Haiyang-W can fix the code for eval first, as this is required for anyone interested in checking the results from the paper.

My intention was to compare the output of my minimal implementation with the main model, to check whether it was done correctly. The output is not very good. Maybe that is because it was trained on few tokens? There are other 1.5B models that are pretty good.

Anyway, it would be good to have at least a working method for evaluation.

Haiyang-W commented 2 weeks ago

> @kroggen I got the same error while trying eval. I found a stand-in for that specific function in NVIDIA's official Megatron-LM repo, here. But after substituting it I ran into a similar error, and this time I wasn't able to find a replacement. Traceback:
>
> Traceback (most recent call last):
>   File "eval.py", line 77, in <module>
>     main()
>   File "eval.py", line 35, in main
>     model, neox_args = setup_for_inference_or_eval(
>   File "/home/steadysurfdom/Research/TokenFormer/megatron/utils.py", line 455, in setup_for_inference_or_eval
>     initialize_megatron(neox_args)
>   File "/home/steadysurfdom/Research/TokenFormer/megatron/initialize.py", line 107, in initialize_megatron
>     from megatron.data.data_utils import compile_helper
> ModuleNotFoundError: No module named 'megatron.data'
> [2024-11-10 12:28:49,719] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 85925
>
> Not very useful, but worth adding to the conversation, I guess.

Sorry about that: due to the .gitignore, I missed the data dir in megatron. It's fixed now. If you have any questions, feel free to tell me.

Haiyang-W commented 2 weeks ago

If anyone encounters problems with eval, please raise an issue right away and I will reply as soon as possible (within one day).

Haiyang-W commented 2 weeks ago

> Maybe @Haiyang-W can fix the code for eval first, as this is required for anyone interested in checking the results from the paper.
>
> My intention was to compare the output of my minimal implementation with the main model, to check whether it was done correctly. The output is not very good. Maybe that is because it was trained on few tokens? There are other 1.5B models that are pretty good.
>
> Anyway, it would be good to have at least a working method for evaluation.

Can you check again? I have uploaded the data dir. Very sorry for missing that part due to the .gitignore file. :)

kroggen commented 2 weeks ago

Yes, the eval works now. Thanks!

This is the result for the 150M model:

 'results': {'arc_challenge': {'acc,none': 0.19880546075085323,
                               'acc_norm,none': 0.24573378839590443,
                               'acc_norm_stderr,none': 0.012581033453730107,
                               'acc_stderr,none': 0.011662850198175543},
             'arc_easy': {'acc,none': 0.476010101010101,
                          'acc_norm,none': 0.4187710437710438,
                          'acc_norm_stderr,none': 0.01012348716016781,
                          'acc_stderr,none': 0.010247967392742684},
             'hellaswag': {'acc,none': 0.30989842660824535,
                           'acc_norm,none': 0.35480979884485164,
                           'acc_norm_stderr,none': 0.004774778180345218,
                           'acc_stderr,none': 0.004615063817741879},
             'lambada_openai': {'acc,none': 0.4506112943916165,
                                'acc_stderr,none': 0.006931910914621461,
                                'perplexity,none': 16.382835662797227,
                                'perplexity_stderr,none': 0.5351178238219353},
             'piqa': {'acc,none': 0.6528835690968444,
                      'acc_norm,none': 0.6441784548422198,
                      'acc_norm_stderr,none': 0.011170294934656941,
                      'acc_stderr,none': 0.011107104993128088},
             'winogrande': {'acc,none': 0.5043409629044988,
                            'acc_stderr,none': 0.014051956064076903}},

And for the 1.5B model:

 'results': {'arc_challenge': {'acc,none': 0.3037542662116041,
                               'acc_norm,none': 0.3216723549488055,
                               'acc_norm_stderr,none': 0.013650488084494166,
                               'acc_stderr,none': 0.013438909184778766},
             'arc_easy': {'acc,none': 0.648989898989899,
                          'acc_norm,none': 0.5976430976430976,
                          'acc_norm_stderr,none': 0.010062244711011532,
                          'acc_stderr,none': 0.009793703885101042},
             'hellaswag': {'acc,none': 0.45339573790081655,
                           'acc_norm,none': 0.5986855208125871,
                           'acc_norm_stderr,none': 0.004891626718097012,
                           'acc_stderr,none': 0.004968058944472159},
             'lambada_openai': {'acc,none': 0.646613623132156,
                                'acc_stderr,none': 0.006659772589635509,
                                'perplexity,none': 5.2395373300285355,
                                'perplexity_stderr,none': 0.12567420744987814},
             'piqa': {'acc,none': 0.7453754080522307,
                      'acc_norm,none': 0.7377584330794341,
                      'acc_norm_stderr,none': 0.010262502565172449,
                      'acc_stderr,none': 0.01016443223706048},
             'winogrande': {'acc,none': 0.5951065509076559,
                            'acc_stderr,none': 0.013795927003124943}},

Haiyang-W commented 2 weeks ago

Thanks for checking! Nice.