CHTC / templates-GPUs

Template job submissions using GPUs in CHTC
MIT License

Add LLM example #28

Closed: JasonLo closed this 12 months ago

JasonLo commented 1 year ago

Thanks for the amazing OSG 2023 workshop. Hopefully this example is helpful.

agitter commented 1 year ago

Thanks a lot for the contribution @JasonLo. Fine-tuning LLMs should be a high-demand example.

We'll discuss who can review this.

agitter commented 1 year ago

Thanks for continuing to make changes @JasonLo.

This comment from my testing was buried in a commit suggestion. When I removed the --use_wandb argument, W&B logging was still enabled. My error file contained:

wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: ERROR Error while calling W&B API: user is not logged in (<Response [401]>)
wandb: ERROR The API key you provided is either invalid or missing.  If the `WANDB_API_KEY` environment variable is set, make sure it is correct. Otherwise, to resolve this issue, you may try running the 'wandb login --relogin' command. If you are using a local server, make sure that you're using the correct hostname. If you're not sure, you can try logging in again using the 'wandb login --relogin --host [hostname]' command.(Error 401: Unauthorized)
Traceback (most recent call last):
  File "train.py", line 84, in <module>
    main()
  File "train.py", line 80, in main
    train(args.run_name, args.use_wandb)
  File "train.py", line 61, in train
    trainer.train(resume_from_checkpoint=last_checkpoint)
  File "/transformers/src/transformers/trainer.py", line 1544, in train
    return inner_training_loop(
  File "/transformers/src/transformers/trainer.py", line 1760, in _inner_training_loop
    self.control = self.callback_handler.on_train_begin(args, self.state, self.control)
  File "/transformers/src/transformers/trainer_callback.py", line 353, in on_train_begin
    return self.call_event("on_train_begin", args, state, control)
  File "/transformers/src/transformers/trainer_callback.py", line 397, in call_event
    result = getattr(callback, event)(
  File "/transformers/src/transformers/integrations.py", line 760, in on_train_begin
    self.setup(args, state, model, **kwargs)
  File "/transformers/src/transformers/integrations.py", line 734, in setup
    self._wandb.init(
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/wandb_init.py", line 1166, in init
    raise e
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/wandb_init.py", line 1147, in init
    run = wi.init()
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/wandb_init.py", line 762, in init
    raise error
wandb.errors.AuthenticationError: The API key you provided is either invalid or missing.  If the `WANDB_API_KEY` environment variable is set, make sure it is correct. Otherwise, to resolve this issue, you may try running the 'wandb login --relogin' command. If you are using a local server, make sure that you're using the correct hostname. If you're not sure, you can try logging in again using the 'wandb login --relogin --host [hostname]' command.(Error 401: Unauthorized)

Have you gotten this to run without the W&B logging?

JasonLo commented 1 year ago

This is fixed in c759b99

I've tested it, and wandb is no longer activated when --use_wandb is omitted. ID of the job run with c759b99: 17010138
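
For readers hitting the same traceback: a minimal sketch of one way to make a --use_wandb flag actually control logging. This is an illustration under assumptions, not necessarily what c759b99 does. In many transformers 4.x releases, TrainingArguments defaults to reporting to every installed integration, so an installed wandb package can be activated even when no flag is passed. The helper names build_parser and report_to_value are hypothetical; only the --run_name and --use_wandb arguments come from the traceback above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical argument parser mirroring the two arguments seen in the traceback."""
    parser = argparse.ArgumentParser(description="Fine-tuning demo (illustrative sketch)")
    parser.add_argument("--run_name", default="demo")
    # store_true: the flag is False unless --use_wandb is passed explicitly.
    parser.add_argument("--use_wandb", action="store_true")
    return parser


def report_to_value(use_wandb: bool) -> str:
    """Return the value to pass as TrainingArguments(report_to=...).

    "none" disables all logging integrations, so wandb is never
    initialized unless the user asked for it.
    """
    return "wandb" if use_wandb else "none"


if __name__ == "__main__":
    args = build_parser().parse_args()
    # In a training script, this value would be forwarded to
    # transformers.TrainingArguments via its report_to parameter.
    print(report_to_value(args.use_wandb))
```

Setting report_to explicitly in both branches is the point of the sketch: the behavior then does not depend on which logging packages happen to be installed in the container.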

JasonLo commented 1 year ago

@agitter @ChristinaLK No more changes from my end. Your team can handle the merge.

JasonLo commented 12 months ago

> I didn't see any training convergence criteria. Does this continue saving checkpoints and training until the short GPU job limit is reached? That's fine and doesn't need to prevent merging.

I set the training to run for only 1 epoch so that it serves as a demonstration without consuming too many resources. This setting is in train.py.
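
For context on the checkpoint question: the traceback earlier in this thread shows train.py calling trainer.train(resume_from_checkpoint=last_checkpoint). A rough sketch of how such a last_checkpoint value could be located is below. It assumes the Trainer's usual checkpoint-<step> directory naming; the function name find_last_checkpoint and the output_dir argument are hypothetical and not taken from the PR.

```python
import os
import re
from typing import Optional

# The Trainer saves checkpoints in folders named "checkpoint-<global step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir: str) -> Optional[str]:
    """Return the checkpoint folder with the highest step number, or None."""
    if not os.path.isdir(output_dir):
        return None
    found = []
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            # Compare steps numerically so checkpoint-1000 beats checkpoint-500.
            found.append((int(match.group(1)), name))
    if not found:
        return None
    return os.path.join(output_dir, max(found)[1])
```

If the function returns None, passing it as resume_from_checkpoint starts training from scratch. Otherwise a job that was interrupted and restarted picks up from the latest saved step, and the 1-epoch setting bounds the total amount of training.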