Closed JasonLo closed 12 months ago
Thanks a lot for the contribution @JasonLo. Fine-tuning LLMs should be a high-demand example.
We'll discuss who can review this.
Thanks for continuing to make changes @JasonLo.
This comment from my testing was buried in a commit suggestion. When I removed the `--use_wandb`
argument for my testing, it did not disable logging. My error file contained:
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: ERROR Error while calling W&B API: user is not logged in (<Response [401]>)
wandb: ERROR The API key you provided is either invalid or missing. If the `WANDB_API_KEY` environment variable is set, make sure it is correct. Otherwise, to resolve this issue, you may try running the 'wandb login --relogin' command. If you are using a local server, make sure that you're using the correct hostname. If you're not sure, you can try logging in again using the 'wandb login --relogin --host [hostname]' command.(Error 401: Unauthorized)
Traceback (most recent call last):
  File "train.py", line 84, in <module>
    main()
  File "train.py", line 80, in main
    train(args.run_name, args.use_wandb)
  File "train.py", line 61, in train
    trainer.train(resume_from_checkpoint=last_checkpoint)
  File "/transformers/src/transformers/trainer.py", line 1544, in train
    return inner_training_loop(
  File "/transformers/src/transformers/trainer.py", line 1760, in _inner_training_loop
    self.control = self.callback_handler.on_train_begin(args, self.state, self.control)
  File "/transformers/src/transformers/trainer_callback.py", line 353, in on_train_begin
    return self.call_event("on_train_begin", args, state, control)
  File "/transformers/src/transformers/trainer_callback.py", line 397, in call_event
    result = getattr(callback, event)(
  File "/transformers/src/transformers/integrations.py", line 760, in on_train_begin
    self.setup(args, state, model, **kwargs)
  File "/transformers/src/transformers/integrations.py", line 734, in setup
    self._wandb.init(
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/wandb_init.py", line 1166, in init
    raise e
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/wandb_init.py", line 1147, in init
    run = wi.init()
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/wandb_init.py", line 762, in init
    raise error
wandb.errors.AuthenticationError: The API key you provided is either invalid or missing. If the `WANDB_API_KEY` environment variable is set, make sure it is correct. Otherwise, to resolve this issue, you may try running the 'wandb login --relogin' command. If you are using a local server, make sure that you're using the correct hostname. If you're not sure, you can try logging in again using the 'wandb login --relogin --host [hostname]' command.(Error 401: Unauthorized)
Have you gotten this to run without the W&B logging?
This is fixed in c759b99
I've tested it; now it won't accidentally activate wandb. Job ID run with c759b99: 17010138
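For readers hitting the same traceback: a minimal sketch of one way to keep W&B off unless `--use_wandb` is passed. This is not necessarily what c759b99 does. It assumes the script builds a Hugging Face `TrainingArguments`, whose `report_to` parameter selects logging integrations; in transformers versions like the one in the traceback, leaving it unset can enable every installed integration, including wandb. The helper name `reporting_kwargs` and the default run name are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the two options that appear in the traceback's train() call."""
    parser = argparse.ArgumentParser()
    # "demo-run" is a placeholder default, not taken from the PR.
    parser.add_argument("--run_name", default="demo-run")
    # Off by default: W&B logging must be requested explicitly.
    parser.add_argument("--use_wandb", action="store_true")
    return parser.parse_args(argv)


def reporting_kwargs(use_wandb):
    """Return keyword arguments intended for transformers.TrainingArguments.

    Setting report_to to "none" explicitly avoids relying on the library
    default, which may pick up wandb just because it is installed.
    """
    return {"report_to": "wandb" if use_wandb else "none"}


if __name__ == "__main__":
    args = parse_args()
    # In train.py this dict would be unpacked as, for example,
    # TrainingArguments(output_dir=..., run_name=args.run_name, **kwargs).
    kwargs = reporting_kwargs(args.use_wandb)
    print(kwargs)
```

The point of the design is that omitting the flag maps to an explicit `"none"` rather than to an unset value, so no API key is needed for a plain test run.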
@agitter @ChristinaLK No more changes from my end. Your team can handle the merge.
I didn't see any training convergence criteria. Does this continue saving checkpoints and training until the short GPU job limit is reached? That's fine and doesn't need to prevent merging.
I set the training to run for only 1 epoch to serve as a demonstration without consuming too many resources. You can see this setting in the code here: GitHub Link to train.py.
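For context on how a one-epoch run can still span several short GPU jobs: the traceback shows `trainer.train(resume_from_checkpoint=last_checkpoint)`, so each job presumably picks up from the newest saved checkpoint. The sketch below is an assumption about how `last_checkpoint` could be found, not the code in train.py. It relies only on the Hugging Face Trainer convention of saving to `checkpoint-<step>` subdirectories; the function name `find_last_checkpoint` is hypothetical.

```python
import os
import re

# The Trainer writes checkpoints as "<output_dir>/checkpoint-<global_step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir):
    """Return the path of the highest-step checkpoint directory, or None."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path):
            step = int(match.group(1))
            # Compare numerically so checkpoint-1000 beats checkpoint-500.
            if step > best_step:
                best_step, best_path = step, path
    return best_path
```

Passing the result to `trainer.train(resume_from_checkpoint=...)` resumes when a checkpoint exists and starts fresh when the value is `None`. With the epoch count fixed at 1 in the training arguments, the run ends once that epoch completes rather than continuing until the job time limit.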
Thanks for the amazing OSG 2023 workshop. Hopefully this example is helpful.