Closed: othertea closed this PR 1 year ago
Thanks for the review @pascalnotin ! I've updated the PR with the requested changes, let me know if it looks good.
@pascalnotin yes, thank you for catching that! It's been fixed.
Thank you @othertea -- looks good to me, merged the PR into main.
Summary
This PR contains a variety of updates to the training process, mainly:
Additional changes
In addition to the above, other changes include:
- `get_training_args` returns a HuggingFace `TrainingArguments` instead of a custom pydantic `BaseModel`, which had been suggested here: https://github.com/OpenBioML/protein-lm-scaling/pull/26#discussion_r1299492376. We can still turn it back into a `BaseModel` later when we start going beyond what is supported by HuggingFace, but I kept having to use more `TrainingArguments` fields, which required adding them to our custom `BaseModel` (see the sketch after this list).
- Set `ddp_find_unused_parameters` to `false` in the `toy_hf.yaml` config to avoid the message: `Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())`
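To make the first change concrete, here is a minimal sketch of what such a `get_training_args` could look like. The config keys used below are hypothetical and only illustrate passing parsed YAML values straight through to HuggingFace's `TrainingArguments` (including the `ddp_find_unused_parameters` flag from the second item); it is not the repo's actual implementation.

```python
from transformers import TrainingArguments


def get_training_args(config: dict) -> TrainingArguments:
    """Build a HuggingFace TrainingArguments directly from the parsed YAML
    config, instead of first mirroring each field in a custom pydantic
    BaseModel. The field names below are illustrative, not the repo's keys."""
    return TrainingArguments(
        output_dir=config["output_dir"],
        num_train_epochs=config.get("num_train_epochs", 1),
        per_device_train_batch_size=config.get("per_device_train_batch_size", 8),
        # Any TrainingArguments field can now be forwarded without first
        # adding it to a custom BaseModel.
        ddp_find_unused_parameters=config.get("ddp_find_unused_parameters", False),
    )
```

With this shape, setting `ddp_find_unused_parameters: false` in `toy_hf.yaml` flows through unchanged, which is what silences the DDP warning quoted above.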
Yet to do
- There is still the warning `Parameter 'function'=<function get_dataset.<locals>.<lambda> at 0x7f51080e2d30> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.`, which I believe will be resolved by merging #48 and properly incorporating that tokenizer (a short sketch of the underlying issue follows).
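For context, a small sketch of why `datasets` emits this warning: the cache fingerprint is computed by pickling (via dill) the function passed to `.map()`, and the warning shows that `get_dataset` currently passes a local lambda, which cannot be hashed when it closes over a non-picklable object. The names below are hypothetical, not the repo's actual code.

```python
from datasets import Dataset

data = Dataset.from_dict({"sequence": ["MKVL", "MAAG"]})

# Problematic pattern: a lambda closing over a tokenizer object. If that
# object cannot be pickled, datasets falls back to a random fingerprint and
# recomputes the map on every run (the warning quoted above).
# tokenized = data.map(lambda x: {"input_ids": tokenizer.encode(x["sequence"])})

# A fingerprint-friendly alternative: a top-level function plus fn_kwargs,
# which can be hashed provided the tokenizer itself is picklable.
def tokenize_example(example, tokenizer):
    return {"input_ids": tokenizer.encode(example["sequence"])}

# tokenized = data.map(tokenize_example, fn_kwargs={"tokenizer": tokenizer})
```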