Closed: othertea closed this PR 1 year ago
Thanks for the review @pascalnotin ! I've updated the PR with the requested changes, let me know if it looks good.
@pascalnotin yes, thank you for catching that! It's been fixed.
Thank you @othertea -- looks good to me, merged the PR into main.
Summary
This PR contains a variety of updates to the training process, mainly:
Additional changes
In addition to the above, other changes include:
- `get_training_args` returns a HuggingFace `TrainingArguments` instead of a custom pydantic `BaseModel`, which had been suggested here: https://github.com/OpenBioML/protein-lm-scaling/pull/26#discussion_r1299492376. We can still turn it back into a `BaseModel` later when we start going beyond what is supported by HuggingFace, but I kept having to use more `TrainingArguments` fields, which required adding them to our custom `BaseModel` (see the sketch after this list).
- Set `ddp_find_unused_parameters` to `false` in the `toy_hf.yaml` config to avoid the message: `Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())`
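To make the first change concrete, here is a minimal sketch of what such a `get_training_args` could look like. The config keys used below are hypothetical and only illustrate passing parsed YAML values straight through to HuggingFace's `TrainingArguments` (including the `ddp_find_unused_parameters` flag from the second item); it is not the repo's actual implementation.

```python
from transformers import TrainingArguments


def get_training_args(config: dict) -> TrainingArguments:
    """Build a HuggingFace TrainingArguments directly from the parsed YAML
    config, instead of first mirroring each field in a custom pydantic
    BaseModel. The field names below are illustrative, not the repo's keys."""
    return TrainingArguments(
        output_dir=config["output_dir"],
        num_train_epochs=config.get("num_train_epochs", 1),
        per_device_train_batch_size=config.get("per_device_train_batch_size", 8),
        # Any TrainingArguments field can now be forwarded without first
        # adding it to a custom BaseModel.
        ddp_find_unused_parameters=config.get("ddp_find_unused_parameters", False),
    )
```

With this shape, setting `ddp_find_unused_parameters: false` in `toy_hf.yaml` flows through unchanged, which is what silences the DDP warning quoted above.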
Yet to do
- There is still the warning `Parameter 'function'=<function get_dataset.<locals>.<lambda> at 0x7f51080e2d30> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.`, which I believe will be resolved by merging #48 and properly incorporating that tokenizer (a short sketch of the underlying issue follows).
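For context, a small sketch of why `datasets` emits this warning: the cache fingerprint is computed by pickling (via dill) the function passed to `.map()`, and the warning shows that `get_dataset` currently passes a local lambda, which cannot be hashed when it closes over a non-picklable object. The names below are hypothetical, not the repo's actual code.

```python
from datasets import Dataset

data = Dataset.from_dict({"sequence": ["MKVL", "MAAG"]})

# Problematic pattern: a lambda closing over a tokenizer object. If that
# object cannot be pickled, datasets falls back to a random fingerprint and
# recomputes the map on every run (the warning quoted above).
# tokenized = data.map(lambda x: {"input_ids": tokenizer.encode(x["sequence"])})

# A fingerprint-friendly alternative: a top-level function plus fn_kwargs,
# which can be hashed provided the tokenizer itself is picklable.
def tokenize_example(example, tokenizer):
    return {"input_ids": tokenizer.encode(example["sequence"])}

# tokenized = data.map(tokenize_example, fn_kwargs={"tokenizer": tokenizer})
```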