Closed jinmang2 closed 2 years ago
Here is the code that applies Optuna to training! I used optuna==2.9.1. Please leave any further discussion points or questions!
import json
import os

import optuna

if __name__ == '__main__':
    if args.optuna:
        def train_optuna(trial):
            # Optuna setting: sample the hyperparameters for this trial
            args.epochs = trial.suggest_int('n_epochs', optuna_epoch_min, optuna_epoch_max)
            args.lr = trial.suggest_loguniform('lr', optuna_lr_min, optuna_lr_max)
            args.optimizer = trial.suggest_categorical('optimizer', optuna_optimizer)
            print(args)
            # The value returned by train() is what the study maximizes
            return train(data_dir, model_dir, args)

        study = optuna.create_study(direction='maximize')
        study.optimize(train_optuna, n_trials=optuna_ntrials)
        # Build the output path once so the log message matches the file actually written
        out_path = os.path.join(model_dir, f'optuna_{args.name}_{study.best_trial.value}.json')
        print(f"Best F1 Val: {study.best_trial.value}\n Params\n {study.best_trial.params}\n Save at {out_path}")
        with open(out_path, 'w', encoding='utf-8') as f:
            json.dump(study.best_trial.params, f, ensure_ascii=False, indent=4)
    else:
        print(args)
        train(data_dir, model_dir, args)
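One possible refinement, sketched under assumptions rather than taken from the original code: the objective above writes the sampled values into the shared args object, so every trial overwrites the settings of the previous one. The helper below (sample_args is a hypothetical name) returns a per-trial copy instead. It also uses suggest_float(..., log=True), which Optuna provides as the newer spelling of suggest_loguniform and which should already be available in optuna 2.9.1.

```python
import argparse


def sample_args(trial, base_args, epoch_range, lr_range, optimizers):
    """Return a per-trial copy of base_args with sampled hyperparameters.

    `trial` only needs the optuna.Trial suggest_* methods used below;
    base_args itself is left untouched.
    """
    trial_args = argparse.Namespace(**vars(base_args))  # shallow copy
    trial_args.epochs = trial.suggest_int('n_epochs', *epoch_range)
    trial_args.lr = trial.suggest_float('lr', *lr_range, log=True)
    trial_args.optimizer = trial.suggest_categorical('optimizer', optimizers)
    return trial_args
```

Inside train_optuna this would be used as train(data_dir, model_dir, sample_args(trial, args, (optuna_epoch_min, optuna_epoch_max), (optuna_lr_min, optuna_lr_max), optuna_optimizer)), assuming train only reads the fields shown here.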
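Token embedding size: the baseline vocabulary is 32000 tokens, but adding the entity special tokens makes it 32004. The first run works, but on the second run the pretrained weights are loaded into a model whose embedding layer size does not match (the checkpoint still has 32000 rows while the model expects 32004), so the run fails.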
...
(pid=22448) File "/opt/conda/envs/basic/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1576, in _load_state_dict_into_model
(pid=22448) raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
(pid=22448) RuntimeError: Error(s) in loading state_dict for RobertaForSequenceClassification:
(pid=22448) size mismatch for roberta.embeddings.word_embeddings.weight: copying a param with shape torch.Size([32000, 768]) from checkpoint, the shape in current model is torch.Size([32004, 768]).
...
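The traceback above shows a checkpoint with 32000 embedding rows being copied into a model built for 32004. A common way to avoid this is to build every model from the pretrained checkpoint with its original vocabulary size and only then grow the embeddings to the tokenizer's size. The helper below is a minimal sketch of that second step. ensure_vocab_size is a hypothetical name, and it relies only on the get_input_embeddings() and resize_token_embeddings() methods of Hugging Face models. Whether this matches the fix finally used in this thread is not stated.

```python
def ensure_vocab_size(model, tokenizer):
    """Resize the model's token embeddings to match the tokenizer, if needed.

    Expects a Hugging Face PreTrainedModel and tokenizer (or any objects
    exposing the same three calls). Returns True when a resize happened.
    """
    current = model.get_input_embeddings().num_embeddings
    target = len(tokenizer)
    if current != target:
        model.resize_token_embeddings(target)
        return True
    return False
```

When running a hyperparameter search, the call would go inside the function that creates a fresh model for each trial, right after from_pretrained, so that the second and later trials get the same 32004-row embedding as the first.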
An error saying that the trial is missing assignments, raised in huggingface trainer.py.
...
(pid=23933) File "/opt/conda/envs/basic/lib/python3.8/site-packages/transformers/integrations.py", line 183, in _objective
(pid=23933) local_trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
(pid=23933) File "/opt/conda/envs/basic/lib/python3.8/site-packages/transformers/trainer.py", line 1241, in train
(pid=23933) self.state.trial_params = hp_params(trial.assignments) if trial is not None else None
(pid=23933) AttributeError: 'dict' object has no attribute 'assignments'
...
# self.state.trial_params = hp_params(trial.assignments) if trial is not None else None
self.state.trial_params = hp_params(trial) if trial is not None else None
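The patch above passes the trial straight to hp_params, which fits this case because the traceback shows the trial arriving as a plain dict. A backend whose trial object does expose assignments would lose them with that change, so a slightly more defensive variant might look like the sketch below. trial_assignments is a hypothetical helper, not part of transformers.

```python
def trial_assignments(trial):
    """Return trial.assignments when present, otherwise the trial itself."""
    if trial is None:
        return None
    # dicts have no `assignments` attribute, so they pass through unchanged
    return getattr(trial, 'assignments', trial)
```

The patched line in trainer.py would then read self.state.trial_params = hp_params(trial_assignments(trial)) if trial is not None else None.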
I did get final working code, but population-based training requires saving checkpoints, so it eats up server disk space quickly.
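One way to keep disk usage bounded, as a sketch under assumptions: Hugging Face's TrainingArguments has a save_total_limit option that caps the checkpoints kept per output directory, and Ray Tune releases from around this time accepted a keep_checkpoints_num argument (worth checking against the installed version, assuming the search runs through Ray Tune as the (pid=...) log prefixes suggest). For anything those options miss, the hypothetical helper below removes older checkpoint-<step> directories by hand. Since population-based training reads checkpoints back, it is probably only safe to run on trials that have already finished.

```python
import shutil
from pathlib import Path


def prune_checkpoints(run_dir, keep=1):
    """Delete all but the `keep` highest-step `checkpoint-<step>` dirs in run_dir.

    Returns the names of the removed directories, lowest step first.
    """
    found = []
    for path in Path(run_dir).glob('checkpoint-*'):
        step = path.name.rsplit('-', 1)[-1]
        if path.is_dir() and step.isdigit():
            found.append((int(step), path))
    found.sort(key=lambda item: item[0])  # ascending by step number
    doomed = found[:-keep] if keep > 0 else found
    for _, path in doomed:
        shutil.rmtree(path)
    return [path.name for _, path in doomed]
```

Sorting by the step number in the directory name, rather than by modification time, keeps the result stable even when several checkpoints are written within the same second.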
@KimDaeUng, could you link to the code by its commit hash?
Other