`LSTM` implementation: why doesn't the loss converge?

eubinecto commented 2 years ago

To-do's

[x] Train on ainze and check whether the loss drops further than the RNN's

eubinecto commented 2 years ago

The loss does not drop much: it plateaus.

https://wandb.ai/eubinecto/the-clean-rnns/runs/2auoujj3?workspace=user-eubinecto

Even the LSTM does not converge to 0 in the first place. Why? Whatever else is going on, train_loss should converge toward 0, so this needs to be resolved.

eubinecto commented 2 years ago

Question - were the initial weights set badly?

Experiment

Let's re-initialise the weights with Xavier uniform:

    def on_train_start(self):
        # deep models should be initialised with so-called "Xavier initialisation"
        # refer to: https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
        for param in tqdm(self.parameters(), desc="initialising weights..."):
            if param.dim() > 1:
                torch.nn.init.xavier_uniform_(param)
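
For reference, Xavier (Glorot) uniform initialisation samples each weight from U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out)). A minimal sketch of that bound in plain Python; the helper name and the layer shape below are hypothetical and only there for illustration:

```python
import math


def xavier_uniform_bound(fan_in: int, fan_out: int, gain: float = 1.0) -> float:
    """Return the bound a such that weights are drawn from U(-a, a)."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


# hypothetical layer shape, not taken from this repository
bound = xavier_uniform_bound(fan_in=128, fan_out=512)
print(f"weights would be sampled from U(-{bound:.4f}, {bound:.4f})")
```

The larger the layer, the narrower the interval, which keeps activation variance roughly stable across layers.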

Result

https://wandb.ai/eubinecto/the-clean-rnns/runs/2au83hcm?workspace=user-eubinecto

It still stalls at around 0.69.
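
Worth noting: 0.69 is very close to ln 2 ≈ 0.693, which is the cross-entropy of a two-class classifier that outputs probability 0.5 for every example. Assuming this is a binary classification task trained with cross-entropy (an assumption, the thread does not state the loss), a plateau at that value suggests the model is doing no better than chance. A quick check of the arithmetic:

```python
import math


def cross_entropy(p_true_class: float) -> float:
    """Cross-entropy for one example, given the probability assigned to its true class."""
    return -math.log(p_true_class)


# a model that always predicts 0.5 for both classes (assumed binary task)
chance_loss = cross_entropy(0.5)
print(round(chance_loss, 4))
```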

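Retrospective

The problem must be somewhere else. Training takes so long that quick debugging is hard, so I need a way to overfit only a part of the dataset instead of the whole thing. If the model cannot overfit even 5% of the data, there is most likely a bug somewhere in the code.

shorten epochs: https://pytorch-lightning.readthedocs.io/en/stable/common/debugging.html#shorten-epochs

Solution?

I don't know the solution yet. To gather more information, let's first check whether the loss still fails to converge when training on only 1% of the training data.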

eubinecto commented 2 years ago

Question - can it not learn even 1% of the data?

Experiment

Let's fit on just 1% of the data:

    python3 run_train.py eubinecto lstm_for_classification --limit_train_batches=0.1 --limit_val_batches=0.1
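
In PyTorch Lightning, a float passed to `limit_train_batches` is treated as a fraction of the available batches, so `0.1` selects roughly 10% of them and `0.01` roughly 1%. A rough sketch of that arithmetic, assuming `run_train.py` forwards these flags to the Lightning `Trainer`; the helper and the batch count are hypothetical:

```python
def batches_used(total_batches: int, limit: float) -> int:
    """Approximate number of batches run per epoch for a fractional limit."""
    return int(total_batches * limit)


# hypothetical epoch of 1000 batches
print(batches_used(1000, 0.1))
print(batches_used(1000, 0.01))
```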

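Result

Nothing changes. It still does not converge.

Retrospective

So the size of the data is not the key variable. Large or small, the model fails to learn.

Solution?

It could be a problem with the model, but before digging deeper, let's check the easier things first. It could be the learning rate. Let's see how training changes when the learning rate is very large, at its current value, and very small.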

eubinecto commented 2 years ago

Is `lr` too large or too small?

Experiment

Let's run on 1% of the data with each of the following:

- lr = 0.1
- lr = 0.00001
- lr = 0.0000001
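
As a reminder of what each extreme usually looks like, here is a toy sketch of plain gradient descent on f(w) = w². The learning rates are chosen for this toy function only and are not the ones used in the runs above: too large a step overshoots and diverges, too small a step barely moves.

```python
def descend(lr: float, steps: int = 50, w: float = 1.0) -> float:
    """Run plain gradient descent on f(w) = w**2 and return the final w."""
    for _ in range(steps):
        grad = 2.0 * w  # derivative of w**2
        w = w - lr * grad
    return w


# toy learning rates: too large, reasonable, too small
for lr in (1.1, 0.1, 1e-7):
    print(f"lr={lr}: final w = {descend(lr):.6g}")
```

If none of the three regimes changes the loss curve, as in a sweep like this one, the learning rate is probably not the bottleneck.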

Result

| lr | Result | Link |
| --- | --- | --- |
| 0.1 | | https://wandb.ai/eubinecto/the-clean-rnns/runs/3r1ddk1l?workspace=user-eubinecto |
| 0.00001 | | https://wandb.ai/eubinecto/the-clean-rnns/runs/3snkuz90?workspace=user-eubinecto |
| 0.0000001 | | https://wandb.ai/eubinecto/the-clean-rnns/runs/rzgey2ga?workspace=user-eubinecto |

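Retrospective

Hmm.. neither direction helps much.

Solution?

Ah, I saw a similar pattern before when implementing a Transformer, and the problem then was that padding tokens were not being ignored. I think the LSTM is suffering from something similar. With post-padding, if a sequence has 149 padding tokens and only 1 word, the information in the last hidden state can be diluted by the meaningless padding tokens.

I also found a paper on this: https://arxiv.org/abs/1903.07288

Performance varies widely depending on whether the padding is placed at the front or at the back.

For that reason, it is preferable to use pre-padding rather than post-padding as the padding strategy.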
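
The dilution can be seen in a toy example. This is a minimal sketch with a scalar tanh RNN rather than an LSTM, assuming a zero initial state, no bias, and a padding embedding of 0; the weights are made up. With post-padding, the one real word is followed by 149 padding steps that shrink the state toward 0. With pre-padding, the state stays at 0 through the padding and the real word is the last thing the network sees.

```python
import math


def last_hidden(inputs, w: float = 1.0, u: float = 0.5) -> float:
    """Scalar RNN: h_t = tanh(w * x_t + u * h_{t-1}), returning the final h."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w * x + u * h)
    return h


PAD, WORD = 0.0, 1.0  # assumed toy embeddings
post_padded = [WORD] + [PAD] * 149
pre_padded = [PAD] * 149 + [WORD]

print("post-padding, last hidden:", last_hidden(post_padded))
print("pre-padding,  last hidden:", last_hidden(pre_padded))
```

An LSTM has gates that can in principle learn to hold the state through padding, so the effect is less extreme there, but the model still has to learn that behaviour before it can learn anything else.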

eubinecto commented 2 years ago

Is the padding strategy the problem?

Experiment

post-padding -> pre-padding

    tokenizer.enable_padding(pad_token=pad_token,
                             direction="left",
                             pad_id=tokenizer.token_to_id(pad_token),  # noqa
                             length=max_length)
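
To make the difference concrete, here is a minimal plain-Python sketch of what left versus right padding does to a list of token ids. `pad_ids` is a hypothetical helper for illustration, not part of the `tokenizers` API, and it does not truncate sequences longer than `length`:

```python
def pad_ids(ids, pad_id: int, length: int, direction: str = "right"):
    """Pad a list of token ids to `length`, on the left or on the right."""
    padding = [pad_id] * max(0, length - len(ids))
    return padding + ids if direction == "left" else ids + padding


ids = [5, 6, 7]  # made-up token ids
print(pad_ids(ids, pad_id=0, length=6, direction="right"))  # post-padding
print(pad_ids(ids, pad_id=0, length=6, direction="left"))   # pre-padding
```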

Result

Bingo!

https://wandb.ai/eubinecto/the-clean-rnns/runs/3snkuz90?workspace=user-eubinecto

Retrospective

As expected.. the problem is always padding... heh.

Solution?

Now let's run it on the full dataset. It should be fine to set the learning rate a bit higher.

eubinecto commented 2 years ago

Results after retraining

| Model | f1 | Link |
| --- | --- | --- |
| rnn | | https://wandb.ai/eubinecto/the-clean-rnns/runs/40ca3shv?workspace=user-eubinecto |
| lstm | | https://wandb.ai/eubinecto/the-clean-rnns/runs/25wm1ome?workspace=user-eubinecto |

Both models clearly improved by a wide margin. Padding... be careful with it!
