KoGPT2-FineTuning

SKT-AI에서 약 20GB의 한국어 데이터를 Pre-Training 시킨 KoGPT2를 사용했습니다. 첫 번째로 가사 작사를 위해서, 저작권이 만료된 정제된 가사 데이터, 소설, 시, 기사 등을 Fine-tuning 한 결과물입니다. 아래에서, 다양한 한국어 가사를 학습한 결과를 확인 할 수 있습니다. FineTuning 작업은 여기에 있는 Code를 사용하는 것을 권장합니다.

Sample

Fine Tuning

python main.py --epoch=200 --data_file_path=./dataset/All_make_lyrics_dataset.txt --save_path=./checkpoint/ --load_path=./checkpoint/auto_enter/KoGPT2_checkpoint_18500.tar --batch_size=8

parser

parser.add_argument('--epoch', type=int, default=200,
                    help="epoch 를 통해서 학습 범위를 조절합니다.")
parser.add_argument('--save_path', type=str, default='./checkpoint/',
                    help="학습 결과를 저장하는 경로입니다.")
parser.add_argument('--load_path', type=str, default='./checkpoint/Alls/KoGPT2_checkpoint_296000.tar', 
                    help="학습된 결과를 불러오는 경로입니다.")
parser.add_argument('--samples', type=str, default="samples/",
                    help="생성 결과를 저장할 경로입니다.")
parser.add_argument('--data_file_path', type=str, default='dataset/lyrics_dataset.txt',
                    help="학습할 데이터를 불러오는 경로입니다.")
parser.add_argument('--batch_size', type=int, default=8,
                    help="batch_size 를 지정합니다.")

Use Colab

Colab을 이용해서 Fine-tuning Code를 실행할 수 있습니다.

Runtime Disconnection Prevention

function ClickConnect() {
    // 백엔드를 할당하지 못했습니다.
    // GPU이(가) 있는 백엔드를 사용할 수 없습니다. 가속기가 없는 런타임을 사용하시겠습니까?
    // 취소 버튼을 찾아서 클릭
    var buttons = document.querySelectorAll("colab-dialog.yes-no-dialog paper-button#cancel"); 
    buttons.forEach(function(btn) {
        btn.click();
    });
    console.log("1분 마다 다시 연결");
    document.querySelector("#top-toolbar > colab-connect-button").click();
}
setInterval(ClickConnect,1000*60);

Clear the screen every 10 minutes

function CleanCurrentOutput(){ 
    var btn = document.querySelector(".output-icon.clear_outputs_enabled.output-icon-selected[title$='현재 실행 중...'] iron-icon[command=clear-focused-or-selected-outputs]");
    if(btn) {
        console.log("10분 마다 출력 지우기");
        btn.click();
    }
} 
setInterval(CleanCurrentOutput,1000*60*10);

GPU Memory Check

nvidia-smi.exe

generator

python generator.py --temperature=1.0 --text_size=1000 --tmp_sent=""

표절 없음

python generator.py --temperature=5.0 --text_size=500 --tmp_sent=""

parser

parser.add_argument('--temperature', type=float, default=0.7,
                    help="temperature 를 통해서 글의 창의성을 조절합니다.")
parser.add_argument('--top_p', type=float, default=0.9,
                    help="top_p 를 통해서 글의 표현 범위를 조절합니다.")
parser.add_argument('--top_k', type=int, default=40,
                    help="top_k 를 통해서 글의 표현 범위를 조절합니다.")
parser.add_argument('--text_size', type=int, default=250,
                    help="결과물의 길이를 조정합니다.")
parser.add_argument('--loops', type=int, default=-1,
                    help="글을 몇 번 반복할지 지정합니다. -1은 무한반복입니다.")
parser.add_argument('--tmp_sent', type=str, default="사랑",
                    help="글의 시작 문장입니다.")
parser.add_argument('--load_path', type=str, default="./checkpoint/Alls/KoGPT2_checkpoint_296000.tar",
                    help="학습된 결과물을 저장하는 경로입니다.")

Use Colab

Colab을 이용해서 generator를 실행할 수 있습니다.

tensorboard

학습에 따른 변화를 확인하기 위해서, tensorboard로 접근하여 loss와 text를 확인합니다.

tensorboard --logdir=runs

loss

text

Citation

@misc{KoGPT2-FineTuning,
  author = {gyung},
  title = {KoGPT2-FineTuning},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/gyunggyung/KoGPT2-FineTuning}},
}

Output

자세한 결과물은 samples에서 확인 할 수 있습니다. 학습에 대해서는 관련 포스팅에서 확인할 수 있습니다.

Reference

https://github.com/openai/gpt-2
https://github.com/nshepperd/gpt-2
https://github.com/SKT-AI/KoGPT2
https://github.com/asyml/texar-pytorch/tree/master/examples/gpt-2
https://github.com/graykode/gpt-2-Pytorch
https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317
https://github.com/shbictai/narrativeKoGPT2
https://github.com/ssut/py-hanspell
https://github.com/likejazz/korean-sentence-splitter

forus-ai / KoGPT2-Lyrics-Generation-FineTuning-Version1

readme

KoGPT2-FineTuning

Sample

Fine Tuning

parser

Use Colab

Runtime Disconnection Prevention

Clear the screen every 10 minutes

GPU Memory Check

generator

표절 없음

parser

Use Colab

tensorboard

loss

text

Citation

Output

Reference