SKTBrain / KoBERT

Korean BERT pre-trained cased (KoBERT)
Apache License 2.0

[BUG] #107

Open kotran88 opened 1 year ago

kotran88 commented 1 year ago

🐛 Bug

I'm working from the basic sentiment classification example and the example at https://www.dinolabs.ai/271, searching for fixes and correcting errors as I go.

I'm running this on Colab.

!pip install mxnet
!pip install gluonnlp pandas tqdm
!pip install sentencepiece==0.1.91
!pip install transformers==4.8.2
!pip install torch
!pip install gluonnlp==0.10.0

!pip install 'git+https://github.com/SKTBrain/KoBERT.git#egg=kobert_tokenizer&subdirectory=kobert_hf'

import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import gluonnlp as nlp
import numpy as np
from tqdm import tqdm, tqdm_notebook
from kobert_tokenizer import KoBERTTokenizer
from transformers import BertModel
from transformers import AdamW
from transformers.optimization import get_cosine_schedule_with_warmup
import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import gluonnlp as nlp
import numpy as np
from tqdm.notebook import tqdm
tokenizer = KoBERTTokenizer.from_pretrained('skt/kobert-base-v1')
bertmodel = BertModel.from_pretrained('skt/kobert-base-v1', return_dict=False)
vocab = nlp.vocab.BERTVocab.from_sentencepiece(tokenizer.vocab_file, padding_token='[PAD]')

def get_kobert_model(model_path, vocab_file, ctx="cpu"):
    bertmodel = BertModel.from_pretrained(model_path)
    device = torch.device(ctx)
    bertmodel.to(device)
    bertmodel.eval()
    vocab_b_obj = nlp.vocab.BERTVocab.from_sentencepiece(vocab_file,
                                                         padding_token='[PAD]')
    return bertmodel, vocab_b_obj
bertmodel, vocab = get_kobert_model('skt/kobert-base-v1',tokenizer.vocab_file)

from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
chatbot_data = pd.read_excel('drive/MyDrive/korean.xlsx')

from sklearn.model_selection import train_test_split
dataset_train, dataset_test = train_test_split(data_list, test_size=0.25, random_state=0)
class BERTDataset(Dataset):
    def __init__(self, dataset, sent_idx, label_idx, bert_tokenizer, max_len,
                 pad, pair):
        transform = nlp.data.BERTSentenceTransform(
            bert_tokenizer, max_seq_length=max_len, pad=pad, pair=pair)
        self.sentences = [transform([i[sent_idx]]) for i in dataset]
        self.labels = [np.int32(i[label_idx]) for i in dataset]

    def __getitem__(self, i):
        return (self.sentences[i] + (self.labels[i], ))

    def __len__(self):
        return (len(self.labels))

tok = nlp.data.BERTSPTokenizer(tokenizer, vocab, lower=False)
data_train = BERTDataset(dataset_train, 0, 1, tok, max_len, True, False)
data_test = BERTDataset(dataset_test, 0, 1, tok, max_len, True, False)

The error occurs in the last part.


TypeError                                 Traceback (most recent call last)
<ipython-input-60-1574bbdbfa0b> in <cell line: 2>()
      1 tok = nlp.data.BERTSPTokenizer(tokenizer, vocab, lower=False)
----> 2 data_train = BERTDataset(dataset_train, 0, 1, tok, max_len, True, False)
      3 data_test = BERTDataset(dataset_test, 0, 1, tok, max_len, True, False)

10 frames
/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py in LoadFromFile(self, arg)
    308 
    309     def LoadFromFile(self, arg):
--> 310         return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
    311 
    312     def _EncodeAsIds(self, text, enable_sampling, nbest_size, alpha, add_bos, add_eos, reverse, emit_unk_piece):

TypeError: not a string
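
One possible reading of this traceback, not confirmed anywhere in the thread: `LoadFromFile` is being handed something other than a string, and the only non-string candidate in the failing cell is the `KoBERTTokenizer` object passed as the first argument to `nlp.data.BERTSPTokenizer`. The post already uses `tokenizer.vocab_file` as a path on the `BERTVocab.from_sentencepiece` line, so the sketch below shows how a string path could be pulled out before calling a loader that only accepts `str`. `resolve_sentencepiece_path` and `FakeTokenizer` are hypothetical names made up for illustration; they are not part of KoBERT, gluonnlp, or sentencepiece.

```python
import os
from typing import Union


def resolve_sentencepiece_path(tokenizer_or_path: Union[str, os.PathLike, object]) -> str:
    """Return a plain string path for a loader that only accepts str.

    Hypothetical helper: not part of KoBERT, gluonnlp, or sentencepiece.
    """
    # Already a path: normalise to str (this also covers pathlib.Path).
    if isinstance(tokenizer_or_path, (str, os.PathLike)):
        return os.fspath(tokenizer_or_path)
    # Tokenizer-like object: fall back to its vocab_file attribute,
    # the same attribute the post reads as tokenizer.vocab_file.
    vocab_file = getattr(tokenizer_or_path, "vocab_file", None)
    if isinstance(vocab_file, (str, os.PathLike)):
        return os.fspath(vocab_file)
    raise TypeError(
        "expected a path or an object with a vocab_file path, "
        f"got {type(tokenizer_or_path).__name__}"
    )


class FakeTokenizer:
    """Stand-in for KoBERTTokenizer so the sketch runs without downloads."""

    def __init__(self, vocab_file: str) -> None:
        self.vocab_file = vocab_file


# Assumed example path; the real value would come from the loaded tokenizer.
sp_path = resolve_sentencepiece_path(FakeTokenizer("/tmp/spiece.model"))
print(sp_path)
```

In the notebook, the resolved string would go where the tokenizer object currently goes, for example `nlp.data.BERTSPTokenizer(sp_path, vocab, lower=False)`. Whether that change alone gets the rest of the pipeline running has not been tested here.
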
## To Reproduce
<!-- If you have code samples, error messages, stack traces, etc., please attach them. -->

Please describe the steps needed to reproduce the bug.

1. -
2. -
3. -

## Expected behavior
<!-- Describe the result you expected from running the code before the bug appeared. -->

## Environment
Google Colab (TPU)

## Additional context
<!-- Add any additional information here. -->

ChangZero commented 1 year ago

@kotran88 The code below might be a useful reference. https://github.com/ChangZero/koBERT-finetuning-demo/blob/main/kobert_colab.ipynb

Jhyunee commented 5 months ago

Did you manage to resolve this? I'm stuck on the same error and haven't been able to fix it either.