Model: monologg/koelectra-base-v3-discriminator, 0.91x, lr=0.00002860270719188072, weight_decay=0.5, batch_size=16 (success)

Model: lighthouse/mdeberta-v3-base-kor-further, 0.91x, lr=0.00002340865224868444, weight_decay=0.5, batch_size=8 (success)

Preprocessing technique

from soynlp.normalizer import repeat_normalize

def preprocess_text(self, text):
    # normalize repeated characters using soynlp library
    text = repeat_normalize(text, num_repeats=2)
    # remove stopwords (disabled)
    # text = ' '.join([token for token in text.split() if token not in stopwords])
    # remove special characters and numbers (disabled)
    # text = re.sub('[^가-힣 ]', '', text)
    # text = re.sub('[^a-zA-Zㄱ-ㅎ가-힣]', '', text)
    # tokenize text using soynlp tokenizer
    # (Regextokenizer is assumed to be a soynlp RegexTokenizer instance created elsewhere)
    tokens = Regextokenizer.tokenize(text)
    # lowercase all tokens
    tokens = [token.lower() for token in tokens]
    # join tokens back into sentence
    text = ' '.join(tokens)
    # spacing correction (disabled)
    # kospacing_sent = spacing(text)
    return text

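1. Model training: mdeberta is trained for about 8 epochs, and the epoch-7 checkpoint is used.
1-2. The same preprocessing is applied to the Dev dataset, the lr is reduced to 1/10, and the model is trained for 2 epochs.
2. Model training: koelectra is trained for about 10 epochs, and the epoch-6 checkpoint is used.
2-1. The same preprocessing is applied to the Dev dataset, the lr is reduced to 1/10, and the model is trained for 2 epochs.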
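
The two-stage schedule described in these steps can be written down as a small plan object. This is only a minimal sketch: the names `Stage` and `two_stage_plan` are hypothetical and do not come from the repository, and reading "epoch 7 used" and "epoch 6 used" as the number of first-stage epochs kept is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    dataset: str   # split the model is trained on in this stage
    lr: float      # learning rate for this stage
    epochs: int    # number of epochs in this stage


def two_stage_plan(base_lr: float, kept_epoch: int,
                   dev_epochs: int = 2, lr_factor: float = 0.1) -> List[Stage]:
    """Hypothetical helper: stage 1 on train, stage 2 on dev with a reduced lr."""
    return [
        Stage("train", base_lr, kept_epoch),
        # The notes say the lr is cut to 1/10 before the 2 dev epochs.
        Stage("dev", base_lr * lr_factor, dev_epochs),
    ]


# Learning rates are the ones listed in the notes.
mdeberta_plan = two_stage_plan(0.00002340865224868444, kept_epoch=7)
koelectra_plan = two_stage_plan(0.00002860270719188072, kept_epoch=6)

for name, plan in [("mdeberta", mdeberta_plan), ("koelectra", koelectra_plan)]:
    for stage in plan:
        print(f"{name}: {stage.dataset} lr={stage.lr:.3e} epochs={stage.epochs}")
```

Keeping the plan as data rather than hard-coding it in the training loop makes it easy to apply the same lr reduction to both models.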

The two models are ensembled with the ESNB code.
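
The notes do not say how the ESNB code combines the two models, so the following is only a sketch of one common choice: a (weighted) average of the per-sample predictions, clipped to the label range. The function name `ensemble_average`, the example predictions, and the 0 to 5 score range are all assumptions, not taken from the repository.

```python
from typing import List, Optional, Sequence


def ensemble_average(predictions: Sequence[Sequence[float]],
                     weights: Optional[Sequence[float]] = None,
                     low: float = 0.0, high: float = 5.0) -> List[float]:
    """Weighted average of several models' predictions, clipped to [low, high]."""
    if not predictions:
        raise ValueError("need at least one prediction list")
    n = len(predictions[0])
    if any(len(p) != n for p in predictions):
        raise ValueError("all prediction lists must have the same length")
    if weights is None:
        weights = [1.0] * len(predictions)  # plain average by default
    if len(weights) != len(predictions):
        raise ValueError("one weight per model is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    averaged = []
    for i in range(n):
        score = sum(w * p[i] for w, p in zip(weights, predictions)) / total
        averaged.append(min(high, max(low, score)))  # keep within label range
    return averaged


# Made-up predictions for three sentence pairs, for illustration only.
mdeberta_preds = [0.4, 3.8, 5.2]
koelectra_preds = [0.6, 4.0, 4.9]
print(ensemble_average([mdeberta_preds, koelectra_preds]))
```

Passing `weights` allows one model to count more than the other, for example when its validation score is higher.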

boostcampaitech5 / level1_semantictextsimilarity-nlp-11

[Test score : 0.9222] #23
