boostcampaitech2 / klue-level2-nlp-06

KLUE-RE - Relation Extraction

BoostCamp AI Tech - [NLP] Relation Extraction between Entities in a Sentence

Relation Extraction For Korean

RE (Relation Extraction) is the task of identifying semantic relations between entity pairs in a text. Each relation is defined over an entity pair consisting of a subject entity and an object entity; the goal is to pick the appropriate relation between these two entities in a Korean sentence.

Project Overview

- Project Goal
- Dataset
- Data Preprocessing
- Evaluation Metrics

Table of Contents

  1. Prerequisites Installation
  2. Quick Start
  3. Best Score Model
  4. Model Architecture
  5. Usage
  6. Config Arguments
  7. References
  8. Contributors

1. Prerequisites Installation

The dependencies listed in requirements.txt can be installed with pip:

$ pip install -r requirements.txt

2. Quick Start

3. Best Score Model

4. Model Architecture

5. Usage

Using Focal Loss

    "focal_loss": {
        "true": true,
        "alpha": 0.1,
        "gamma": 0.25
    },
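For reference, a minimal sketch of a multi-class focal loss with the `alpha`/`gamma` knobs above, built on plain PyTorch cross-entropy (the repo's implementation may differ in details):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, labels, alpha=0.1, gamma=0.25):
    """Focal loss: down-weights easy (well-classified) examples so training
    focuses on hard, often minority-class, examples."""
    ce = F.cross_entropy(logits, labels, reduction="none")  # per-sample cross-entropy
    pt = torch.exp(-ce)                                      # probability of the true class
    return (alpha * (1.0 - pt) ** gamma * ce).mean()
```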

Using Imbalanced Sampler

"Trainer" : {
      "use_imbalanced_sampler" : true 
    },
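A rough sketch of class-balanced sampling with plain PyTorch (the repo may instead rely on a dedicated imbalanced-sampler package; `train_dataset` and `train_labels` below are placeholders):

```python
from collections import Counter

from torch.utils.data import DataLoader, WeightedRandomSampler

def make_balanced_sampler(labels):
    """Draw each example with probability inversely proportional to its class
    frequency, so rare relation labels appear more often per epoch."""
    counts = Counter(labels)
    weights = [1.0 / counts[y] for y in labels]
    return WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)

# usage sketch:
# loader = DataLoader(train_dataset, batch_size=32,
#                     sampler=make_balanced_sampler(train_labels))
```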

Using BERT-style Tokenization

BERT input (next-sentence-prediction format):

    [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]  LABEL = IsNext

BERT-style input for relation extraction:

    [CLS][obj] 변정수[/obj] 씨는 1994년 21살의 나이에 7살 연상 남편과 결혼해 슬하에 두 딸 [subj]유채원[/subj], 유정원 씨를 두고 있다. [SEP][obj][PER]변정수[/obj][subj][PER]유채원[/subj] [SEP]

"dataPP" :{ 
    "active" : true,
    "entityInfo" : "entity&token",
    "sentence" : "entity"
},
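A simplified sketch of how such an entity-marked, two-segment input could be built; the marker tokens follow the example above, but the actual preprocessing code may differ:

```python
def mark_entities(sentence, subj, obj, subj_type="PER", obj_type="PER"):
    """Wrap the subject/object spans with [subj]/[obj] markers and build a
    second segment carrying the typed entities, mimicking BERT's sentence-pair
    format ([CLS]/[SEP] are added later by the tokenizer)."""
    marked = sentence.replace(subj, f"[subj]{subj}[/subj]")
    marked = marked.replace(obj, f"[obj]{obj}[/obj]")
    entity_info = f"[obj][{obj_type}]{obj}[/obj][subj][{subj_type}]{subj}[/subj]"
    return marked, entity_info  # e.g. tokenizer(marked, entity_info, ...)
```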

AEDA

Disable AEDA:

    "aeda" : "None"

Default: apply AEDA to the 15 least-frequent labels (requires Mecab):

    "aeda" : "default"

How to install Mecab

sudo apt install g++
sudo apt update
sudo apt install default-jre
sudo apt install default-jdk
pip install konlpy

# install khaiii
cd ~
git clone https://github.com/kakao/khaiii.git
cd khaiii
mkdir build
cd build
pip install cmake
sudo apt-get install cmake
cmake ..
make resource
sudo make install
make package_python
cd package_python
pip install .
cd ~
apt-get install locales
locale-gen en_US.UTF-8
pip install tweepy==3.7.0
# install mecab
wget https://bitbucket.org/eunjeon/mecab-ko/downloads/mecab-0.996-ko-0.9.2.tar.gz
tar xvfz mecab-0.996-ko-0.9.2.tar.gz
cd mecab-0.996-ko-0.9.2
./configure
make
make check
sudo make install
sudo ldconfig
cd ~
wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.1.1-20180720.tar.gz
tar xvfz mecab-ko-dic-2.1.1-20180720.tar.gz
cd mecab-ko-dic-2.1.1-20180720
./configure
make
sudo make install
cd ~
mecab -d /usr/local/lib/mecab/dic/mecab-ko-dic
apt install curl
apt install git
bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)
pip install mecab-python

Custom: no Mecab installation required

Augmentation is run for every label whose sample count is less than 0.4 × the number of no_relation samples: the sentence is split on spaces (' '), the tokens belonging to each entity are re-joined, and AEDA is applied.

"aeda" : "custom"

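AEDA itself simply inserts random punctuation between words; a minimal sketch of the core operation (the repo's version additionally handles Mecab tokenization and the entity spans as described above):

```python
import random

PUNCTUATIONS = [".", ",", "!", "?", ";", ":"]

def aeda(sentence, ratio=0.3):
    """Insert between 1 and ratio*len(words) random punctuation marks at
    random positions in the sentence."""
    words = sentence.split(" ")
    n_insert = random.randint(1, max(1, int(ratio * len(words))))
    for _ in range(n_insert):
        pos = random.randint(0, len(words))
        words.insert(pos, random.choice(PUNCTUATIONS))
    return " ".join(words)
```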
aug family

Augmentation for the family-relation labels that the model tends to confuse.

"aug_family" : true

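One common way to augment such labels, and only a guess at what this option does, is to swap the subject and object entities and map the label to its inverse (KLUE-RE defines per:children / per:parents plus the symmetric per:siblings, per:spouse, per:other_family):

```python
# Hypothetical inverse map for the family-relation labels (illustrative only).
FAMILY_INVERSE = {
    "per:children": "per:parents",
    "per:parents": "per:children",
    "per:siblings": "per:siblings",
    "per:spouse": "per:spouse",
    "per:other_family": "per:other_family",
}

def augment_family(subject_entity, object_entity, label):
    """Swap the entity roles and flip the label to its inverse, yielding one
    extra training example per family-relation sample."""
    if label not in FAMILY_INVERSE:
        return None
    return object_entity, subject_entity, FAMILY_INVERSE[label]
```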
typed entity

Typed entity markers, following "An Improved Baseline for Sentence-level Relation Extraction" by Wenxuan Zhou and Muhao Chen.

"type_ent_marker" : true

typed punct

Typed entity markers (punct variant), following "An Improved Baseline for Sentence-level Relation Extraction" by Wenxuan Zhou and Muhao Chen.

"type_punct" : true

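The two options correspond to the "typed entity marker" and "typed entity marker (punct)" input formats from that paper; a sketch of both, with the caveat that the exact marker tokens used in this repo may differ:

```python
def typed_entity_marker(sentence, subj, obj, subj_type, obj_type):
    """Typed entity marker: '<S:PER> subj </S:PER>' and '<O:PER> obj </O:PER>'."""
    sentence = sentence.replace(subj, f"<S:{subj_type}> {subj} </S:{subj_type}>")
    return sentence.replace(obj, f"<O:{obj_type}> {obj} </O:{obj_type}>")

def typed_punct_marker(sentence, subj, obj, subj_type, obj_type):
    """Typed entity marker (punct): '@ * per * subj @' and '# ^ per ^ obj #'."""
    sentence = sentence.replace(subj, f"@ * {subj_type.lower()} * {subj} @")
    return sentence.replace(obj, f"# ^ {obj_type.lower()} ^ {obj} #")
```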
6. Config Arguments

Wandb

RoBERTa-large configuration:

| Argument | DataType | Default | Help |
| --- | --- | --- | --- |
| name | str | "roberta_large_stratified" | Wandb run name |
| tags | list | ["ROBERT_LARGE", "stratified", "10epoch"] | Wandb tags |
| group | str | "ROBERT_LARGE" | Wandb group name |

XLM-RoBERTa-large configuration:

| Argument | DataType | Default | Help |
| --- | --- | --- | --- |
| name | str | "XLM-RoBERTa-large" | Wandb run name |
| tags | list | ["XLM-RoBERTa-large", "stratified", "10epoch"] | Wandb tags |
| group | str | "XLM-RoBERTa-large" | Wandb group name |
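These values are typically forwarded to `wandb.init`; a sketch using the first configuration (the project name is an assumption):

```python
import wandb

wandb.init(
    project="klue-re",  # assumed project name
    name="roberta_large_stratified",
    tags=["ROBERT_LARGE", "stratified", "10epoch"],
    group="ROBERT_LARGE",
)
```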

Focal Loss

| Argument | DataType | Default | Help |
| --- | --- | --- | --- |
| true | bool | false | whether to use focal loss |
| alpha | float | 0.1 | balancing factor of the focal loss |
| gamma | float | 0.25 | focusing parameter that smoothly adjusts the down-weighting rate |

Train Arguments

| Argument | DataType | Default | Help |
| --- | --- | --- | --- |
| output_dir | str | "./results" | results directory |
| save_total_limit | int | 10 | maximum number of saved checkpoints |
| save_steps | int | 100 | checkpoint saving interval (steps) |
| num_train_epochs | int | 3 | number of training epochs |
| learning_rate | float | 5e-5 | learning rate |
| per_device_train_batch_size | int | 38 | train batch size |
| per_device_eval_batch_size | int | 38 | evaluation batch size |
| warmup_steps | int | 500 | lr scheduler warm-up steps |
| weight_decay | float | 0.01 | AdamW weight decay |
| logging_dir | str | "./logs" | logging directory |
| logging_steps | int | 100 | logging interval (steps) |
| evaluation_strategy | str | "steps" | evaluation strategy ("epoch" or "steps") |
| eval_steps | int | 100 | evaluation interval (steps) |
| load_best_model_at_end | bool | true | load the best checkpoint (by loss) at the end of training |

| Argument | DataType | Default | Help |
| --- | --- | --- | --- |
| output_dir | str | "./results" | results directory |
| save_total_limit | int | 10 | maximum number of saved checkpoints |
| save_steps | int | 100 | checkpoint saving interval (steps) |
| num_train_epochs | int | 10 | number of training epochs |
| learning_rate | float | 5e-5 | learning rate |
| per_device_train_batch_size | int | 31 | train batch size |
| per_device_eval_batch_size | int | 31 | evaluation batch size |
| warmup_steps | int | 500 | lr scheduler warm-up steps |
| weight_decay | float | 0.01 | AdamW weight decay |
| logging_dir | str | "./logs" | logging directory |
| logging_steps | int | 100 | logging interval (steps) |
| evaluation_strategy | str | "steps" | evaluation strategy ("epoch" or "steps") |
| eval_steps | int | 100 | evaluation interval (steps) |
| load_best_model_at_end | bool | true | load the best checkpoint (by loss) at the end of training |
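These arguments map directly onto HuggingFace `transformers.TrainingArguments`; a sketch using the values from the second table:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    save_total_limit=10,
    save_steps=100,
    num_train_epochs=10,
    learning_rate=5e-5,
    per_device_train_batch_size=31,
    per_device_eval_batch_size=31,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
    load_best_model_at_end=True,
)
```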

7. References

Easy Data Augmentation Paper
Korean WordNet

8. Contributors

나요한_T2073 : https://github.com/nudago
백재형_T2102 : https://github.com/BaekTree
송민재_T2116 : https://github.com/Jjackson-dev
이호영_T2177 : https://github.com/hylee-250
정찬미_T2207 : https://github.com/ChanMiJung
한진_T2237 : https://github.com/wlsl8135/
홍석진_T2243 : https://github.com/HongCu