boostcampaitech2 / klue-level2-nlp-06

KLUE-RE - Relation Extraction

BoostCamp AI Tech - [NLP] Relation Extraction between Entities in a Sentence

Relation Extraction For Korean

RE (Relation Extraction) is the task of identifying semantic relations between entity pairs in a text. Each relation is defined over an entity pair consisting of a subject entity and an object entity; the goal is to pick the appropriate relation between these two entities in a Korean sentence.

Project Overview

- Project Goal
- Dataset
- Data Preprocessing
- Evaluation Metrics

Table of Contents

  1. Prerequisites Installation
  2. Quick Start
  3. Best Score Model
  4. Model Architecture
  5. Usage
  6. Config Arguments
  7. References
  8. Contributors

1. Prerequisites Installation

The dependencies listed in requirements.txt can be installed with pip:

$ pip install -r requirements.txt

2. Quick Start

3. Best Score Model

4. Model Architecture

5. Usage

Using Focal Loss

    "focal_loss": {
        "true": true,
        "alpha": 0.1,
        "gamma": 0.25
    },
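For reference, a minimal sketch of a multi-class focal loss with the `alpha`/`gamma` knobs above, built on plain PyTorch cross-entropy (the repo's implementation may differ in details):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, labels, alpha=0.1, gamma=0.25):
    """Focal loss: down-weights easy (well-classified) examples so training
    focuses on hard, often minority-class, examples."""
    ce = F.cross_entropy(logits, labels, reduction="none")  # per-sample cross-entropy
    pt = torch.exp(-ce)                                      # probability of the true class
    return (alpha * (1.0 - pt) ** gamma * ce).mean()
```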

Using Imbalanced Sampler

"Trainer" : {
      "use_imbalanced_sampler" : true 
    },
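A rough sketch of class-balanced sampling with plain PyTorch (the repo may instead rely on a dedicated imbalanced-sampler package; `train_dataset` and `train_labels` below are placeholders):

```python
from collections import Counter

from torch.utils.data import DataLoader, WeightedRandomSampler

def make_balanced_sampler(labels):
    """Draw each example with probability inversely proportional to its class
    frequency, so rare relation labels appear more often per epoch."""
    counts = Counter(labels)
    weights = [1.0 / counts[y] for y in labels]
    return WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)

# usage sketch:
# loader = DataLoader(train_dataset, batch_size=32,
#                     sampler=make_balanced_sampler(train_labels))
```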

Using BERT-style Tokenization

BERT input (next-sentence-prediction format):

    [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]  LABEL = IsNext

BERT-style input for relation extraction:

    [CLS][obj] 변정수[/obj] 씨는 1994년 21살의 나이에 7살 연상 남편과 결혼해 슬하에 두 딸 [subj]유채원[/subj], 유정원 씨를 두고 있다. [SEP][obj][PER]변정수[/obj][subj][PER]유채원[/subj] [SEP]

"dataPP" :{ 
    "active" : true,
    "entityInfo" : "entity&token",
    "sentence" : "entity"
},
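A simplified sketch of how such an entity-marked, two-segment input could be built; the marker tokens follow the example above, but the actual preprocessing code may differ:

```python
def mark_entities(sentence, subj, obj, subj_type="PER", obj_type="PER"):
    """Wrap the subject/object spans with [subj]/[obj] markers and build a
    second segment carrying the typed entities, mimicking BERT's sentence-pair
    format ([CLS]/[SEP] are added later by the tokenizer)."""
    marked = sentence.replace(subj, f"[subj]{subj}[/subj]")
    marked = marked.replace(obj, f"[obj]{obj}[/obj]")
    entity_info = f"[obj][{obj_type}]{obj}[/obj][subj][{subj_type}]{subj}[/subj]"
    return marked, entity_info  # e.g. tokenizer(marked, entity_info, ...)
```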

AEDA

Disable AEDA:

    "aeda" : "None"

Default: apply AEDA to the 15 least-frequent labels (requires Mecab):

    "aeda" : "default"

How to install Mecab

sudo apt install g++
sudo apt update
sudo apt install default-jre
sudo apt install default-jdk
pip install konlpy

# install khaiii
cd ~
git clone https://github.com/kakao/khaiii.git
cd khaiii
mkdir build
cd build
pip install cmake
sudo apt-get install cmake
cmake ..
make resource
sudo make install
make package_python
cd package_python
pip install .
cd ~
apt-get install locales
locale-gen en_US.UTF-8
pip install tweepy==3.7.0
# install mecab
wget https://bitbucket.org/eunjeon/mecab-ko/downloads/mecab-0.996-ko-0.9.2.tar.gz
tar xvfz mecab-0.996-ko-0.9.2.tar.gz
cd mecab-0.996-ko-0.9.2
./configure
make
make check
sudo make install
sudo ldconfig
cd ~
wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.1.1-20180720.tar.gz
tar xvfz mecab-ko-dic-2.1.1-20180720.tar.gz
cd mecab-ko-dic-2.1.1-20180720
./configure
make
sudo make install
cd ~
mecab -d /usr/local/lib/mecab/dic/mecab-ko-dic
apt install curl
apt install git
bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)
pip install mecab-python

Custom: no Mecab installation required

Augmentation is run for every label whose sample count is less than 0.4 × the number of no_relation samples: the sentence is split on spaces (' '), the tokens belonging to each entity are re-joined, and AEDA is applied.

"aeda" : "custom"

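AEDA itself simply inserts random punctuation between words; a minimal sketch of the core operation (the repo's version additionally handles Mecab tokenization and the entity spans as described above):

```python
import random

PUNCTUATIONS = [".", ",", "!", "?", ";", ":"]

def aeda(sentence, ratio=0.3):
    """Insert between 1 and ratio*len(words) random punctuation marks at
    random positions in the sentence."""
    words = sentence.split(" ")
    n_insert = random.randint(1, max(1, int(ratio * len(words))))
    for _ in range(n_insert):
        pos = random.randint(0, len(words))
        words.insert(pos, random.choice(PUNCTUATIONS))
    return " ".join(words)
```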
aug family

Augmentation for the family-relation labels that the model tends to confuse.

"aug_family" : true

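One common way to augment such labels, and only a guess at what this option does, is to swap the subject and object entities and map the label to its inverse (KLUE-RE defines per:children / per:parents plus the symmetric per:siblings, per:spouse, per:other_family):

```python
# Hypothetical inverse map for the family-relation labels (illustrative only).
FAMILY_INVERSE = {
    "per:children": "per:parents",
    "per:parents": "per:children",
    "per:siblings": "per:siblings",
    "per:spouse": "per:spouse",
    "per:other_family": "per:other_family",
}

def augment_family(subject_entity, object_entity, label):
    """Swap the entity roles and flip the label to its inverse, yielding one
    extra training example per family-relation sample."""
    if label not in FAMILY_INVERSE:
        return None
    return object_entity, subject_entity, FAMILY_INVERSE[label]
```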
typed entity

Typed entity markers, following "An Improved Baseline for Sentence-level Relation Extraction" by Wenxuan Zhou and Muhao Chen.

"type_ent_marker" : true

typed punct

Typed entity markers (punct variant), following "An Improved Baseline for Sentence-level Relation Extraction" by Wenxuan Zhou and Muhao Chen.

"type_punct" : true

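The two options correspond to the "typed entity marker" and "typed entity marker (punct)" input formats from that paper; a sketch of both, with the caveat that the exact marker tokens used in this repo may differ:

```python
def typed_entity_marker(sentence, subj, obj, subj_type, obj_type):
    """Typed entity marker: '<S:PER> subj </S:PER>' and '<O:PER> obj </O:PER>'."""
    sentence = sentence.replace(subj, f"<S:{subj_type}> {subj} </S:{subj_type}>")
    return sentence.replace(obj, f"<O:{obj_type}> {obj} </O:{obj_type}>")

def typed_punct_marker(sentence, subj, obj, subj_type, obj_type):
    """Typed entity marker (punct): '@ * per * subj @' and '# ^ per ^ obj #'."""
    sentence = sentence.replace(subj, f"@ * {subj_type.lower()} * {subj} @")
    return sentence.replace(obj, f"# ^ {obj_type.lower()} ^ {obj} #")
```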
6. Config Arguments

Wandb

RoBERTa-large configuration:

| Argument | DataType | Default | Help |
| --- | --- | --- | --- |
| name | str | "roberta_large_stratified" | Wandb run name |
| tags | list | ["ROBERT_LARGE", "stratified", "10epoch"] | Wandb tags |
| group | str | "ROBERT_LARGE" | Wandb group name |

XLM-RoBERTa-large configuration:

| Argument | DataType | Default | Help |
| --- | --- | --- | --- |
| name | str | "XLM-RoBERTa-large" | Wandb run name |
| tags | list | ["XLM-RoBERTa-large", "stratified", "10epoch"] | Wandb tags |
| group | str | "XLM-RoBERTa-large" | Wandb group name |
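These values are typically forwarded to `wandb.init`; a sketch using the first configuration (the project name is an assumption):

```python
import wandb

wandb.init(
    project="klue-re",  # assumed project name
    name="roberta_large_stratified",
    tags=["ROBERT_LARGE", "stratified", "10epoch"],
    group="ROBERT_LARGE",
)
```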

Focal Loss

| Argument | DataType | Default | Help |
| --- | --- | --- | --- |
| true | bool | false | whether to use focal loss |
| alpha | float | 0.1 | balancing factor of the focal loss |
| gamma | float | 0.25 | focusing parameter that smoothly adjusts the down-weighting rate |

Train Arguments

| Argument | DataType | Default | Help |
| --- | --- | --- | --- |
| output_dir | str | "./results" | results directory |
| save_total_limit | int | 10 | maximum number of saved checkpoints |
| save_steps | int | 100 | checkpoint saving interval (steps) |
| num_train_epochs | int | 3 | number of training epochs |
| learning_rate | float | 5e-5 | learning rate |
| per_device_train_batch_size | int | 38 | train batch size |
| per_device_eval_batch_size | int | 38 | evaluation batch size |
| warmup_steps | int | 500 | lr scheduler warm-up steps |
| weight_decay | float | 0.01 | AdamW weight decay |
| logging_dir | str | "./logs" | logging directory |
| logging_steps | int | 100 | logging interval (steps) |
| evaluation_strategy | str | "steps" | evaluation strategy ("epoch" or "steps") |
| eval_steps | int | 100 | evaluation interval (steps) |
| load_best_model_at_end | bool | true | load the best checkpoint (by loss) at the end of training |

| Argument | DataType | Default | Help |
| --- | --- | --- | --- |
| output_dir | str | "./results" | results directory |
| save_total_limit | int | 10 | maximum number of saved checkpoints |
| save_steps | int | 100 | checkpoint saving interval (steps) |
| num_train_epochs | int | 10 | number of training epochs |
| learning_rate | float | 5e-5 | learning rate |
| per_device_train_batch_size | int | 31 | train batch size |
| per_device_eval_batch_size | int | 31 | evaluation batch size |
| warmup_steps | int | 500 | lr scheduler warm-up steps |
| weight_decay | float | 0.01 | AdamW weight decay |
| logging_dir | str | "./logs" | logging directory |
| logging_steps | int | 100 | logging interval (steps) |
| evaluation_strategy | str | "steps" | evaluation strategy ("epoch" or "steps") |
| eval_steps | int | 100 | evaluation interval (steps) |
| load_best_model_at_end | bool | true | load the best checkpoint (by loss) at the end of training |
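These arguments map directly onto HuggingFace `transformers.TrainingArguments`; a sketch using the values from the second table:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    save_total_limit=10,
    save_steps=100,
    num_train_epochs=10,
    learning_rate=5e-5,
    per_device_train_batch_size=31,
    per_device_eval_batch_size=31,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
    load_best_model_at_end=True,
)
```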

7. References

Easy Data Augmentation Paper
Korean WordNet

8. Contributors

나요한_T2073 : https://github.com/nudago
백재형_T2102 : https://github.com/BaekTree
송민재_T2116 : https://github.com/Jjackson-dev
이호영_T2177 : https://github.com/hylee-250
정찬미_T2207 : https://github.com/ChanMiJung
한진_T2237 : https://github.com/wlsl8135/
홍석진_T2243 : https://github.com/HongCu