feat: entity tagging and preprocessing via arguments, plus code merging

presto105 commented 3 years ago

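Reduced the number of UNK tokens produced by the RoBERTa tokenizer by deleting or substituting the characters that map to UNK.
Updated load_data.py, train.py, and inference.py so they run with these changes. Inference must use the same settings as training.
Fixed the errors and the code in the notebook that searches for UNK tokens (UNK_token_text_search.ipynb).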

Added the existing entity tagging and preprocessing code written by 기성 to Preprocessing/preprocessor.py.

Each mode is switched with 0/1, which maps to False/True. To turn on entity_flag:

python train.py --PLM klue/roberta-small --entity_flag 1
python inference.py --PLM klue/roberta-small --entity_flag 1

To turn on preprocessing_flag:

python train.py --PLM klue/roberta-small --preprocessing_flag 1
python inference.py --PLM klue/roberta-small --preprocessing_flag 1

To turn on both:

python train.py --PLM klue/roberta-small --entity_flag 1 --preprocessing_flag 1
python inference.py --PLM klue/roberta-small --entity_flag 1 --preprocessing_flag 1

Use the commands as shown above!
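For reference, the 0/1 => False/True mapping could be wired up with argparse roughly as follows. This is only a minimal sketch: the actual parsing code in train.py and inference.py is not shown in this PR description, so the helper name str2bool, the function build_parser, and all default values are assumptions made for illustration.

```python
import argparse


def str2bool(value: str) -> bool:
    """Map "0"/"1" (or "false"/"true") to a Python bool. Hypothetical helper."""
    lowered = value.strip().lower()
    if lowered in {"1", "true"}:
        return True
    if lowered in {"0", "false"}:
        return False
    raise argparse.ArgumentTypeError(f"expected 0/1 or False/True, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # The flag names come from the commands above; the defaults are assumed.
    parser = argparse.ArgumentParser()
    parser.add_argument("--PLM", type=str, default="klue/roberta-small")
    parser.add_argument("--entity_flag", type=str2bool, default=False)
    parser.add_argument("--preprocessing_flag", type=str2bool, default=False)
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(["--PLM", "klue/roberta-small", "--entity_flag", "1"])
print(args.entity_flag, args.preprocessing_flag)
```

Using a converter function rather than type=bool matters here, because bool("0") is True in Python and would silently enable a flag that was meant to be off.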

Preprocessing details

import re

sentence = re.sub(r'[À-ÿ]+','', sentence) # German and other accented Latin-1 letters
sentence = re.sub(r'[\u0600-\u06FF]+','', sentence)  # Arabic script
sentence = re.sub(r'[ß↔Ⓐب€☎☏±]+','', sentence) # special characters
sentence = re.sub('–','─', sentence)
sentence = re.sub('⟪','《', sentence)
sentence = re.sub('⟫','》', sentence)
sentence = re.sub('･','・', sentence)

In date formats, the subword UNK (– or -) is converted to ~.

e.g. 1223년 – => 1223년 ~ (the subword UNK [–-] is replaced with ~)
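The rules above can be collected into a single function, sketched below. The character substitutions are copied from the list above; the date rule is not shown as code in this description, so its regex (a dash directly after 년, 월, or 일, with optional whitespace) and its position before the other rules are assumptions, as is the function name preprocess_sentence.

```python
import re


def preprocess_sentence(sentence: str) -> str:
    """Apply the UNK-reducing rules to one sentence (illustrative sketch)."""
    # Assumed date rule: a dash that follows a date unit becomes "~".
    # It runs first so that the en dash is not rewritten by the rule below.
    sentence = re.sub(r'(?<=[년월일])(\s*)[–-]', r'\1~', sentence)

    # Rules copied from the preprocessing list above.
    sentence = re.sub(r'[À-ÿ]+', '', sentence)            # accented Latin-1 letters
    sentence = re.sub(r'[\u0600-\u06FF]+', '', sentence)  # Arabic script
    sentence = re.sub(r'[ß↔Ⓐب€☎☏±]+', '', sentence)       # special characters
    sentence = re.sub('–', '─', sentence)
    sentence = re.sub('⟪', '《', sentence)
    sentence = re.sub('⟫', '》', sentence)
    sentence = re.sub('･', '・', sentence)
    return sentence


print(preprocess_sentence("1223년 – 1230년"))
```

Because inference has to match training, the same function would be applied to the text in both train.py and inference.py whenever preprocessing_flag is on.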

UNK tokens remaining after this version's preprocessing

UNK tokens appearing in sentences
UNK tokens appearing in entities
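One way to check the remaining UNK tokens in sentences and entities is sketched below. This is not the code from UNK_token_text_search.ipynb, which is not shown here; the function name count_unk and its parameters are hypothetical, and the tokenizer is passed in as a plain callable so the sketch does not depend on any particular library.

```python
from collections import Counter
from typing import Callable, Iterable, List


def count_unk(
    texts: Iterable[str],
    tokenize: Callable[[str], List[str]],
    unk_token: str = "[UNK]",
) -> Counter:
    """Return a Counter mapping each text to its number of UNK tokens.

    Texts that produce no UNK token are left out of the result.
    """
    counts: Counter = Counter()
    for text in texts:
        n_unk = sum(1 for token in tokenize(text) if token == unk_token)
        if n_unk:
            counts[text] = n_unk
    return counts
```

With Hugging Face transformers, tokenize could be the tokenize method of AutoTokenizer.from_pretrained("klue/roberta-small") and unk_token its unk_token attribute. Running the function once over the sentence column and once over the entity strings gives the two views listed above.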

Yebin46 commented 3 years ago

Having the flags is really nice! Thank you so much for all the hard work.

j961224 commented 3 years ago

Thanks for the write-up and the preprocessing approach!

Once this is merged, I'll build the k-fold on top of it.

boostcampaitech2 / klue-level2-nlp-02

feat: entity tagging and preprocessing via arguments, plus code merging #16
