Jiwonii97 commented 1 year ago

Assignees

@gyubinc @MonteCarlolee

😀 Work description

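Develop a model that summarizes the input text so the summary can be provided as a prompt
Develop keyword extraction on the summary so that only the keywords can be provided as a prompt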

gyubinc commented 1 year ago

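Keyword extraction interim report

Dataset

1,000 of the 10,045 files in the National Institute of Korean Language written-language corpus

File format: JSON (UTF-8 encoded)

Number and size of files: 10,045 files, 4.24 GB

Contents

Copyright-cleared works from 10 genres, including books, magazines, and reports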

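Preprocessing

Removed all characters other than Korean and English with regular expressions
Tokenized with the OKT library and removed stop words
Used the Mecab library to keep only five categories (common nouns, proper nouns, adjectives, roots, English words)
Grouped tokens into bi-grams and tri-grams
Removed terms that appear two times or fewer, and terms with a TF-IDF score below 0.05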
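The regex cleanup, n-gram grouping, and frequency/TF-IDF filtering steps above could be sketched in plain Python roughly as follows. This is only a minimal sketch under assumptions: tokenization and part-of-speech filtering (OKT, Mecab) are assumed to have already produced the token lists, the function names are hypothetical, and the TF-IDF variant (term frequency times log(N/df), applied per document) is an assumption, since the report does not say which one was used.

```python
import math
import re
from collections import Counter

# Anything that is not a Hangul syllable, a Latin letter, or whitespace.
NON_KO_EN = re.compile(r"[^가-힣A-Za-z\s]")


def clean_text(text):
    """Drop non-Korean/non-English characters and collapse whitespace."""
    return re.sub(r"\s+", " ", NON_KO_EN.sub(" ", text)).strip()


def add_ngrams(tokens, sizes=(2, 3)):
    """Return the unigrams followed by bi-grams and tri-grams joined with '_'."""
    out = list(tokens)
    for n in sizes:
        out.extend("_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out


def filter_terms(docs, min_count=3, min_tfidf=0.05):
    """Remove terms seen fewer than min_count times in the corpus (i.e. two
    times or fewer) and terms whose per-document TF-IDF is below min_tfidf.

    TF-IDF here is (count in doc / doc length) * log(N / document frequency),
    which is an assumed variant.
    """
    total = Counter(t for doc in docs for t in doc)
    df = Counter(t for doc in docs for t in set(doc))
    n_docs = len(docs)
    kept = []
    for doc in docs:
        tf = Counter(doc)
        kept.append([
            t for t in doc
            if total[t] >= min_count
            and (tf[t] / len(doc)) * math.log(n_docs / df[t]) >= min_tfidf
        ])
    return kept
```

Note that with this TF-IDF variant a term that occurs in every document gets a score of 0 and is always dropped, which is one reason the threshold behaves like an extra stop-word filter.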

Preprocessing result example

Topic modeling results

5 topics

7 topics

10 topics

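Analysis

Because only 1,000 of the roughly 10,000 files were used, the topics feel somewhat narrow. More fundamentally, topic modeling tends to pull out fine-grained themes within a fairly bounded domain, so it seems ill-suited to separating texts that span broad categories.

Future plans

There are two options. One is to use all 10,000 files; the other is to move from topic modeling to classification.

For the former, I don't expect good results at this point, the data is too large to run on Colab, and moving it to a local machine would probably cost too much time on environment setup.

So how about trying a classification model that predicts the genre instead?

Below is an example of such a classification model.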
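As one illustration of what a genre classifier could look like (a minimal sketch, not the model used in this project), here is a pure-Python softmax (multinomial logistic) regression trained with stochastic gradient descent on bag-of-words counts. The function names, hyperparameters, and toy data are all hypothetical; in practice a library such as scikit-learn would normally be used, and the inputs would be the preprocessed token lists.

```python
import math
from collections import Counter


def predict_proba(doc, weights, biases):
    """Softmax over per-class scores; tokens unseen in training contribute 0."""
    scores = {c: biases[c] + sum(weights[c].get(t, 0.0) for t in doc) for c in weights}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}


def predict(doc, weights, biases):
    """Return the most probable class label for a token list."""
    probs = predict_proba(doc, weights, biases)
    return max(probs, key=probs.get)


def train_softmax(docs, labels, epochs=200, lr=0.5):
    """Fit softmax regression with per-example gradient steps (SGD)."""
    classes = sorted(set(labels))
    vocab = sorted({t for doc in docs for t in doc})
    weights = {c: {t: 0.0 for t in vocab} for c in classes}
    biases = {c: 0.0 for c in classes}
    for _ in range(epochs):
        for doc, label in zip(docs, labels):
            probs = predict_proba(doc, weights, biases)
            counts = Counter(doc)
            for c in classes:
                # Gradient of cross-entropy w.r.t. the class score.
                grad = probs[c] - (1.0 if c == label else 0.0)
                biases[c] -= lr * grad
                for token, n in counts.items():
                    weights[c][token] -= lr * grad * n
    return weights, biases


# Toy data (hypothetical): token lists with a genre label each.
docs = [
    ["dragon", "sword", "magic"],
    ["magic", "castle", "dragon"],
    ["market", "policy", "report"],
    ["policy", "budget", "report"],
]
labels = ["fiction", "fiction", "report", "report"]
weights, biases = train_softmax(docs, labels)
```

Calling `predict(["dragon", "castle"], weights, biases)` then returns the label for a new token list.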

knlpscience commented 1 year ago

Fine-tuning ver.1

I did a quick fine-tuning of google/pegasus-xsum on the AI Hub literature dataset translated with Papago, and uploaded the resulting model, knlpscience/pegasus-ft, to Hugging Face.

Here is a usage example.

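from transformers import PegasusTokenizer, PegasusForConditionalGeneration

MODEL_NAME = "knlpscience/pegasus-ft"
tokenizer = PegasusTokenizer.from_pretrained(MODEL_NAME)
model = PegasusForConditionalGeneration.from_pretrained(MODEL_NAME)

# Text to summarize (placeholder; replace with your own passage)
passage = "..."

# Tokenize, truncating inputs longer than the model's maximum input length
input_ids = tokenizer.encode(passage, return_tensors="pt", add_special_tokens=True, truncation=True)
# Beam search with 8 beams; the generated summary is capped at 128 tokens
outputs = model.generate(input_ids=input_ids, num_beams=8, length_penalty=0.8, max_length=128)
# outputs has shape (1, sequence_length), so decode the first (only) sequence
decoded_summary = tokenizer.decode(outputs[0], skip_special_tokens=True)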

Topic extraction and document summarization server

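I set up a Streamlit test server at http://118.67.133.11:30007/. Upload a .txt file written in Korean and click the "캡션 생성하기" (Generate caption) button, and it generates a summary, a genre, and a caption.

The following file is an example Korean novel: short_novel (한).txt. Drag and drop this .txt file onto the page at the address above to upload it.

Models used

Translation: google trans
Topic: logistic regression
Summarization: knlpscience/pegasus-ft
Emotion: TO DO
Generation: gpt-3.5-turbo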
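The way these stages chain together could be sketched as below. This is a minimal sketch, not the server's actual code: the function and parameter names are hypothetical, the stage order and the prompt wording are assumptions, and each model is passed in as a plain callable so the flow can be run without any external service. The emotion stage is left out because it is still TO DO.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CaptionResult:
    summary: str
    genre: str
    caption: str


def generate_caption(
    korean_text: str,
    translate: Callable[[str], str],       # e.g. a wrapper around the translation service
    classify_genre: Callable[[str], str],  # e.g. a wrapper around the logistic regression model
    summarize: Callable[[str], str],       # e.g. a wrapper around knlpscience/pegasus-ft
    write_caption: Callable[[str], str],   # e.g. a wrapper around a gpt-3.5-turbo call
) -> CaptionResult:
    """Run translation, genre classification, summarization, and caption generation."""
    english_text = translate(korean_text)
    # Assumption: both the classifier and the summarizer read the translated text.
    genre = classify_genre(english_text)
    summary = summarize(english_text)
    # Hypothetical prompt wording; the prompt used by the server is not given.
    prompt = (
        "Write a one-paragraph music caption for a story.\n"
        f"Genre: {genre}\n"
        f"Summary: {summary}"
    )
    return CaptionResult(summary=summary, genre=genre, caption=write_caption(prompt))
```

Passing the models in as callables keeps each stage swappable, for example when an emotion model is added later or the summarizer is replaced by a newer fine-tuned version.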

Here are three caption versions generated from the example novel above.

Nostalgic melody, wistful and melodic, played on piano and violin. The music gently stirs emotions of longing and reflection, like a gentle breeze drifting through memories.
Reflective melody, bittersweet and nostalgic, featuring piano and violin. The composition expresses a mix of longing and contentment, taking the listener on a journey of past memories and cherished moments.
Reflective melody, filled with nostalgia and longing, featuring a tender piano and melancholic violin. The music unfolds with a gentle pace, evoking deep emotions and capturing the essence of introspection.

Because the captions are generated with OpenAI's gpt-3.5-turbo, they can differ on every attempt. The following are examples generated by feeding the captions above into Musicgen-large.

https://github.com/boostcampaitech5/level3_nlp_finalproject-nlp-01/assets/129038718/21a71f22-85af-4efd-a70f-f7db41dc9400

https://github.com/boostcampaitech5/level3_nlp_finalproject-nlp-01/assets/129038718/ef2ff865-4561-4f8f-8064-41fd02f27b85

https://github.com/boostcampaitech5/level3_nlp_finalproject-nlp-01/assets/129038718/a69e69bb-076e-4d7e-9598-4f109ff71fd2

boostcampaitech5 / level3_nlp_finalproject-nlp-01

Summarization model and keyword extraction work #8
