Sejong-Kaggle-Challengers / jeongmin

📌 Sejong Kaggle Challenger 이정민 (Lee Jeongmin) repository

[Week 14] Sejong AI Challenge Problem 2 #13

Open mingxoxo opened 3 years ago

mingxoxo commented 3 years ago

https://www.kaggle.com/c/sejong-ai-challenge-p2/overview

mingxoxo commented 3 years ago

Source: https://bkshin.tistory.com/129

CountVectorizer

When assigning a value to each word feature, using the number of times the word appears in each document, i.e. its count, is called count vectorization. The higher the value, the more important the word is considered to be.

from sklearn.feature_extraction.text import CountVectorizer
corpus = ['you know I want your love. because I love you.']
vector = CountVectorizer()
print(vector.fit_transform(corpus).toarray())  # Record the frequency of each word in the corpus.
print(vector.vocabulary_)  # Show which index was assigned to each word.

Output

[[1 1 2 1 2 1]]
{'you': 4, 'know': 1, 'want': 3, 'your': 5, 'love': 2, 'because': 0}

Because you and love each appear twice, they are treated as the important words. By default, CountVectorizer only recognizes strings of two or more characters as tokens, so I was dropped.
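
The two-character rule comes from the default `token_pattern` of CountVectorizer. As a minimal sketch, assuming you do want single-character words kept, a looser pattern can be passed instead; the pattern below is an illustrative choice, not something the source post uses.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['you know I want your love. because I love you.']

# The default pattern r"(?u)\b\w\w+\b" needs at least two word characters.
# This alternative pattern (an assumption for illustration) also accepts
# single-character tokens such as "I".
vector = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = vector.fit_transform(corpus).toarray()

print(counts)
print(vector.vocabulary_)  # "i" (lowercased by default) now has its own index
```

With this pattern the vocabulary gains the entry i, and its count is 2 for this sentence.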

Stop words can be removed by using the stop-word list that CountVectorizer provides.

vect = CountVectorizer(stop_words="english")
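
A short sketch of what this option does on the same sentence as above. Which words are dropped depends on the built-in English list; here pronouns such as you and your are expected to disappear, while content words such as love remain.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['you know I want your love. because I love you.']

vect = CountVectorizer(stop_words="english")
counts = vect.fit_transform(corpus).toarray()

# Words that survived stop-word filtering, in column order
remaining = sorted(vect.vocabulary_)
print(remaining)
print(counts)
```

A custom list can be passed the same way, for example `stop_words=["you", "your"]`, when the built-in list does not fit the data.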

TF-IDF

Because count-based vectorization treats words with higher counts as more important, words that inevitably appear often in every document, such as stop words, can be judged to be important. TF-IDF (Term Frequency - Inverse Document Frequency) vectorization is used to compensate for this problem.

TF-IDF gives a high weight to words that appear frequently in an individual document, but penalizes words that appear frequently across all documents, so that only words frequent in that particular document end up with high weights. In this way it checks whether a word is actually important. When there are many documents, TF-IDF vectorization is generally used rather than count-based vectorization.
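
For reference, the weighting can be written as below. This is the variant that scikit-learn's TfidfVectorizer applies with its default settings (`smooth_idf=True`, `norm='l2'`); other texts define the IDF term slightly differently. The symbols are introduced here for illustration: tf(t, d) is the count of term t in document d, df(t) is the number of documents containing t, and n is the total number of documents.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1
```

Each document vector is then divided by its Euclidean (L2) norm, so every row has length 1.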


from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'you know I want your love',
    'I like you',
    'what should I do',
]
tfidfv = TfidfVectorizer().fit(corpus)
print(tfidfv.transform(corpus).toarray())
print(tfidfv.vocabulary_)

Output

[[0.         0.46735098 0.         0.46735098 0.         0.46735098 0.         0.35543247 0.46735098]
 [0.         0.         0.79596054 0.         0.         0.         0.         0.60534851 0.        ]
 [0.57735027 0.         0.         0.         0.57735027 0.         0.57735027 0.         0.        ]]
{'you': 7, 'know': 1, 'want': 5, 'your': 8, 'love': 3, 'like': 2, 'what': 6, 'should': 4, 'do': 0}
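
The numbers above can be reproduced by hand. The following is a minimal sketch, assuming scikit-learn's default settings (smoothed IDF and L2 normalization); the variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'you know I want your love',
    'I like you',
    'what should I do',
]

# Raw term counts, shape (n_documents, n_terms)
counts = CountVectorizer().fit_transform(corpus).toarray()

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed IDF (scikit-learn default)

weighted = counts * idf
# L2-normalize each row so every document vector has length 1
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, reference))
```

The word you appears in two of the three documents, so its IDF is lower than that of the other words. That is why its weight in the first row (0.35543247) is smaller than the weights of know, love, want, and your (0.46735098), even though each of them occurs once in that sentence.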

CountVectorizer, a count-based vectorizer, represents words simply by their frequency, whereas TF-IDF vectorization also takes into account how often a word appears in the other sentences when expressing its importance. For this reason, TF-IDF is generally used more often than CountVectorizer.