[Wiki] Feature Engineering

Background

We surveyed related material to identify which features in the input data are most likely to lead to meaningful results.

Notes

Feature Engineering

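The practice of adding new variables to the data to maximize the performance of a machine learning model
- Goals: simplify and speed up data transformation, and improve model performance
Ways to find errors in the data
- Domain knowledge
- Visualization
- Statistical analysis
Components
- Feature Creation: adding new variables that help the model. This covers both adding and removing variables.
- Transformation: changing the representation of a feature into another form.
- Feature Extraction: compressing the data to a size the algorithm can handle without distorting the correlations or important information in the original data.
- EDA: analyzing patterns in the data.
- Benchmark: a reference model against which the performance of a new model can be judged in relative terms.
The work of designing artificial features so that an algorithm can use them
- The final goal is to optimize the resulting dataset so that it reflects every important factor affecting the business problem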
Feature Engineering Techniques for ML

Missing Value

Imputation

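Removal: delete the entries that contain missing values.
With a small dataset, this further shrinks the amount of data available for training.

Remove a column or row only when its missing-value rate reaches a chosen threshold.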

```python
threshold = 0.7

# Keep only the columns whose missing-value rate is below the threshold
data = data[data.columns[data.isnull().mean() < threshold]]

# Keep only the rows whose missing-value rate is below the threshold
data = data.loc[data.isnull().mean(axis=1) < threshold]
```
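
To make the threshold rule concrete, here is a minimal sketch on a toy DataFrame. The column names `a`, `b`, `c` and their values are made up for illustration and do not come from the project data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" is 80% missing, row 4 is empty in "a" and "c"
data = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, np.nan],
    "b": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, 4.0, np.nan],
})

threshold = 0.7

# Step 1: keep columns whose share of missing values is below the threshold
col_missing_rate = data.isnull().mean()
kept_cols = data.loc[:, col_missing_rate < threshold]

# Step 2: on the remaining columns, keep rows whose share of missing values is below the threshold
row_missing_rate = kept_cols.isnull().mean(axis=1)
cleaned = kept_cols.loc[row_missing_rate < threshold]

print(cleaned)
```

Dropping columns first changes the per-row missing rate, so the order of the two steps affects which rows survive.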

Numerical Imputation: fill the missing values with a specific value.

The difficulty is choosing a default value. If no sensible default exists, the column median is recommended.

```python
# Fill all missing values with 0
data = data.fillna(0)

# Fill missing values with the median of each numeric column
data = data.fillna(data.median(numeric_only=True))
```
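
The following sketch compares the two fills on a small, invented DataFrame (the `age` and `score` columns are hypothetical). It is meant only to show why the median is a reasonable fallback when a column contains an extreme value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
data = pd.DataFrame({
    "age": [20.0, np.nan, 30.0, 100.0],
    "score": [1.0, 2.0, np.nan, 3.0],
})

# Constant fill: simple, but 0 may not be a plausible age
zero_filled = data.fillna(0)

# Median fill: each column uses its own median (30.0 for age, 2.0 for score)
median_filled = data.fillna(data.median(numeric_only=True))

print(median_filled)
```

Here the mean of `age` would be 50 because of the single value 100, while the median stays at 30.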

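Categorical Imputation: for a categorical feature, filling with the most frequent value works well. If the values are roughly uniformly distributed, assigning missing entries to a separate category such as "Other" is also reasonable.
```
# Fill a categorical column with its most frequent value
most_frequent = data['column_name'].value_counts().idxmax()
data['column_name'] = data['column_name'].fillna(most_frequent)
```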
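
A short sketch of both options, most-frequent fill and a separate "Other" category, on a hypothetical `color` column:

```python
import pandas as pd

# Hypothetical categorical column with two missing entries
data = pd.DataFrame({"color": ["red", "blue", "red", None, "red", None]})

# Option 1: fill with the most frequent value (value_counts ignores missing values)
most_frequent = data["color"].value_counts().idxmax()
data["mode_filled"] = data["color"].fillna(most_frequent)

# Option 2: put missing entries into their own category
data["other_filled"] = data["color"].fillna("Other")

print(data)
```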

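Handling Outliers: how much outliers matter depends on the model (linear regression, for example, is very sensitive to them).

Detection with Standard Deviation

A value can be treated as an outlier if its distance from the mean is greater than x * standard deviation.

x is usually somewhere between 2 and 4.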


```python
# Drop the outlier rows using the standard deviation
factor = 3
upper_lim = data['column'].mean() + data['column'].std() * factor
lower_lim = data['column'].mean() - data['column'].std() * factor

data = data[(data['column'] < upper_lim) & (data['column'] > lower_lim)]
```
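
As a quick check of the rule above, this sketch applies it to an invented column of 21 values in which a single value (100) lies far from the rest. The data is hypothetical.

```python
import pandas as pd

# Hypothetical data: twenty typical values and one extreme value
data = pd.DataFrame({"column": [10] * 10 + [12] * 10 + [100]})

factor = 3
mean = data["column"].mean()
std = data["column"].std()  # pandas uses the sample standard deviation (ddof=1)

upper_lim = mean + factor * std
lower_lim = mean - factor * std

# Keep only the rows strictly inside the band around the mean
filtered = data[(data["column"] < upper_lim) & (data["column"] > lower_lim)]

print(len(data), "->", len(filtered))
```

Note that the outlier itself inflates the mean and the standard deviation, so on very small samples this rule may fail to flag anything.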

Detection with Percentiles

Treat the values that fall within a given percentage of the top or bottom of the distribution as outliers.

```python
# Drop the outlier rows using percentiles
upper_lim = data['column'].quantile(.95)
lower_lim = data['column'].quantile(.05)

data = data[(data['column'] < upper_lim) & (data['column'] > lower_lim)]
```

Outlier Dilemma: Drop vs Cap
Dropping outliers reduces the size of the training data.

Capping outliers changes the distribution of the training data.

```python
# Cap the outliers at the 5th and 95th percentiles
upper_lim = data['column'].quantile(.95)
lower_lim = data['column'].quantile(.05)
data['column'] = data['column'].clip(lower=lower_lim, upper=upper_lim)
```
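
The trade-off between dropping and capping can be seen side by side in this sketch, which uses the integers 1 to 100 as a stand-in for a real column:

```python
import pandas as pd

# Hypothetical data: the integers 1..100
data = pd.DataFrame({"column": range(1, 101)})

upper_lim = data["column"].quantile(0.95)
lower_lim = data["column"].quantile(0.05)

# Drop: rows outside the percentile band are removed, so the dataset shrinks
dropped = data[(data["column"] < upper_lim) & (data["column"] > lower_lim)]

# Cap: every row is kept, but extreme values are pulled to the limits
capped = data.assign(column=data["column"].clip(lower=lower_lim, upper=upper_lim))

print(len(dropped), len(capped))
print(capped["column"].min(), capped["column"].max())
```

Dropping keeps the remaining values untouched at the cost of rows. Capping keeps every row but piles values up at the two limits.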

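Replacing values: treat the outliers as missing values and handle them with imputation.
Binning: can be applied to both categorical and continuous variables.
Binning makes the model more robust and helps prevent overfitting, but it costs some performance because information is discarded.

It dilutes fine-grained detail, which has a regularizing effect.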


```python
import numpy as np
import pandas as pd

# Numerical binning example
data = pd.DataFrame({'value': [2, 45, 7, 85, 28]})
data['bin'] = pd.cut(data['value'], bins=[0, 30, 70, 100], labels=["Low", "Mid", "High"])
```

```
   value   bin
0      2   Low
1     45   Mid
2      7   Low
3     85  High
4     28   Low
```

```python
# Categorical binning example
data = pd.DataFrame({'Country': ['Spain', 'Chile', 'Australia', 'Italy', 'Brazil']})

conditions = [
    data['Country'].str.contains('Spain'),
    data['Country'].str.contains('Italy'),
    data['Country'].str.contains('Chile'),
    data['Country'].str.contains('Brazil')]

choices = ['Europe', 'Europe', 'South America', 'South America']

data['Continent'] = np.select(conditions, choices, default='Other')
```

```
     Country      Continent
0      Spain         Europe
1      Chile  South America
2  Australia          Other
3      Italy         Europe
4     Brazil  South America
```

Log Transform: converts a skewed distribution into a normal or less-skewed distribution.

It normalizes differences in magnitude, which reduces the effect of outliers.

It must be applied only to positive data.

```python
import numpy as np
import pandas as pd

# Log transform example
data = pd.DataFrame({'value': [2, 45, -23, 85, 28, 2, 35, -12]})

# log(x+1) is undefined for x <= -1, so the negative values become NaN
data['log(x+1)'] = (data['value'] + 1).transform(np.log)

# Handling negative values: shift by the minimum before taking the log
# Note that the resulting values are different
data['log(x-min(x)+1)'] = (data['value'] - data['value'].min() + 1).transform(np.log)
```

```
   value  log(x+1)  log(x-min(x)+1)
0      2   1.09861          3.25810
1     45   3.82864          4.23411
2    -23       nan          0.00000
3     85   4.45435          4.69135
4     28   3.36730          3.95124
5      2   1.09861          3.25810
6     35   3.58352          4.07754
7    -12       nan          2.48491
```

One-hot Encoding: represents each element of a finite set by its own index. Only the index that matches the value is set to 1, and all others are 0.
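
One common way to do this in pandas is `pd.get_dummies`. The sketch below uses a hypothetical `color` column. It illustrates the idea and is not necessarily the encoding used in the project.

```python
import pandas as pd

# Hypothetical categorical column
data = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One column per distinct value; exactly one of them is 1 in each row
encoded = pd.get_dummies(data["color"], dtype=int)
data = data.join(encoded)

print(data)
```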
Grouping Operations
- Categorical Column Grouping
  - Use the most frequent value
    ```
    # For each id, keep the most frequent value of every column
    data.groupby('id').agg(lambda x: x.value_counts().index[0])
    ```
  - Pivot table
  - Apply group by after one-hot encoding
- Numerical Column Grouping
  - Group using the sum or the mean
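
The grouping options above can be sketched together on a small, invented table. The `id`, `city`, and `amount` columns are hypothetical.

```python
import pandas as pd

# Hypothetical transaction-style data: several rows per id
data = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "city": ["Roma", "Madrid", "Madrid", "Istanbul", "Istanbul"],
    "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
})

# Categorical column: most frequent value; numerical column: sum and mean
grouped = data.groupby("id").agg(
    top_city=("city", lambda s: s.value_counts().index[0]),
    amount_sum=("amount", "sum"),
    amount_mean=("amount", "mean"),
)

# Pivot table: one column per city, holding the summed amount for each id
pivot = pd.pivot_table(
    data, index="id", columns="city", values="amount", aggfunc="sum", fill_value=0
)

print(grouped)
print(pivot)
```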
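Feature Split
- Splits string columns or data that is not tidy into separate parts
- Makes binning and grouping possible
- Can surface latent information, so an improvement in model performance can be expected.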
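
As an illustration of a feature split, the sketch below separates a name and a year from a single string column. The `title` column and its "Name (Year)" format are assumptions made for the example.

```python
import pandas as pd

# Hypothetical column that packs two pieces of information into one string
data = pd.DataFrame({"title": ["Toy Story (1995)", "Babe (1995)", "Pulp Fiction (1994)"]})

# Named groups become columns: everything before " (YYYY)" is the name
parts = data["title"].str.extract(r"^(?P<name>.*) \((?P<year>\d{4})\)$")
data["name"] = parts["name"]
data["year"] = parts["year"].astype(int)

print(data)
```

Once `year` is a numeric column of its own, it can be binned or used as a grouping key.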
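Scaling: scaling is a broad and difficult topic in ML. After scaling, continuous variables end up with similar ranges.
- Normalization: scales values to [0,1]. The shape of the feature's distribution is unchanged, but because the minimum and maximum define the range, a single outlier compresses the remaining values into a narrow band, so the effect of outliers grows. Handling outliers before normalizing is recommended.
- Standardization (z-score normalization): a scaling method that takes the standard deviation into account. Features with different standard deviations end up with different ranges, which reduces the effect of outliers compared with normalization. The result has mean 0 and variance 1. It is computed by subtracting the mean and dividing by the standard deviation.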
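
Both scalings can be written directly in pandas. The sketch below reuses the values from the log transform example; the column names `normalized` and `standardized` are chosen for illustration.

```python
import pandas as pd

data = pd.DataFrame({"value": [2, 45, -23, 85, 28, 2, 35, -12]})
col = data["value"]

# Min-max normalization: (x - min) / (max - min), result lies in [0, 1]
data["normalized"] = (col - col.min()) / (col.max() - col.min())

# Standardization: (x - mean) / std, result has mean 0 and unit standard deviation
# pandas' std() is the sample standard deviation (ddof=1)
data["standardized"] = (col - col.mean()) / col.std()

print(data)
```

In a train/test setting the minimum, maximum, mean, and standard deviation would normally be computed on the training data only and then reused on the test data.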
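Tools
FeatureTools: excels at turning temporal and relational data into a feature matrix.
- Easy to use
- Structures features that are useful for ML and predictive models.
- Works well together with relational databases.
AutoFeat: helps with automated feature selection for linear prediction models
- Handles categorical data with one-hot encoding
- Provides models with a scikit-learn-like interface
- Not well suited to relational data.
- Works well with logistical data
TsFresh: a Python package.
- Open source, and strong on time-series classification and regression problems
- Can extract features such as the number of peaks, the mean, the maximum, and time-reversal symmetry statistics
- Can be integrated with FeatureTools
OneBM
- Supports both relational and non-relational data
- Compared with FeatureTools, generates both simple and complex features
- When tested on Kaggle competitions, it outperformed state-of-the-art models
ExploreKit
- Learns meta-information about the data to rank features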

Ideas

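What if missing values were filled with the predictions of a model trained on combinations of the other features?
- Would that overfit to the features used to predict the missing values?
Might there be an activation function that fits each feature's value distribution?
- What about splitting the feature set into subsets by characteristic and building models that use a different activation function for each subset?
Learn more about Tidy Data: https://r2bit.com/book_viz/tidy-data.html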
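
One possible way to try the first idea is sketched below: fit a simple linear model on the rows where the value is known, then predict the missing rows. Everything here is an assumption made for illustration (the columns `a`, `b`, `target`, the exactly linear toy data, and the choice of least squares). It does not answer the overfitting question, which would need a held-out evaluation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "target" is an exact linear function of "a" and "b"
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
})
df["target"] = 2.0 * df["a"] + 0.5 * df["b"] + 1.0
df.loc[[1, 4], "target"] = np.nan  # simulate missing values

known = df["target"].notna().to_numpy()

# Design matrix with an intercept column
X = np.column_stack([df[["a", "b"]].to_numpy(), np.ones(len(df))])

# Fit ordinary least squares on the rows where the target is known
coef, *_ = np.linalg.lstsq(X[known], df.loc[known, "target"].to_numpy(), rcond=None)

# Fill the missing entries with the model's predictions
df.loc[~known, "target"] = X[~known] @ coef

print(df)
```

Because the toy target is exactly linear, the filled values match the true ones. On real data the imputed values carry the model's error and are correlated with the predictor features by construction.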

boostcampaitech6 / level1-bookratingprediction-recsys-02