bohyuncho commented 3 years ago

Having learned deep learning right after visualization, I boldly declared, "Let's add that and write a post!" And then my weekend was held hostage.

jhk0530 commented 3 years ago

bohyuncho commented 3 years ago

Posting it now T_T

bohyuncho commented 3 years ago

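You may remember the author who, on the principle that ignorance makes you brave, built a COVID-19 visualization in three days and was then carried off to the emergency room. (If you don't, go and see why that trip was well earned 😆) See the previous post.

Contrary to my self-satisfaction ("visualization after just one month, not bad?!"), the boot camp schedule was no pushover, and next up was machine learning?! See the Code States curriculum.

So let's walk through the "COVID-19 + machine learning tutorial" that nearly sent me to the afterlife(?) once again.

The code assumes you are working in Colab. If you have questions, spot mistakes, or have other tips, please email me 😘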

Before loading the data! Setting up the environment

Data sources:

https://github.com/laxmimerit/Covid-19-Preprocessed-Dataset/

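https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks

⚠️⚠️ Caution! ⚠️⚠️ Unlike a local Jupyter notebook, Colab starts from a fresh environment every session, so packages have to be installed again each time. Install what you need up front with `!pip`, then load it with `import`.
This post sets things up as follows.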
```python
# Install the required packages and libraries
# Upgrade scikit-learn so that best_score_ is available
!pip install -U scikit-learn

# Install category_encoders
!pip install -q category_encoders

# Install pandas_profiling
!pip install -q pandas-profiling==2.7.1

# Install folium for map visualization
!pip install folium

# Install plotly for a variety of visualizations
!pip install plotly

# The plotly/folium versions preinstalled on Colab are old and cause errors later in the treemap
!pip install --upgrade plotly
!pip install --upgrade folium

# Install chart_studio
!pip install chart_studio

# With cufflinks installed, pandas can hand data straight to plotly to draw graphs
!pip install cufflinks
```
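Since Colab starts fresh each session, it can help to confirm what actually got installed before moving on. The following is only a sketch using the standard library's `importlib.metadata`; the `PACKAGES` list and the helper name `installed_versions` are my own assumptions that mirror the installs above, not part of the original post.

```python
from importlib import metadata

# Package names as published on PyPI; this list simply mirrors the installs above.
PACKAGES = ["scikit-learn", "category_encoders", "pandas-profiling",
            "folium", "plotly", "chart_studio", "cufflinks"]

def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name:20s} {version or 'NOT INSTALLED'}")
```

Anything reported as NOT INSTALLED is a hint to rerun the matching `!pip install` cell.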


```python
# Load the required libraries

# Libraries for numerical computation
import pandas as pd
import numpy as np
import scipy as sp
from scipy import stats
import math
import random
from datetime import timedelta

# Libraries for plotting
from matplotlib import pyplot as plt
import seaborn as sns
sns.set()
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import chart_studio
import folium

%matplotlib inline

# Libraries for fitting linear models
import statsmodels.formula.api as smf
import statsmodels.api as sm

# Number of decimal places to display
%precision 3

# Sharper (retina) figures; this setting alone does not fix broken Korean glyphs
%config InlineBackend.figure_format = 'retina'
```

One more tip picked up while working on the animated visualizations: an easy way to display Plotly figures directly in the Google Colab output cell.
```python
# If the figure does not render inline in Colab, another option is to save it
# with fig.write_html("file.html") and open the saved file.

from IPython.display import display, HTML
from plotly.offline import iplot, init_notebook_mode

def configure_plotly_browser_state():
    """Point requirejs at the Plotly bundle so figures render in the output cell."""
    display(HTML('''
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-1.5.1.min.js?noext',
            },
          });
        </script>
        '''))

# Apply the function
configure_plotly_browser_state()
init_notebook_mode(connected=False)
```

The data scientist's fate: cleaning the data! [Basic EDA]

```python
# How to pull the dataset, which is updated upstream on a regular basis

import shutil
import plotly.offline as pyo
pyo.init_notebook_mode(connected=True)

# Remove any previous clone so that git clone fetches a fresh copy
# (ignore_errors=True makes this a no-op when the folder does not exist)
shutil.rmtree("Covid-19-Preprocessed-Dataset", ignore_errors=True)

# Fetch the data
!git clone https://github.com/laxmimerit/Covid-19-Preprocessed-Dataset.git

# Load the dataset
df = pd.read_csv('Covid-19-Preprocessed-Dataset/preprocessed/covid_19_data_cleaned.csv', parse_dates=['Date'])
country_daywise = pd.read_csv('Covid-19-Preprocessed-Dataset/preprocessed/country_daywise.csv', parse_dates=['Date'])
countrywise = pd.read_csv('Covid-19-Preprocessed-Dataset/preprocessed/countrywise.csv')
daywise = pd.read_csv('Covid-19-Preprocessed-Dataset/preprocessed/daywise.csv', parse_dates=['Date'])

# Check
print("> covid_19_data_cleaned.csv : \n ", df.head(2), '\n')
print("> country_daywise.csv : \n ", country_daywise.head(2), '\n')
print("> countrywise.csv : \n ", countrywise.head(2), '\n')
print("> daywise.csv : \n ", daywise.head(2), '\n')

# Fill NA
df['Province/State'] = df['Province/State'].fillna("")

# Check for null values
# print('A') = print A
print("> covid_19_data_cleaned.csv = df \n ", df.isnull().sum(), '\n')
print("> country_daywise.csv = country_daywise \n ", country_daywise.isnull().sum(), '\n')
print("> countrywise.csv = countrywise \n ", countrywise.isnull().sum(), '\n')
print("> daywise.csv = daywise \n ", daywise.isnull().sum(), '\n')
# Confirmed: no null values anywhere
# If there are any, it is a good idea to check the documentation on the site the data came from

# Check the column info
# (info() prints by itself and returns None, so it is called on its own line)
for name, frame in [("df", df), ("country_daywise", country_daywise),
                    ("countrywise", countrywise), ("daywise", daywise)]:
    print(">", name)
    frame.info()
    print()
```

```python
# Group by date (select the column first so only that column is summed)
confirmed = df.groupby('Date')['Confirmed'].sum().reset_index()
recovered = df.groupby('Date')['Recovered'].sum().reset_index()
deaths = df.groupby('Date')['Deaths'].sum().reset_index()

# Check
print('head', '\n ')
print("> confirmed : \n ", confirmed.head(2), '\n')
print("> recovered : \n ", recovered.head(2), '\n')
print("> deaths : \n ", deaths.head(2), '\n')
print('\n ', 'tail')
print("> confirmed : \n ", confirmed.tail(2), '\n')
print("> recovered : \n ", recovered.tail(2), '\n')
print("> deaths : \n ", deaths.tail(2), '\n')

# What if you only want one particular country?
k_df = df.query('Country == "Korea, South"')
k_df
```
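To see what the grouping and filtering steps above do without downloading anything, here is a small sketch on a toy frame. The rows are made up and only imitate the column layout I assume for `covid_19_data_cleaned.csv`; the names `toy`, `toy_confirmed` and `toy_korea` are hypothetical.

```python
import pandas as pd

# Toy rows (made-up numbers) that only imitate the real column layout
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02"]),
    "Country": ["Korea, South", "Italy", "Korea, South", "Italy"],
    "Confirmed": [10, 5, 15, 9],
    "Deaths": [1, 0, 2, 1],
})

# One row per date: totals across all countries
toy_confirmed = toy.groupby("Date")["Confirmed"].sum().reset_index()
print(toy_confirmed)

# Rows for a single country; the comma in the name is fine inside the quotes
toy_korea = toy.query('Country == "Korea, South"')
print(toy_korea)
```

Selecting the column before calling `sum()` keeps the aggregation limited to that one numeric column.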

Try checking the other outputs yourself 🥰

```python
# Make sure the Date column is a datetime so it can be used as a date
# (harmless if it was already parsed by parse_dates)
df['Date'] = pd.to_datetime(df['Date'])
df.info()
```

Filtering the most recent data

```python
# Selecting several columns after groupby needs a list, hence the double brackets
temp = df.groupby('Date')[['Confirmed', 'Deaths', 'Recovered', 'Active']].sum().reset_index()
print("> all dates / tail \n", temp.tail(1), '\n')
temp = temp[temp['Date'] == temp['Date'].max()].reset_index(drop=True)
print("> latest date only / head \n", temp.head(1), '\n')

tm = temp.melt(id_vars='Date', value_vars=['Active', 'Deaths', 'Recovered'])
print("> tm.head \n", tm.head(1), '\n')
print("> tm.dtypes \n", tm.dtypes)
```
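If `melt` feels opaque, this sketch shows the wide-to-long reshaping on a single made-up "latest day" row. The frame `latest` and its numbers are placeholders I invented to mirror the shape of `temp`.

```python
import pandas as pd

# A single "latest day" row with made-up numbers, shaped like `temp`
latest = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-02"]),
    "Confirmed": [24],
    "Deaths": [3],
    "Recovered": [6],
    "Active": [15],
})

# Wide -> long: one row per (Date, variable) pair
long_form = latest.melt(id_vars="Date", value_vars=["Active", "Deaths", "Recovered"])
print(long_form)
```

The result has the columns `Date`, `variable` and `value`, which is the layout that plotting helpers such as Plotly Express generally expect when one column names the category and another holds the number.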


- Checking the datasets we have built so far
```python
# Check the datasets
# (info() prints by itself, so it is called rather than passed to print)
print("> df.head : ", df.head(3))
df.info()
print("> country_daywise.head : ", country_daywise.head(2))
country_daywise.info()
print("> daywise.head : ", daywise.head(2))
daywise.info()
```

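help!!!!!!!!!

With time suddenly running short, my mind went completely blank T_T

BGM

I need to merge those datasets into `covid` and run deep learning on it, but I'm confused T_T

Anyway, let's pretend they have been merged and carry on....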
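The post never shows how `covid` is actually built, so the following is only one possible sketch and not the author's method. It assumes `covid` means "one row per date and country, with provinces summed together", and it derives `Active` as Confirmed minus Recovered minus Deaths, which is also an assumption. The frame `toy_df` is made-up data standing in for `df`.

```python
import pandas as pd

# Toy stand-in for `df` (covid_19_data_cleaned.csv); all values are made up
toy_df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-01"] * 3 + ["2020-03-02"] * 3),
    "Province/State": ["Hubei", "Beijing", "", "Hubei", "Beijing", ""],
    "Country": ["China", "China", "Italy", "China", "China", "Italy"],
    "Confirmed": [100, 20, 5, 110, 22, 9],
    "Recovered": [10, 2, 0, 15, 3, 1],
    "Deaths": [3, 0, 0, 4, 0, 1],
})
count_cols = ["Confirmed", "Recovered", "Deaths"]

# ASSUMPTION: `covid` = one row per (Date, Country), provinces summed together
covid = toy_df.groupby(["Date", "Country"], as_index=False)[count_cols].sum()

# ASSUMPTION: Active is derived here rather than read from the file
covid["Active"] = covid["Confirmed"] - covid["Recovered"] - covid["Deaths"]
print(covid)
```

With the real data, `country_daywise.csv` may already provide something close to this shape, so it is worth comparing the two before choosing one.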

Using pandas profiling

```python
# Witness the EDA miracle that does that whole long process above in one go
from pandas_profiling import ProfileReport

profile = ProfileReport(covid, title="Check basic information")
profile.to_widgets()
profile.to_file(output_file="covid_profiling.html")  # save as HTML
```

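Shall we split the data prepared above into train/validation/test sets?

```python
# One-hot encoding
# Only the categorical column (Country) is encoded. The numeric columns are left
# as they are, so that Deaths is still available as the target later on.
covid_enco = pd.get_dummies(covid, columns=['Country'], prefix='Country', drop_first=True)
covid_enco
```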
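As a quick illustration of what one-hot encoding with `drop_first=True` produces, here is a sketch on a made-up three-country frame. The frame `toy` is hypothetical; only `pd.get_dummies` itself comes from the post.

```python
import pandas as pd

# Made-up rows with one categorical column and two numeric columns
toy = pd.DataFrame({
    "Country": ["Italy", "Korea, South", "Spain", "Italy"],
    "Confirmed": [5, 10, 7, 9],
    "Deaths": [0, 1, 0, 1],
})

# Encode only the categorical column; drop_first=True drops the first level (Italy)
encoded = pd.get_dummies(toy, columns=["Country"], prefix="Country", drop_first=True)
print(encoded.columns.tolist())
```

Dropping the first level means a row with all indicator columns equal to zero stands for Italy, which avoids carrying one redundant column into a linear model.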

```python
from sklearn.model_selection import train_test_split

# Step 1) Split into train / test (80:20)
train, test = train_test_split(covid_enco, test_size=0.2, random_state=3)
# Check
print("> train : ", len(train), "rows", "> test : ", len(test), "rows")
print()
# Step 2) Split train again into train / validation (3:1)
train, val = train_test_split(train, test_size=0.25, random_state=2)
# Check
print("> train : ", len(train), "rows", "> val : ", len(val), "rows")
print("> train.shape : ", train.shape, "> test.shape : ", test.shape, "> val.shape : ", val.shape)
print()
```
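The two-step split above works out to roughly 60% train, 20% validation and 20% test, because 0.25 of the remaining 80% is 20% of the whole. A minimal sketch on 100 placeholder rows (the frame `toy` and the name `train_full` are mine) makes the arithmetic visible:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({"x": range(100)})  # 100 placeholder rows

# Same two steps and the same test_size values as in the post
train_full, test = train_test_split(toy, test_size=0.2, random_state=3)
train, val = train_test_split(train_full, test_size=0.25, random_state=2)

# 0.25 of the remaining 80 rows is 20 rows, leaving 60 for training
print(len(train), len(val), len(test))
```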

```python
# Set the features and the target
# The target of the current model is death (Deaths); all remaining columns become features.
feature = list(covid_enco.columns)
feature.remove("Deaths")
target = "Deaths"

# Check
print("> feature: ", len(feature), "columns")
print(feature)
print()
print("> target: ")
print(target)
```

Let's check that the split was done properly.

```python
# Separate target / features
print("> original : ", covid_enco.shape)
# train set
X_train = train[feature]
y_train = train[target]
print("> train : ", X_train.shape, y_train.shape)
# test set
X_test = test[feature]
y_test = test[target]
print("> test : ", X_test.shape, y_test.shape)
# validation set
X_val = val[feature]
y_val = val[target]
print("> validation : ", X_val.shape, y_val.shape)
```
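One thing to watch: `Deaths` in this dataset is a count, and a classifier would treat every distinct count as its own class. If the goal really is "did any death occur or not", one possible approach is to derive a binary label first. The sketch below is my assumption, not something the post does; the column name `has_deaths` and the toy numbers are hypothetical.

```python
import pandas as pd

# Made-up rows with the numeric columns used in the post
toy = pd.DataFrame({
    "Confirmed": [5, 10, 7, 9],
    "Recovered": [0, 2, 1, 3],
    "Deaths": [0, 1, 0, 2],
})

# HYPOTHETICAL label (not in the original post): 1 if any death was recorded
toy["has_deaths"] = (toy["Deaths"] > 0).astype(int)

target = "has_deaths"
# Drop both the label and the raw count it was derived from, to avoid leakage
feature = [c for c in toy.columns if c not in (target, "Deaths")]
print(feature, toy[target].tolist())
```

Keeping the raw `Deaths` column among the features would let the model read the answer directly, which is why it is excluded here.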

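Setting up a baseline model to compare the classifier against

```python
# Check the most frequent class in the train set
y_train.value_counts()

# Predict the most frequent class for every row
# (mode() returns a Series, so take its first element)
base_value = y_train.mode()[0]
y_train_pred = [base_value] * len(y_train)

# Accuracy
from sklearn.metrics import accuracy_score
accuracy_score(y_train, y_train_pred)
```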
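The same majority-class baseline can also be expressed with scikit-learn's `DummyClassifier`, which may be handy because it plugs into the same `fit`/`predict` workflow as the real model. This is a sketch on made-up labels (`X_toy`, `y_toy`), not the post's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: class 0 is the majority (3 of 4)
X_toy = np.zeros((4, 1))  # the features are ignored by this strategy
y_toy = np.array([0, 0, 0, 1])

# Always predicts the most frequent class seen during fit()
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print(baseline_acc)
```

Any real model should beat this number on the validation set; otherwise it has not learned anything beyond the class balance.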

What if we build the model with scikit-learn's LogisticRegression?

```python
from sklearn.linear_model import LogisticRegression

# Create a LogisticRegression instance
logistic = LogisticRegression(random_state=0, max_iter=10000)
# Fit the LogisticRegression instance
logistic.fit(X_train, y_train)
# Check
print("> accuracy : ", logistic.score(X_val, y_val))
```

What if we try feature scaling?

Checking which kind of scaler should be applied in which situation

```python
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler

# Create a StandardScaler instance
scaler = StandardScaler()
# Standardize so that mean = 0 and var = 1
# (fit on the train set only, then reuse the same statistics on the validation set)
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
```
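Scaling and fitting can also be bundled into a single scikit-learn pipeline, so the scaler is always fitted on the training data and reused automatically at prediction time. The sketch below uses synthetic, clearly separable data (`X_toy`, `y_toy`) as a stand-in; it is an illustration under that assumption, not a result on the COVID data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, clearly separable data standing in for X_train / y_train
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

# The scaler is fitted only on the data passed to fit(), then reused in predict()
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
model.fit(X_toy, y_toy)
toy_acc = model.score(X_toy, y_toy)
print(toy_acc)
```

With the real data this would become `model.fit(X_train, y_train)` followed by `model.score(X_val, y_val)`, with no separate `transform` call to forget.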

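Checking accuracy on the validation set

```python
from sklearn.metrics import accuracy_score

# Fit the logistic regression model on the scaled data
logistic.fit(X_train_scaled, y_train)

# Check
y_val_pred = logistic.predict(X_val_scaled)

print('> accuracy : ', accuracy_score(y_val, y_val_pred))
```

I'd love to interpret the results myself too... would anyone be willing to help....?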
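As a possible starting point for that interpretation, accuracy is usually read next to the baseline and a confusion matrix. The labels below are made up purely to show the mechanics; they are not results from this tutorial.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up validation labels and predictions
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
print(acc)
print(cm)
```

In this toy case the model is right on 4 of 6 rows, which is exactly what always predicting the majority class would achieve, so the accuracy alone says little. The confusion matrix shows where the errors fall: one false positive and one false negative.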

bohyuncho commented 3 years ago

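A friend once told me... that IT work means, depending on how much time you have, making a child with no head, then a child with no legs.... never managing to make a whole child, and then doing maintenance on it... and this blog post is exactly that...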

codestates / ds-blog

[조보현][beginner to beginner] Covid-19 Data Analysis + Machine Learning Tutorial #170
