Errors in Ch. 4.1.2, 4.4.3, 4.4.4

coreaplate commented 1 year ago

Hello,

In 4.1.2, running pd.read_html('상장법인목록.xls') or pd.read_excel('상장법인목록.xlsx') gives misaligned data like the output below, unlike what the book shows.

Also, calling pd.read_html('url')[0] raises this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 144: invalid start byte

In 4.4.3, I get this error: AttributeError: 'NoneType' object has no attribute 'a'

The result for 4.4.4 is shown below.

Is there a way to fix these? Thank you.

I am currently using Windows 11 and Mozilla/5.0.

Eligae commented 1 year ago

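import pandas as pd

def read_krx_code(self):
    """Read the listed-company file from KRX and return it as a DataFrame."""
    url = 'http://kind.krx.co.kr/corpgeneral/corpList.do?method=download&searchType=13'
    # Decode the response as cp949 instead of the default UTF-8.
    krx = pd.read_html(url, header=0, encoding='cp949')[0]
    krx = krx[['종목코드', '회사명']]
    krx = krx.rename(columns={'종목코드': 'code', '회사명': 'company'})
    # Codes are parsed as integers; restore the six-digit, zero-padded form.
    krx.code = krx.code.map('{:06d}'.format)
    return krx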

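This is the code that fetches the KRX listed-company data; it comes up later in Chapter 5. The error is probably because the encoding was not set to 'cp949'.

Also, a DataFrame can look garbled when printed in the terminal, so save it separately with df.to_csv('test.csv') and check the file instead.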
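
To illustrate the encoding point, here is a minimal sketch (my own illustration, not from the book) that reproduces the same kind of UnicodeDecodeError without any network access. The sample string is just the two column names used in read_krx_code; it is encoded as cp949 and then decoded once as utf-8 and once as cp949.

```python
# Illustration only: Korean text encoded as cp949 is generally not valid UTF-8.
sample = "종목코드, 회사명"
raw = sample.encode("cp949")  # bytes as a cp949 source would deliver them

try:
    raw.decode("utf-8")  # decoding with the wrong codec
    utf8_failed = False
except UnicodeDecodeError as err:
    utf8_failed = True
    print("utf-8 decode failed:", err.reason)

decoded = raw.decode("cp949")  # the matching codec restores the original text
print(decoded)
```

The same bytes are fine once the right codec is named, which is what encoding='cp949' does in the pd.read_html call.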

Eligae commented 1 year ago

For reference, when I crawled or downloaded government-related data, most of it was encoded as cp949 rather than utf-8.
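
If you do not know in advance which of the two encodings a file uses, one option is to try utf-8 first and fall back to cp949. The helper below is a hypothetical sketch (the name decode_with_fallback and its defaults are my own, not part of pandas or the book).

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "cp949")) -> str:
    """Try each encoding in order and return the first successful decode."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this codec did not fit; try the next one
    raise ValueError(f"could not decode with any of: {encodings}")

# The same text decodes correctly whichever of the two encodings produced the bytes.
print(decode_with_fallback("회사명".encode("cp949")))
print(decode_with_fallback("회사명".encode("utf-8")))
```

This is only a heuristic: very short cp949 byte strings can happen to be valid utf-8 too. When you already know the source is cp949, passing the encoding explicitly is the safer choice.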

coreaplate commented 1 year ago

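Thank you. Thanks to your help, the garbled terminal output has improved.

Web scraping from the website still raises an error... I searched Stack Overflow and also posted a question there, but I have not found an answer so far.

Have a nice weekend.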

Eligae commented 1 year ago

from bs4 import BeautifulSoup as bs
from urllib.request import urlopen, Request

url = 'https://finance.naver.com/item/sise_day.nhn?code=000020'
# Browser-like User-Agent header, sent with the request.
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
request = Request(url, headers=HEADERS)
with urlopen(request) as doc:
    html = bs(doc, 'lxml')
    # <td class="pgRR"> holds the link to the last page.
    pgrr = html.find('td', class_='pgRR')
    print(pgrr.a['href'])

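I tried the same thing and also got None. The code itself is correct; I think it happens because you did not add a header. That said, this approach gets a bit tedious, so I would recommend simply using requests instead.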

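import requests
from bs4 import BeautifulSoup
code = '000020'  # example stock code (assumed; same as in the URL above)
url = f"https://finance.naver.com/item/sise_day.nhn?code={code}&page=1"
html = BeautifulSoup(requests.get(url, headers={'User-agent': 'Mozilla/5.0'}).text, "lxml")
pgrr = html.find("td", class_="pgRR")
# The href ends with "page=<n>", so the text after the last '=' is the last page number.
s = str(pgrr.a["href"]).split('=')
lastpage = s[-1]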
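
As a side note, splitting on '=' works as long as page is the last query parameter. A slightly more defensive sketch reads the page parameter by name with urllib.parse from the standard library. The function name last_page_from_href and the sample href are my own assumptions; the href is a made-up value in the shape the pgRR link appears to have, not real output.

```python
from urllib.parse import urlparse, parse_qs

def last_page_from_href(href: str) -> int:
    """Return the value of the `page` query parameter as an integer."""
    query = parse_qs(urlparse(href).query)
    return int(query["page"][0])

sample_href = "/item/sise_day.nhn?code=000020&page=385"  # hypothetical href
lastpage = last_page_from_href(sample_href)
print(lastpage)
```

This also returns an int directly, so it can be used in range() without a separate conversion.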
