AttributeError when scraping Celltrion from Naver Finance

INVESTAR / StockAnalysisInPython


AttributeError when scraping Celltrion from Naver Finance #24

Open dlrbcnvk opened 3 years ago

dlrbcnvk commented 3 years ago

While scraping the Celltrion URL on finance.naver.com, I wrote the code exactly as it appears in the book, but it raises AttributeError: 'NoneType' object has no attribute 'a'. How can I fix this?

INVESTAR commented 3 years ago

AttributeError: 'NoneType' object has no attribute 'a'

The error above occurs because Naver blocks web scraping. No web page is actually read, so the a tag cannot be found, and that lookup is what fails.
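To illustrate the failure mode: BeautifulSoup's `find()` returns `None` when no matching tag exists, so chaining `.a` onto the result raises exactly this error. Below is a minimal sketch, assuming the book's code looks up the last-page link through a `td` cell with class `pgRR`. That selector, the sample HTML strings, and the helper name `last_page_href` are assumptions for illustration and are not taken from this thread.

```python
from bs4 import BeautifulSoup


def last_page_href(page_html):
    """Return the href of the 'last page' link, or None if the cell is missing."""
    soup = BeautifulSoup(page_html, "html.parser")
    cell = soup.find("td", class_="pgRR")  # None when the page lacks the cell
    if cell is None or cell.a is None:
        return None  # blocked or empty response: nothing to parse
    return cell.a["href"]


# Hypothetical responses: a normal page and a blocked (empty) one.
normal = '<table><tr><td class="pgRR"><a href="/item/sise_day.nhn?page=100">last</a></td></tr></table>'
blocked = "<html><body></body></html>"

print(last_page_href(normal))
print(last_page_href(blocked))
```

Checking for `None` before dereferencing turns the confusing AttributeError into an explicit "nothing was fetched" signal that the caller can handle.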

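Naver Finance checks whether the HTTP request header contains browser information (User-Agent), so the code has to be changed to send browser information using the requests library. In other words, as shown below, the HTML passed to the BeautifulSoup() constructor should come from requests.get() rather than from urlopen().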

import requests
from bs4 import BeautifulSoup

url = "http://finance.naver.com/item/sise_day.nhn?code=005930"
# Send a browser-like User-Agent so the request is not rejected
html = BeautifulSoup(requests.get(url, headers={'User-agent': 'Mozilla/5.0'}).text, "lxml")

DBUpdater.py in Chapter 5 also needs to be modified. For the changed code, please refer to DBUpdaterEx.py, which I have uploaded to GitHub.

ghmoon90 commented 3 years ago

I had the same problem and changed the code like this.

import urllib.request
from urllib.request import urlopen

# Browser-like request headers copied from Firefox
hdr = {
    'Host': 'finance.naver.com',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'ko-KR,ko;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Cache-Control': 'max-age=0',
}

req = urllib.request.Request(URL, headers=hdr)

with urlopen(req) as doc:
    ...  # blah blah ~

Maybe this can be updated in a later edition of the book.

Is crawling Naver Finance legal? Why would they change their network request policy like this?

If they forbid crawling, what do you suggest for updating the stock price DB? Is there any free, open-source source of stock market data?

Could we get the market data from "http://www.krx.co.kr/"?

INVESTAR commented 3 years ago

Thanks for letting me know. I checked that the code below also works fine.

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
url = "http://finance.naver.com/item/sise_day.nhn?code=005930"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
with urlopen(req) as doc:
    html = BeautifulSoup(doc, 'lxml')
    print(html)

I don't know why Naver Finance forbids crawling without User-Agent information, but my guess is that they need to reduce the load on their web servers.
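One detail that may help here: urllib does send a User-Agent by default, but it identifies itself as `Python-urllib/<version>`, so a server can tell scripts apart from browsers. The small offline sketch below (no request is sent) shows the default value and how the header from the code above is stored on the `Request`. Whether Naver Finance filters on this particular value is an assumption and is not confirmed in this thread.

```python
from urllib.request import Request, build_opener

url = "http://finance.naver.com/item/sise_day.nhn?code=005930"

# Headers urllib adds on its own when none are supplied.
default_headers = dict(build_opener().addheaders)
print(default_headers["User-agent"])  # e.g. Python-urllib/3.x

# Request stores header names capitalized, so look it up as 'User-agent'.
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
print(req.get_header("User-agent"))
```

A header set explicitly on the `Request` takes precedence over the opener's default, which is why passing `headers=` is enough.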

Even if Naver Finance blocks every kind of crawling, you can scrape other websites such as Yahoo Finance or Investing.com, and finance-datareader looks like a good alternative.

https://github.com/FinanceData/FinanceDataReader/wiki
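As a rough sketch of that alternative, assuming the package is installed with `pip install finance-datareader` and that its `DataReader(symbol, start, end)` call behaves as described in the wiki above. The helper name `fetch_daily_prices` and the date range are illustrative only.

```python
def fetch_daily_prices(code, start, end):
    """Fetch daily prices for a KRX stock code as a pandas DataFrame."""
    # Imported lazily so this file loads even if the package is not installed.
    import FinanceDataReader as fdr

    return fdr.DataReader(code, start, end)


# Usage (needs network access and the package installed):
#   df = fetch_daily_prices("005930", "2020-01-01", "2020-12-31")
#   print(df.tail())
# 005930 is the stock code used in the URLs earlier in this thread.
```

Because the library does the fetching itself, this path does not depend on request headers the way the page-scraping code does.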

ghmoon90 commented 3 years ago

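Oh, feel free to reply in Korean, haha. I bought the book last year and let it sit for a while. I've recently been studying on CentOS, but the Korean 104-key keyboard doesn't work over xrdp, so I wrote in English. I'll give FinanceData a try. Thank you.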