Fetch the movie database from the OMDb API

Shih-Po commented 8 years ago

Concerns

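Convert the fetched unicode strings into one complete JSON file and save it
Account for network problems: a request may be interrupted
Avoid being mistaken for an attack: add a sleep inside the for loop to imitate human pacing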
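
The last two concerns can be sketched as a small retry helper. This is only a minimal sketch, not code from this thread: `fetch_with_retry`, its parameters, and the default values are hypothetical names chosen for illustration, and `fetch` stands for any zero-argument callable that performs one request.

```python
import time


def fetch_with_retry(fetch, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or `retries` attempts are used up.

    fetch -- hypothetical zero-argument callable that performs one request
    delay -- seconds to wait after a failed attempt
    sleep -- injectable so the pause can be replaced in tests
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for _ in range(retries):
        try:
            return fetch()
        except IOError as exc:  # network failures usually surface as IOError subclasses
            last_error = exc
            sleep(delay)
    raise last_error
```

With requests this might be called as `fetch_with_retry(lambda: requests.get(url, timeout=10))`, with an additional `time.sleep` between successive movies to keep the request rate low.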

max80713 commented 8 years ago

Fetching a string that contains the letter Å and writing it out:

import requests
res = requests.get("http://www.omdbapi.com/?i=tt0107349&plot=full&r=json")
with open('tmp.json','w') as f:
    f.write(res.text)  # Python 2: res.text is unicode and gets encoded with ascii here

Characters outside the ASCII range (0 to 127) produce an error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xc5' in position 124: ordinal not in range(128)
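
The failure can be reproduced without any network access. A minimal sketch using only built-in string methods: encoding Å with the ASCII codec, which is what Python 2 does implicitly when a unicode string is written to a file opened with `open(..., 'w')`, raises the same kind of error.

```python
text = u'\xc5'  # the letter Å, code point U+00C5

try:
    text.encode('ascii')  # ASCII only covers code points 0-127
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(repr(error))
```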

First, check Python 2's default encoding:

import sys    
sys.getdefaultencoding()

This shows that Python 2's default encoding is 'ascii'.

Solution:

reload(sys)    # to enable `setdefaultencoding` again
sys.setdefaultencoding('utf8')
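
An alternative that avoids changing the interpreter-wide default is to open the file with an explicit encoding. The following is a minimal sketch, assuming the standard-library `io.open` (available since Python 2.6 and the same function as the built-in `open` in Python 3); `save_text` is a hypothetical helper name.

```python
import io


def save_text(path, text):
    """Write a unicode string to `path`, encoded as UTF-8."""
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)
```

In the snippet above this would replace the `open`/`write` pair, for example `save_text('tmp.json', res.text)`.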

Shih-Po commented 8 years ago

Solution for: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

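According to this article, the first thing to understand is that Unicode and UTF-8 are not the same thing:

Unicode is a universal character set, that is, a "code table" that assigns a number to every character

UTF-8 is one encoding for storing the entries of that table as bytes

Unicode text does not have to be encoded to bytes with UTF-8; other encodings exist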
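
To make the distinction concrete, here is a small sketch that takes the single character Å and encodes it three ways. The code point stays the same; only the stored bytes differ.

```python
char = u'\xc5'  # Å

code_point = ord(char)               # position in the Unicode table
as_utf8 = char.encode('utf-8')       # two bytes
as_utf16 = char.encode('utf-16-le')  # two different bytes
as_latin1 = char.encode('latin-1')   # a single byte

print(hex(code_point), repr(as_utf8), repr(as_utf16), repr(as_latin1))
```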

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

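Python 2 has two string types: byte strings (str) and unicode strings (unicode). Generally, once #coding=utf-8 is set,

every string literal in the source that contains Chinese becomes a UTF-8-encoded byte string,

while strings returned by functions (such as res.text from requests) are unicode strings.

When the two are mixed (concatenated), Python tries to convert the byte string to a unicode string.

Python 2's default decoder is ascii, which cannot decode those UTF-8 bytes, and that causes this error.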
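
The implicit conversion can be made explicit. A minimal sketch: decoding the UTF-8 bytes of the character 中 with the ASCII codec reproduces the message quoted in this comment, because the first byte of that character is 0xe4. Python 3 never performs this implicit decode and raises a TypeError when the two types are concatenated.

```python
raw = u'\u4e2d'.encode('utf-8')  # the character 中 as UTF-8 bytes: e4 b8 ad

try:
    raw.decode('ascii')  # what Python 2 attempts when str and unicode are mixed
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```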

Solutions

Convert every string to a byte string
Convert every string to a unicode string
Change the default decoder to utf-8
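
For comparison, the second option can be sketched as follows: decode bytes as soon as they arrive so that all concatenation happens between unicode strings. `join_fields` and its `label` parameter are hypothetical names used only for illustration.

```python
def join_fields(raw_title, label=u'Title: '):
    """Decode bytes on input so the concatenation is unicode + unicode."""
    if isinstance(raw_title, bytes):
        title = raw_title.decode('utf-8')
    else:
        title = raw_title
    return label + title
```

The result stays a unicode string and would be encoded once, at the point where it is written to a file.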

The code below uses the first approach and was tested to work:

import requests

# omdb_full_lines is assumed to be a list of comma-separated lines loaded earlier
for line in omdb_full_lines:
    imdb_id = 'tt' + line.split(',')[1]
    res = requests.get("http://www.omdbapi.com/?i=" + imdb_id + "&plot=short&r=json")

    # res.text is the response body as a unicode string;
    # encode it to a UTF-8 byte string before concatenating (Python 2)
    movie = res.text.encode('utf-8') + ', '

    # Append to the file on every iteration
    with open('omdb.json', 'a') as t:
        t.write(movie)
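
Appending each body followed by `', '` leaves a file with no enclosing brackets and a trailing comma, so it would not parse as one JSON document. A possible way to get a single complete JSON file is to parse each response and dump the whole list once. This is a minimal sketch; `save_movies` and its arguments are hypothetical names.

```python
import json


def save_movies(bodies, path):
    """Parse each response body and write all movies as one JSON array.

    bodies -- iterable of JSON texts, e.g. the res.text of each response
    """
    movies = [json.loads(body) for body in bodies]
    with open(path, 'w') as f:
        # ensure_ascii=True (the default) escapes non-ASCII characters as \uXXXX,
        # so the file holds only ASCII and no encoding error can occur on write.
        json.dump(movies, f)
    return len(movies)
```

Reading the file back with `json.load` restores the original characters, since the `\uXXXX` escapes are part of the JSON format.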

Shih-Po / slidebrothers

Fetch the movie database from the OMDb API #5
