datasci-info / AG101

Renew youtube_crawler.py, close #7 #8

Closed anniehuang921 closed 9 years ago

anniehuang921 commented 9 years ago

finished task, please review!

@c3h3 @adrianliaw @ChihChengLiang

c3h3 commented 9 years ago

README.md hasn't been updated ... XD

anniehuang921 commented 9 years ago

README.md is kept as it was XD. Finished task, please review!

@c3h3 @adrianliaw @ChihChengLiang

ChihChengLiang commented 9 years ago

When I ran it, there seemed to be some problems (my database already had some data in it).

I followed the README and ran:

YT_CHANNEL_ID="TWuseRGroup" MONGO_URI="mongodb://localhost:27017/agilearning" python youtube_crawler.py

/home/chihchengliang/.pyenv/versions/2.7.8/lib/python2.7/site-packages/setuptools-5.6-py2.7.egg/pkg_resources.py:1049: UserWarning: /home/chihchengliang/.python-eggs is writable by group/others and vulnerable to attack when used with get_resource_filename. Consider a more secure location (set with .set_extraction_path or the PYTHON_EGG_CACHE environment variable).
next_page_list =  ['https://gdata.youtube.com/feeds/users/TWuseRGroup/uploads/']
next_page_list =  [u'https://gdata.youtube.com/feeds/users/TWuseRGroup/uploads?alt=json&start-index=26&max-results=25']
next_page_list =  [u'https://gdata.youtube.com/feeds/users/TWuseRGroup/uploads?alt=json&start-index=51&max-results=25']
next_page_list =  [u'https://gdata.youtube.com/feeds/users/TWuseRGroup/uploads?alt=json&start-index=76&max-results=25']
next_page_list =  [u'https://gdata.youtube.com/feeds/users/TWuseRGroup/uploads?alt=json&start-index=101&max-results=25']
next_page_list =  [u'https://gdata.youtube.com/feeds/users/TWuseRGroup/uploads?alt=json&start-index=126&max-results=25']
next_page_list =  [u'https://gdata.youtube.com/feeds/users/TWuseRGroup/uploads?alt=json&start-index=151&max-results=25']
next_page_list =  [u'https://gdata.youtube.com/feeds/users/TWuseRGroup/uploads?alt=json&start-index=176&max-results=25']
next_page_list =  [u'https://gdata.youtube.com/feeds/users/TWuseRGroup/uploads?alt=json&start-index=201&max-results=25']
next_page_list =  [u'https://gdata.youtube.com/feeds/users/TWuseRGroup/uploads?alt=json&start-index=226&max-results=25']
next_page_list =  [u'https://gdata.youtube.com/feeds/users/TWuseRGroup/uploads?alt=json&start-index=251&max-results=25']
next_page_list =  [u'https://gdata.youtube.com/feeds/users/TWuseRGroup/uploads?alt=json&start-index=276&max-results=25']
next_page_list =  []
Traceback (most recent call last):
  File "youtube_crawler.py", line 29, in <module>
    if ddt !=[]:learning_resources_collection.insert(ddt)
  File "build/bdist.linux-x86_64/egg/pymongo/collection.py", line 410, in insert
  File "build/bdist.linux-x86_64/egg/pymongo/helpers.py", line 202, in _check_write_command_response
pymongo.errors.DuplicateKeyError: insertDocument :: caused by :: 11000 E11000 duplicate key error index: agilearning.learningResources.$_id_  dup key: { : "YTV_NqXyh1rOy-s" }
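
The traceback above comes from inserting a document whose `_id` ("YTV_NqXyh1rOy-s") is already in the collection. One possible way to make the crawler safe to re-run against a non-empty database is to upsert by `_id` instead of doing a plain insert. The following is only a minimal sketch, not the crawler's actual code: `upsert_by_id` and `InMemoryCollection` are hypothetical names, the `title` values are made up, and it assumes PyMongo 3.0 or later, where `Collection.replace_one(filter, replacement, upsert=True)` is available. The in-memory stand-in is there only so the example runs without a MongoDB server.

```python
def upsert_by_id(collection, documents):
    """Write each document keyed on its _id.

    A document whose _id already exists is replaced rather than
    inserted again, so no DuplicateKeyError is raised on re-runs.
    Returns the number of documents written.
    """
    written = 0
    for doc in documents:
        # With upsert=True, replace_one inserts the document when no
        # existing one matches the filter and overwrites it otherwise.
        collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)
        written += 1
    return written


class InMemoryCollection:
    """Hypothetical stand-in for a pymongo Collection (illustration only)."""

    def __init__(self):
        self.docs = {}

    def replace_one(self, filter, replacement, upsert=False):
        key = filter["_id"]
        if key in self.docs or upsert:
            self.docs[key] = dict(replacement)


# The database already holds a video from an earlier run.
collection = InMemoryCollection()
upsert_by_id(collection, [{"_id": "YTV_NqXyh1rOy-s", "title": "old title"}])

# A second crawl returns the same video plus a new one (made-up data).
crawled = [
    {"_id": "YTV_NqXyh1rOy-s", "title": "new title"},
    {"_id": "YTV_example", "title": "another video"},
]
upsert_by_id(collection, crawled)
print(sorted(collection.docs))
```

With a real database, the same function could be called with `learning_resources_collection` in place of the stand-in. The trade-off is one write per document instead of one bulk insert, which is probably acceptable for a channel of a few hundred videos.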

anniehuang921 commented 9 years ago

@ChihChengLiang I tried filling the database and then removing some of the data, but I didn't run into the same problem...

youtube_crawler.ipynb can be used for testing.

Finished task again, please review. @c3h3 @adrianliaw @ChihChengLiang