Closed · yuya-iwabuchi closed this issue 8 years ago
# python article_explorer.py
09/28/2015 11:16:44 PM - WARNING - error while getting links from article: None
2015-09-28 23:16:44 (Article|BBC) 1/5000
09/28/2015 11:16:44 PM - WARNING - error while doing readability parse: None
You must download() an article before parsing it!
Traceback (most recent call last):
File "article_explorer.py", line 545, in <module>
explore()
File "article_explorer.py", line 408, in explore
parse_articles(referring_sites, keyword_list, source_sites, source_twitter_list)
File "article_explorer.py", line 139, in parse_articles
article.preliminary_parse()
File "/root/Voyage/src/ExplorerArticle.py", line 76, in preliminary_parse
self.newspaper_article.parse()
File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/newspaper/article.py", line 156, in parse
raise ArticleException()
newspaper.article.ArticleException
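The ArticleException above fires because `preliminary_parse()` calls newspaper's `parse()` even when the preceding download failed (hence the "You must download() an article before parsing it!" message). A defensive sketch of the guard — the `html` check and the logger name are assumptions for illustration, not the repo's actual code:

```python
import logging

logger = logging.getLogger("article_explorer")

def safe_parse(article):
    """Parse a newspaper article only if its body was actually fetched.

    newspaper's Article.parse() raises ArticleException when download()
    failed or never ran; checking for a fetched body first (assumed here
    to live in .html) turns that hard crash into a logged skip.
    """
    if not getattr(article, "html", None):
        logger.warning("article skipped because download failed")
        return None
    return article.parse()
```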
# python article_explorer.py
Traceback (most recent call last):
File "article_explorer.py", line 545, in <module>
explore()
File "article_explorer.py", line 408, in explore
parse_articles(referring_sites, keyword_list, source_sites, source_twitter_list)
File "article_explorer.py", line 110, in parse_articles
for article in article_iterator:
File "/root/Voyage/src/Crawler.py", line 63, in next
url = urlparse(urlnorm.norm_tuple(*parsed_as_list))
File "/root/.pyenv/versions/2.7.10/lib/python2.7/urlparse.py", line 143, in urlparse
tuple = urlsplit(url, scheme, allow_fragments)
File "/root/.pyenv/versions/2.7.10/lib/python2.7/urlparse.py", line 182, in urlsplit
i = url.find(':')
AttributeError: 'tuple' object has no attribute 'find'
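This AttributeError is a plain type mix-up: `urlparse()` expects a string, but it was handed the already-normalized 6-tuple. The inverse function, `urlunparse()`, is what reassembles a tuple back into a URL — exactly what the later traceback shows the line was changed to. A minimal stdlib illustration (Python 3 names; Python 2 has the same pair in the `urlparse` module):

```python
from urllib.parse import urlparse, urlunparse

url = "http://bbc.com/programmes/p033w3k9"

# urlparse splits a string into a 6-tuple:
# (scheme, netloc, path, params, query, fragment)
parts = urlparse(url)

# Passing `parts` back into urlparse would crash with
# "AttributeError: 'tuple' object has no attribute 'find'",
# because urlparse immediately calls url.find(':') on its argument.
# urlunparse is the inverse operation:
rebuilt = urlunparse(parts)
```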
# python article_explorer.py
2015-09-28 23:50:06 (Article|BBC) 1/5000
2015-09-28 23:50:07 (Article|BBC) 2/5000
/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/bs4/dammit.py:269: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
09/28/2015 11:50:08 PM - WARNING - /root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/bs4/dammit.py:269: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/bs4/dammit.py:273: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \
09/28/2015 11:50:08 PM - WARNING - /root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/bs4/dammit.py:273: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \
/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/bs4/dammit.py:277: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:3] == b'\xef\xbb\xbf':
09/28/2015 11:50:08 PM - WARNING - /root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/bs4/dammit.py:277: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:3] == b'\xef\xbb\xbf':
/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/bs4/dammit.py:280: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:4] == b'\x00\x00\xfe\xff':
09/28/2015 11:50:08 PM - WARNING - /root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/bs4/dammit.py:280: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:4] == b'\x00\x00\xfe\xff':
/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/bs4/dammit.py:283: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:4] == b'\xff\xfe\x00\x00':
09/28/2015 11:50:08 PM - WARNING - /root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/bs4/dammit.py:283: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:4] == b'\xff\xfe\x00\x00':
2015-09-28 23:50:08 (Article|BBC) 3/5000
2015-09-28 23:50:09 (Article|BBC) 4/5000
Traceback (most recent call last):
File "article_explorer.py", line 545, in <module>
explore()
File "article_explorer.py", line 408, in explore
parse_articles(referring_sites, keyword_list, source_sites, source_twitter_list)
File "article_explorer.py", line 110, in parse_articles
for article in article_iterator:
File "/root/Voyage/src/Crawler.py", line 63, in next
url = urlunparse(urlnorm.norm_tuple(*parsed_as_list))
File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/urlnorm.py", line 161, in norm_tuple
raise InvalidUrl('missing netloc')
urlnorm.InvalidUrl: missing netloc
2015-09-29 23:21:18 (Article|BBC) 155/5000
Traceback (most recent call last):
File "article_explorer.py", line 545, in <module>
explore()
File "article_explorer.py", line 408, in explore
parse_articles(referring_sites, keyword_list, source_sites, source_twitter_list)
File "article_explorer.py", line 110, in parse_articles
for article in article_iterator:
File "/root/Voyage/src/Crawler.py", line 65, in next
url = urlunparse(urlnorm.norm_tuple(*parsed_as_list))
File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/urlnorm.py", line 159, in norm_tuple
authority = norm_netloc(scheme, authority)
File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/urlnorm.py", line 214, in norm_netloc
raise InvalidUrl('host %r is not valid' % host)
urlnorm.InvalidUrl: host u'\u2026' is not valid
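Both InvalidUrl crashes come from hrefs that were never crawlable in the first place: relative links with no netloc, and links whose text was a "…"-shortened URL (hence the host being the single character u'\u2026'). One option is simply wrapping `norm_tuple` in `try/except urlnorm.InvalidUrl`; another is pre-filtering, roughly like this stdlib-only sketch (the ASCII-hostname regex is a simplification and would also reject legitimate internationalized domains):

```python
import re
from urllib.parse import urlparse

# Simplification: accept ASCII hostnames only.
HOST_RE = re.compile(r"^[A-Za-z0-9.-]+$")

def is_crawlable(url):
    """Reject hrefs that urlnorm would raise InvalidUrl on."""
    parts = urlparse(url)
    if not parts.netloc:            # urlnorm: 'missing netloc'
        return False
    host = parts.hostname or ""
    if not HOST_RE.match(host):     # urlnorm: "host u'\u2026' is not valid"
        return False
    return True
```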
File "article_explorer.py", line 545, in <module>
explore()
File "article_explorer.py", line 408, in explore
parse_articles(referring_sites, keyword_list, source_sites, source_twitter_list)
File "article_explorer.py", line 110, in parse_articles
for article in article_iterator:
File "/root/Voyage/src/Crawler.py", line 55, in next
for url in article.get_urls():
File "/root/Voyage/src/ExplorerArticle.py", line 127, in get_urls
lxml_tree = lxml.html.fromstring(self.html)
File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/lxml/html/__init__.py", line 706, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/lxml/html/__init__.py", line 600, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src/lxml/lxml.etree.c:68121)
File "parser.pxi", line 1781, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102435)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
I traced the issue back to a page that the server incorrectly advertises as RSS. Since we have an exception handler for any uncaught exceptions (1e5d16ccda), I'm just going to ignore this, as it's an edge case that doesn't warrant a fix.
File "article_explorer.py", line 552, in <module>
explore()
File "article_explorer.py", line 415, in explore
parse_articles(referring_sites, keyword_list, source_sites, source_twitter_list)
File "article_explorer.py", line 110, in parse_articles
for article in article_iterator:
File "/root/Voyage/src/Crawler.py", line 55, in next
for url in article.get_urls():
File "/root/Voyage/src/ExplorerArticle.py", line 127, in get_urls
lxml_tree = lxml.html.fromstring(self.html)
File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/lxml/html/__init__.py", line 706, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/lxml/html/__init__.py", line 600, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src/lxml/lxml.etree.c:68121)
File "parser.pxi", line 1781, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102435)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
I thought I had put in an exception handler to fix this exact issue?
It was run using the latest commit (b527dab) on master.
Fixed in 05551617fb.
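For reference, the usual workaround for this lxml limitation (a sketch — the actual change in 05551617fb may differ) is to give lxml the raw response bytes, or, if only a decoded string is available, strip the XML declaration before parsing:

```python
import re

XML_DECL_RE = re.compile(r"^\s*<\?xml[^>]*\?>")

def prepare_for_lxml(html):
    """lxml refuses unicode input that still carries an XML encoding
    declaration (the ValueError in the traceback above). Handing it the
    undecoded bytes is cleanest; failing that, drop the declaration
    from the decoded string so lxml.html.fromstring accepts it."""
    if isinstance(html, str):
        return XML_DECL_RE.sub("", html, count=1)
    return html  # bytes pass through untouched
```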
Tests since October 7th:
CNN - ran up to 600k pages with no problems
2015-10-14 23:18:36 (Article|CNN) 604328/5000
2015-10-14 23:18:37 (Article|CNN) 604329/5000
2015-10-14 23:18:40 (Article|CNN) 604330/5000
NYT - ran up to 630k until randomly getting Killed on October 13th
2015-10-13 10:05:24 (Article|NYTimes) 633956/5000
10/13/2015 10:05:43 AM - WARNING - article skipped because download failed
10/13/2015 10:05:55 AM - WARNING - error while getting links from article: line 162: Tag footer invalid
2015-10-13 10:05:55 (Article|NYTimes) 633957/5000
10/13/2015 10:05:55 AM - WARNING - article skipped because download failed
Killed
BBC - stuck on a page for 4 days since October 10th
10/10/2015 06:06:55 AM - INFO - Matches with filter, skipping the http://bbc.com/iplayer/
10/10/2015 06:06:55 AM - INFO - Matches with filter, skipping the http://bbc.com/programmes/#
10/10/2015 06:06:55 AM - INFO - Matches with filter, skipping the http://bbc.com/programmes/p033w3l8
10/10/2015 06:06:55 AM - INFO - visiting http://bbc.com/programmes/p033w3k9
10/10/2015 06:06:55 AM - INFO - Starting new HTTP connection (1): bbc.com
10/10/2015 06:06:55 AM - INFO - Starting new HTTP connection (1): www.bbc.com
10/10/2015 06:06:55 AM - INFO - Starting new HTTP connection (1): www.bbc.co.uk
I am not sure why the process was killed for NYT; as for BBC, I'm guessing it got stuck because requests' default timeout is None, i.e. no timeout at all.
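On the timeout hypothesis: requests does default to `timeout=None`, so a single stalled connection can hang the crawler forever; passing an explicit timeout (e.g. `requests.get(url, timeout=10)`) turns the hang into a catchable exception. The same principle at the socket level, as a stdlib sketch:

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Open a TCP connection without ever blocking forever.

    With timeout=None (requests' default) a connect or read can stall
    for days -- consistent with BBC being stuck on one page since
    October 10th. An explicit timeout makes the stall a handleable error.
    """
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
        conn.close()
        return True
    except (socket.timeout, OSError):
        return False
```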
NYT: probably out of memory
edit:
/var/log/syslog.1:Oct 13 10:06:42 utmediacat kernel: [2762343.862200] Killed process 29362 (python) total-vm:854896kB, anon-rss:693960kB, file-rss:0kB
yep.
end edit
BBC: are you sure it was stuck? I recall that if the process crashes due to some error while crawling, it produces the same message. If it really is stuck, we'll need a debugger to find the cause (i.e. installing PyCharm on the server). Let me know and I'll have it set up.
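A lightweight way to watch for the NYT-style memory growth during a months-long crawl (a sketch, not something in the repo) is to log the process's peak RSS next to each progress line; the syslog entry above shows anon-rss at ~694 MB right before the OOM killer fired:

```python
import resource
import sys

def peak_rss_kb():
    """Peak resident set size of this process, in KiB.

    ru_maxrss is reported in KiB on Linux but in bytes on macOS,
    so normalize per platform before logging.
    """
    usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return usage // 1024 if sys.platform == "darwin" else usage
```

Logging this every N articles would show whether memory grows steadily with pages crawled (a leak) or jumps on specific pages.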
NYT - good find. We'll need some solution to overcome this issue, or else we won't be able to do the months-long crawling. BBC - which message are you referring to? Also, I'm re-running with the requests timeout set; we'll see.