UTMediaCAT / Voyage


UnicodeDammit and Crawler #20

Closed: yuya-iwabuchi closed this issue 8 years ago

yuya-iwabuchi commented 9 years ago
# python article_explorer.py
09/28/2015 10:47:05 PM - WARNING - UnicodeDammit instance has no attribute '__len__' on http://bbc.com/  
09/28/2015 10:47:05 PM - WARNING - error while getting links from article: None  
2015-09-28 22:47:05 (Article|BBC) 1/5000  
09/28/2015 10:47:06 PM - WARNING - UnicodeDammit instance has no attribute '__len__' on http://bbc.com/  
09/28/2015 10:47:06 PM - WARNING - article skipped because download failed 
2015-09-28 22:47:06 (Article|BBC) 1/5000  
09/28/2015 10:47:06 PM - WARNING - Sleeping for 599s
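The warning above suggests something called `len()` on the `UnicodeDammit` object itself instead of on its decoded text; bs4 exposes the decoded string as `unicode_markup`. A minimal sketch of the defensive fix, using a hypothetical stand-in class (`FakeUnicodeDammit`) so the example runs without bs4:

```python
class FakeUnicodeDammit(object):
    """Hypothetical stand-in: bs4.UnicodeDammit exposes its decoded
    text as .unicode_markup; the instance itself has no __len__."""
    def __init__(self, markup):
        self.unicode_markup = markup.decode("utf-8", "replace")

def as_text(obj):
    """Return decoded text whether given a dammit-style object or a plain string."""
    return getattr(obj, "unicode_markup", obj)

result = FakeUnicodeDammit(b"<html>hello</html>")
print(len(as_text(result)))  # len() now measures the string, not the instance
```

The same `getattr` guard would let the crawler accept either a plain string or a `UnicodeDammit` result at that call site.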
yuya-iwabuchi commented 9 years ago
# python article_explorer.py
09/28/2015 11:16:44 PM - WARNING - error while getting links from article: None
2015-09-28 23:16:44 (Article|BBC) 1/5000
09/28/2015 11:16:44 PM - WARNING - error while doing readability parse: None
You must download() an article before parsing it!
Traceback (most recent call last):
  File "article_explorer.py", line 545, in <module>
    explore()
  File "article_explorer.py", line 408, in explore
    parse_articles(referring_sites, keyword_list, source_sites, source_twitter_list)
  File "article_explorer.py", line 139, in parse_articles
    article.preliminary_parse()
  File "/root/Voyage/src/ExplorerArticle.py", line 76, in preliminary_parse
    self.newspaper_article.parse()
  File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/newspaper/article.py", line 156, in parse
    raise ArticleException()
newspaper.article.ArticleException
yuya-iwabuchi commented 9 years ago
# python article_explorer.py
Traceback (most recent call last):
  File "article_explorer.py", line 545, in <module>
    explore()
  File "article_explorer.py", line 408, in explore
    parse_articles(referring_sites, keyword_list, source_sites, source_twitter_list)
  File "article_explorer.py", line 110, in parse_articles
    for article in article_iterator:
  File "/root/Voyage/src/Crawler.py", line 63, in next
    url = urlparse(urlnorm.norm_tuple(*parsed_as_list))
  File "/root/.pyenv/versions/2.7.10/lib/python2.7/urlparse.py", line 143, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "/root/.pyenv/versions/2.7.10/lib/python2.7/urlparse.py", line 182, in urlsplit
    i = url.find(':')
AttributeError: 'tuple' object has no attribute 'find'
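The crash comes from handing `urlparse` the tuple produced by `urlnorm.norm_tuple`: `urlparse` expects a string, and the inverse operation for a parsed 6-tuple is `urlunparse`, which is what the later traceback shows the call corrected to. A quick stdlib sketch (Python 3's `urllib.parse`; on the Python 2.7 in these logs the module is named `urlparse`):

```python
from urllib.parse import urlparse, urlunparse

url = "http://bbc.com/news/world"
parts = urlparse(url)       # string -> 6-part named tuple
# urlparse(parts) would fail: a tuple has no string methods like .find().
# The round-trip from a 6-tuple back to a URL string is urlunparse:
rebuilt = urlunparse(parts)
print(rebuilt)
```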
yuya-iwabuchi commented 9 years ago
# python article_explorer.py
2015-09-28 23:50:06 (Article|BBC) 1/5000
2015-09-28 23:50:07 (Article|BBC) 2/5000
/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/bs4/dammit.py:269: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \

09/28/2015 11:50:08 PM - WARNING - /root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/bs4/dammit.py:269: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \

/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/bs4/dammit.py:273: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \

09/28/2015 11:50:08 PM - WARNING - /root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/bs4/dammit.py:273: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \

/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/bs4/dammit.py:277: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif data[:3] == b'\xef\xbb\xbf':

09/28/2015 11:50:08 PM - WARNING - /root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/bs4/dammit.py:277: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif data[:3] == b'\xef\xbb\xbf':

/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/bs4/dammit.py:280: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif data[:4] == b'\x00\x00\xfe\xff':

09/28/2015 11:50:08 PM - WARNING - /root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/bs4/dammit.py:280: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif data[:4] == b'\x00\x00\xfe\xff':

/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/bs4/dammit.py:283: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif data[:4] == b'\xff\xfe\x00\x00':

09/28/2015 11:50:08 PM - WARNING - /root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/bs4/dammit.py:283: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif data[:4] == b'\xff\xfe\x00\x00':

2015-09-28 23:50:08 (Article|BBC) 3/5000
2015-09-28 23:50:09 (Article|BBC) 4/5000
Traceback (most recent call last):
  File "article_explorer.py", line 545, in <module>
    explore()
  File "article_explorer.py", line 408, in explore
    parse_articles(referring_sites, keyword_list, source_sites, source_twitter_list)
  File "article_explorer.py", line 110, in parse_articles
    for article in article_iterator:
  File "/root/Voyage/src/Crawler.py", line 63, in next
    url = urlunparse(urlnorm.norm_tuple(*parsed_as_list))
  File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/urlnorm.py", line 161, in norm_tuple
    raise InvalidUrl('missing netloc')
urlnorm.InvalidUrl: missing netloc
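`urlnorm.norm_tuple` raises `InvalidUrl('missing netloc')` when a crawled link has no host, typically a relative link pulled out of a page. A stdlib-only sketch of the guard, under the assumption that links should be resolved against the page URL and skipped if they still lack a host (`normalizable` is a hypothetical helper, not part of the codebase):

```python
from urllib.parse import urljoin, urlparse

def normalizable(link, base):
    """Resolve a possibly-relative link against the page URL and accept
    it only if it ends up with both a scheme and a host (urlnorm raises
    InvalidUrl('missing netloc') otherwise)."""
    absolute = urljoin(base, link)
    parts = urlparse(absolute)
    if not parts.scheme or not parts.netloc:
        return None
    return absolute

base = "http://bbc.com/news/"
print(normalizable("world-12345", base))                 # becomes absolute, kept
print(normalizable("mailto:someone@example.com", base))  # no host -> skipped
```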
yuya-iwabuchi commented 9 years ago
2015-09-29 23:21:18 (Article|BBC) 155/5000
Traceback (most recent call last):
  File "article_explorer.py", line 545, in <module>
    explore()
  File "article_explorer.py", line 408, in explore
    parse_articles(referring_sites, keyword_list, source_sites, source_twitter_list)
  File "article_explorer.py", line 110, in parse_articles
    for article in article_iterator:
  File "/root/Voyage/src/Crawler.py", line 65, in next
    url = urlunparse(urlnorm.norm_tuple(*parsed_as_list))
  File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/urlnorm.py", line 159, in norm_tuple
    authority = norm_netloc(scheme, authority)
  File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/urlnorm.py", line 214, in norm_netloc
    raise InvalidUrl('host %r is not valid' % host)
urlnorm.InvalidUrl: host u'\u2026' is not valid
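Here the "host" is a single U+2026 horizontal ellipsis, i.e. a truncated display URL that leaked into the crawl queue. One option is to pre-filter hostnames before normalization; a sketch assuming ASCII-only hostnames (internationalized names would need an IDNA/punycode step first), with `plausible_host` as a hypothetical helper:

```python
import re

# RFC 952/1123-style hostnames use only letters, digits, hyphens and
# dots, and start/end with an alphanumeric. A truncated display URL
# whose "host" is the ellipsis character fails this check up front
# instead of blowing up inside urlnorm.
HOST_RE = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9.-]*[A-Za-z0-9])?$")

def plausible_host(host):
    return bool(HOST_RE.match(host))

print(plausible_host("www.bbc.co.uk"))  # True
print(plausible_host("\u2026"))         # False
```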
yuya-iwabuchi commented 9 years ago
File "article_explorer.py", line 545, in <module>
    explore()
  File "article_explorer.py", line 408, in explore
    parse_articles(referring_sites, keyword_list, source_sites, source_twitter_list)
  File "article_explorer.py", line 110, in parse_articles
    for article in article_iterator:
  File "/root/Voyage/src/Crawler.py", line 55, in next
    for url in article.get_urls():
  File "/root/Voyage/src/ExplorerArticle.py", line 127, in get_urls
    lxml_tree = lxml.html.fromstring(self.html)
  File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/lxml/html/__init__.py", line 706, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/lxml/html/__init__.py", line 600, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src/lxml/lxml.etree.c:68121)
  File "parser.pxi", line 1781, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102435)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
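lxml raises this `ValueError` when given a `str` (already-decoded text) that still carries an XML encoding declaration, which is exactly what an RSS feed fetched as an article looks like. The usual fixes are to pass lxml the raw, undecoded bytes instead of `self.html`, or to strip the declaration from the decoded text. A sketch of the latter (`strip_xml_declaration` is a hypothetical helper):

```python
import re

# Drop a leading <?xml ... ?> declaration from already-decoded text so
# lxml.html.fromstring / etree.fromstring will accept the str.
XML_DECL_RE = re.compile(r"^\s*<\?xml[^>]*\?>\s*")

def strip_xml_declaration(text):
    return XML_DECL_RE.sub("", text, count=1)

doc = '<?xml version="1.0" encoding="utf-8"?>\n<rss><channel/></rss>'
print(strip_xml_declaration(doc))  # -> <rss><channel/></rss>
```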
zhouwein commented 9 years ago

I traced the issue back to an RSS page that the server incorrectly advertises as HTML, so the crawler tries to parse the XML feed as an article. Since we already catch any uncaught exceptions in a blanket handler (1e5d16ccda), I'm going to ignore this; it's an edge case that doesn't warrant a dedicated fix.

yuya-iwabuchi commented 9 years ago
File "article_explorer.py", line 552, in <module>
    explore()
  File "article_explorer.py", line 415, in explore
    parse_articles(referring_sites, keyword_list, source_sites, source_twitter_list)
  File "article_explorer.py", line 110, in parse_articles
    for article in article_iterator:
  File "/root/Voyage/src/Crawler.py", line 55, in next
    for url in article.get_urls():
  File "/root/Voyage/src/ExplorerArticle.py", line 127, in get_urls
    lxml_tree = lxml.html.fromstring(self.html)
  File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/lxml/html/__init__.py", line 706, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/lxml/html/__init__.py", line 600, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src/lxml/lxml.etree.c:68121)
  File "parser.pxi", line 1781, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102435)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
zhouwein commented 9 years ago

I thought I had put in an exception handler for this exact issue?

yuya-iwabuchi commented 9 years ago

It was run using the latest commit (b527dab) on master.

zhouwein commented 9 years ago

Fixed in 05551617fb.

yuya-iwabuchi commented 8 years ago

Test results since October 7th:

CNN - ran up to 600k pages with no problems

2015-10-14 23:18:36 (Article|CNN) 604328/5000
2015-10-14 23:18:37 (Article|CNN) 604329/5000
2015-10-14 23:18:40 (Article|CNN) 604330/5000

NYT - ran up to ~630k pages until the process was abruptly Killed on October 13th

2015-10-13 10:05:24 (Article|NYTimes) 633956/5000
10/13/2015 10:05:43 AM - WARNING - article skipped because download failed
10/13/2015 10:05:55 AM - WARNING - error while getting links from article: line 162: Tag footer invalid
2015-10-13 10:05:55 (Article|NYTimes) 633957/5000
10/13/2015 10:05:55 AM - WARNING - article skipped because download failed
Killed

BBC - stuck on a single page for 4 days, since October 10th

10/10/2015 06:06:55 AM - INFO - Matches with filter, skipping the http://bbc.com/iplayer/
10/10/2015 06:06:55 AM - INFO - Matches with filter, skipping the http://bbc.com/programmes/#
10/10/2015 06:06:55 AM - INFO - Matches with filter, skipping the http://bbc.com/programmes/p033w3l8
10/10/2015 06:06:55 AM - INFO - visiting http://bbc.com/programmes/p033w3k9
10/10/2015 06:06:55 AM - INFO - Starting new HTTP connection (1): bbc.com
10/10/2015 06:06:55 AM - INFO - Starting new HTTP connection (1): www.bbc.com
10/10/2015 06:06:55 AM - INFO - Starting new HTTP connection (1): www.bbc.co.uk

I am not sure why the NYT process was killed. For BBC, my guess is that it got stuck because requests has no default timeout (timeout defaults to None), so a stalled connection blocks forever.
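On the BBC hang: with `timeout=None`, a server that accepts the connection but never sends data blocks the crawler indefinitely. The reliable fix is an explicit `timeout=` on every requests call; a coarse stdlib safety net is a process-wide socket timeout, though whether a given HTTP stack honors it varies, so treat this sketch as a backstop rather than a substitute:

```python
import socket

# requests defaults to timeout=None (wait forever). The reliable fix is
# an explicit per-call timeout, e.g.:
#   requests.get(url, timeout=(10, 30))   # (connect, read) seconds
# A coarse, process-wide safety net for anything that forgot one:
socket.setdefaulttimeout(30)  # seconds; applies to newly created sockets

print(socket.getdefaulttimeout())
```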

zhouwein commented 8 years ago

NYT: probably out of memory.

edit: confirmed in /var/log/syslog.1:

Oct 13 10:06:42 utmediacat kernel: [2762343.862200] Killed process 29362 (python) total-vm:854896kB, anon-rss:693960kB, file-rss:0kB

end edit

BBC: are you sure it was stuck? I recall that if the process crashes due to some error while crawling, it produces the same message. If it really is stuck, we'll need a debugger to find the cause (i.e. installing PyCharm on the server). Let me know and I'll have it set up.

yuya-iwabuchi commented 8 years ago

NYT - good find. We'll need some solution to overcome this, or we won't be able to run the months-long crawls.

BBC - which message are you referring to? Also, I'm re-running with an explicit timeout set on requests; we'll see.