optimizations for processing articles. improves speed for newspaper and plan b crawler.
use readability instead of newspaper for the initial parse of an article, which is about 3x faster. However, newspaper parsing is still needed if the article is to be added to the database. Most articles do not get added to the database (this is especially true when using the plan b crawler), so there are still significant performance gains.
use sets instead of lists for faster lookup speed
in get_sources_sites: use lxml for finding links instead of regex (html isn't a regular language!)
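a rough sketch of the two changes above (sets for lookups, lxml for link extraction), not the actual code in get_sources_sites. the function names `extract_link_domains` and `matching_sources` and the sample urls are hypothetical, made up for illustration. it only uses lxml's documented `lxml.html.fromstring` and `iterlinks()`:

```python
from urllib.parse import urljoin, urlparse

import lxml.html


def extract_link_domains(html, base_url):
    """Return the set of domains that <a href> links in `html` point to.

    Hypothetical helper: the real get_sources_sites may differ.
    """
    tree = lxml.html.fromstring(html)
    domains = set()
    # iterlinks() yields (element, attribute, link, pos) for every
    # link-carrying attribute (href, src, ...), so filter to <a href>.
    for element, attribute, link, _pos in tree.iterlinks():
        if element.tag != "a" or attribute != "href":
            continue
        # Resolve relative links against the page url before parsing.
        netloc = urlparse(urljoin(base_url, link)).netloc.lower()
        if netloc:
            domains.add(netloc)
    return domains


def matching_sources(html, base_url, source_sites):
    """Intersect linked domains with the tracked sites.

    Set intersection avoids scanning a list once per link.
    """
    return extract_link_domains(html, base_url) & set(source_sites)
```

parsing the markup with lxml means links are read from real `<a>` elements, so urls that only appear in scripts, comments or attribute text that a regex would pick up are skipped.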
tweaks:
search for urls and twitter authors in articles is now only done within the article content itself rather than throughout the entire article html. this should reduce false-positive matches. relevant lines to revert this change:
get_sources_twitter: article.summary vs article.html
get_sources_sites: article_links_only=True vs article_links_only=False
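to illustrate why searching only the article content reduces false positives, here is a small sketch. it is not the real get_sources_twitter: `find_twitter_authors`, the handle pattern and the sample strings are all assumptions made up for this example.

```python
import re

# Assumed handle pattern: "@" followed by up to 15 word characters,
# not preceded by a word character (which skips e-mail addresses).
HANDLE_RE = re.compile(r"(?<![\w@])@(\w{1,15})\b")


def find_twitter_authors(text, tracked_handles):
    """Return the tracked handles mentioned in `text` (case-insensitive)."""
    found = {handle.lower() for handle in HANDLE_RE.findall(text)}
    return found & {handle.lower() for handle in tracked_handles}


tracked = {"alice_news", "site_promo"}

# Article content only (roughly what article.summary would hold).
content = "Reporting by @alice_news on the ground."
# Full page (roughly article.html), including unrelated footer text.
full_page = content + " <footer>Follow @site_promo</footer>"

content_matches = find_twitter_authors(content, tracked)
page_matches = find_twitter_authors(full_page, tracked)
```

here `content_matches` only contains the handle cited in the article body, while `page_matches` also picks up the handle from the footer, which is the kind of false positive the tweak is meant to avoid.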