alan-turing-institute / misinformation-crawler

Web crawler to collect snapshots of articles to web archive
MIT License

ReadabiliPy breaking on centerforsecuritypolicy.org #353

Closed jemrobinson closed 5 years ago

jemrobinson commented 5 years ago

For https://www.centerforsecuritypolicy.org/2007/12/11/victory-via-fuel-choice-2/, article extraction fails with an AttributeError raised inside ReadabiliPy:

Traceback (most recent call last):
  File "populate_article_db.py", line 38, in <module>
    main()
  File "populate_article_db.py", line 34, in main
    use_local=args.local)
  File "/Users/jrobinson/Projects/misinformation/misinformation-crawler/misinformation/warc/warc_parser.py", line 131, in process_webpages
    response = response_from_warc(warc_data)
  File "/Users/jrobinson/Projects/misinformation/misinformation-crawler/misinformation/extractors/extract_article.py", line 41, in extract_article
    default_readability_article = simple_json_from_html_string(page_html, content_digests, node_indexes, use_readability=False)
  File "/Users/jrobinson/Projects/misinformation/misinformation-crawler/ReadabiliPy/readabilipy/simple_json.py", line 34, in simple_json_from_html_string
    "content": str(simple_tree_from_html_string(html))
  File "/Users/jrobinson/Projects/misinformation/misinformation-crawler/ReadabiliPy/readabilipy/simple_tree.py", line 42, in simple_tree_from_html_string
    insert_paragraph_breaks(soup)
  File "/Users/jrobinson/Projects/misinformation/misinformation-crawler/ReadabiliPy/readabilipy/simplifiers/html.py", line 198, in insert_paragraph_breaks
    if parent_element.name == "p":
AttributeError: 'NoneType' object has no attribute 'name'
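
The last frame suggests that `parent_element` was `None` at the point where `.name` was read, which would happen for a node that has no parent (for example, one that is detached from the tree). Below is a minimal sketch of a guard for that case. It is not the ReadabiliPy code: `Node` and `parent_is_paragraph` are hypothetical names used only for illustration, and the actual fix in `insert_paragraph_breaks` may look quite different.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a parsed HTML element (not a ReadabiliPy class)."""
    name: str
    parent: Optional["Node"] = None


def parent_is_paragraph(element: Node) -> bool:
    """Return True only if the element has a parent and that parent is a <p>."""
    parent_element = element.parent
    # A node without a parent cannot be inside a paragraph, so return False
    # instead of dereferencing None (the failure seen in the traceback).
    if parent_element is None:
        return False
    return parent_element.name == "p"


paragraph = Node("p")
inside = Node("br", parent=paragraph)
detached = Node("br")  # parent is None, like the failing case

print(parent_is_paragraph(inside))
print(parent_is_paragraph(detached))
```

Whether returning `False` is the right behaviour here is an assumption; it may be that a parentless node at this point signals an earlier problem in the tree handling that should be fixed instead.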