DistrictDataLabs / baleen

An automated ingestion service for blogs to construct a corpus for NLP research.
MIT License

Make posts.htmlize() smarter #65

Open echolabstech opened 8 years ago

echolabstech commented 8 years ago

This method was originally written to wrap HTML snippets so that they look like a real web page. Now we have the ability to fetch complete web pages from RSS feeds. However, in some use cases, such as when the RSS feed fails to download a web page, the old wrapping behavior will still be necessary.

Requirements: htmlize() should either

  1. wrap snippets in a web page (what we already do), or
  2. return the complete web page
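
To make the two cases concrete, here is a rough sketch of how the branching could look. This is not the current baleen code: the standalone `htmlize(content, title)` signature, the `is_complete_page()` helper, the wrapper template, and the "doctype or `<html>` tag" heuristic are all assumptions made for illustration.

```python
import re
from html import escape

# Assumed heuristic: content counts as a complete page if it contains
# an HTML doctype or an opening <html> tag (case-insensitive).
_FULL_PAGE = re.compile(r"<!doctype\s+html|<html[\s>]", re.IGNORECASE)

# Hypothetical wrapper template for bare snippets.
_TEMPLATE = (
    "<!DOCTYPE html>\n"
    "<html>\n"
    "<head><meta charset=\"utf-8\"><title>{title}</title></head>\n"
    "<body>\n{body}\n</body>\n"
    "</html>"
)


def is_complete_page(content):
    """Return True if the content already looks like a full HTML document."""
    return bool(_FULL_PAGE.search(content or ""))


def htmlize(content, title=""):
    """Return complete pages unchanged; wrap snippets in a minimal page."""
    content = content or ""
    if is_complete_page(content):
        # Case 2: the fetched web page is already complete.
        return content
    # Case 1: old behavior, wrap the snippet so it looks like a web page.
    return _TEMPLATE.format(title=escape(title), body=content)
```

The regex check is only a heuristic. A snippet that happens to contain a literal `<html>` tag would be treated as a complete page, so a parser-based check may be a better fit if that turns out to matter.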