DistrictDataLabs / baleen

An automated ingestion service for blogs to construct a corpus for NLP research.
MIT License
86 stars 38 forks

Use readability on HTML #14

Closed: bbengfort closed this issue 7 years ago

bbengfort commented 8 years ago

Add the readability mechanism to extract the main article text from the HTML dump (or for insertion into MongoDB).
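For illustration only, here is a rough sketch of what "getting the good text" could look like. It is not baleen's implementation and not the readability library itself: it is a much cruder stand-in built on the standard library's `html.parser`, which keeps paragraph text and drops scripts, styles, and navigation blocks. The names `ParagraphTextExtractor`, `extract_text`, and the `min_chars` threshold are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphTextExtractor(HTMLParser):
    """Collect text found inside <p> tags, ignoring non-content containers.

    A simplified stand-in for a readability pass, not the real algorithm.
    """

    # Containers whose contents are assumed to be page chrome, not article text.
    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._skip_depth = 0
        self._in_p = False
        self._buf = []

    def _flush(self):
        # Join the buffered pieces and collapse runs of whitespace.
        text = " ".join("".join(self._buf).split())
        if text:
            self.paragraphs.append(text)
        self._buf = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "p" and self._skip_depth == 0:
            if self._in_p:
                # A new <p> implicitly closes an unclosed one.
                self._flush()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "p" and self._in_p:
            self._flush()

    def handle_data(self, data):
        if self._in_p and self._skip_depth == 0:
            self._buf.append(data)


def extract_text(html, min_chars=25):
    """Return paragraph text joined by blank lines, dropping very short ones."""
    parser = ParagraphTextExtractor()
    parser.feed(html)
    parser.close()
    if parser._in_p:
        # Keep a trailing paragraph that was never closed.
        parser._flush()
    return "\n\n".join(p for p in parser.paragraphs if len(p) >= min_chars)
```

The cleaned string returned by `extract_text` is the sort of value that could be stored next to the raw HTML at insertion time. A real readability pass would also score candidate blocks, which this sketch does not attempt.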

echolabstech commented 8 years ago

Looking into this.

echolabstech commented 8 years ago

See this issue.

janetriley commented 7 years ago

Resolved in #13 and #14; this issue can be closed.