DistrictDataLabs / baleen

An automated ingestion service for blogs to construct a corpus for NLP research.
MIT License

Sanitize HTML #13

Closed bbengfort closed 7 years ago

bbengfort commented 8 years ago

Use bleach to sanitize the post HTML to ensure there are no harmful scripts.

This could be done either on export or before the post is stored in Mongo.
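For illustration, a minimal sketch of what sanitizing with bleach might look like. The helper name `sanitize_html` and the allow-list of tags and attributes are assumptions made for this example; the issue does not say which markup Baleen should keep, nor where the call would sit in the ingestion or export code.

```python
import bleach

# Hypothetical allow-list: the tags and attributes to keep are not specified
# in this issue, so these are placeholders for a typical blog post.
ALLOWED_TAGS = {"a", "p", "em", "strong", "ul", "ol", "li", "blockquote", "code", "pre"}
ALLOWED_ATTRIBUTES = {"a": ["href", "title"]}


def sanitize_html(html):
    """Return html with tags and attributes outside the allow-list removed."""
    # strip=True drops disallowed tags instead of escaping them;
    # the text inside a stripped tag is kept as plain text.
    return bleach.clean(
        html,
        tags=ALLOWED_TAGS,
        attributes=ALLOWED_ATTRIBUTES,
        strip=True,
    )


# Example input with a script tag, an inline event handler, and a javascript: link.
dirty = (
    '<p onclick="steal()">Hello <script>alert(1)</script>'
    '<a href="javascript:alert(2)">link</a></p>'
)
clean = sanitize_html(dirty)
print(clean)
```

With `strip=True` the disallowed tags are removed rather than escaped, so the stored post stays readable. Leaving `strip` at its default of `False` would escape them instead, which keeps a visible record of what was in the original HTML.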

echolabstech commented 8 years ago

Looking into this.

bbengfort commented 8 years ago

See also #14

echolabstech commented 8 years ago

Resolved here.

Also see this issue.

bbengfort commented 8 years ago

@echolabstech -- do you want to write some tests and then we can close this issue?

janetriley commented 7 years ago

I've written some tests, will submit a PR.

echolabstech commented 7 years ago

Thanks for the tests!

janetriley commented 7 years ago

Tests are addressed in issue #70. @echolabstech's changes were merged, so this ticket can be closed.