cheng10 / WARC-Portal

The project is being built for digital humanities and social science researchers who wish to access web archive material in their research process.
http://warc.tech
MIT License

remove all the html tags from doc content #38

Closed cheng10 closed 7 years ago

cheng10 commented 7 years ago

bs4 did not remove some script tags, or tags inside HTML comments.

heykevin commented 7 years ago

http://stackoverflow.com/questions/5598524/can-i-remove-script-tags-with-beautifulsoup I'm guessing you've tried this. Do you have a before/after example of the contents, i.e. the raw input and what bs4 produces from it?
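
For reference, the approach in that answer looks roughly like the sketch below. The sample HTML string is made up for illustration and is not taken from the project's data; the parser choice ("html.parser") is also an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical input, only for illustration.
html = "<html><head><script>var x = 1;</script></head><body><p>Hello</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# extract() detaches each <script> tag from the tree and returns it,
# so its contents no longer show up when the text is read back.
for tag in soup.find_all("script"):
    tag.extract()

text = soup.get_text()
```

Printing the raw html next to text gives the kind of before/after comparison asked about above.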

cheng10 commented 7 years ago

@heykevin I read this before. We can try it. Thanks, Kevin.

heykevin commented 7 years ago

Instead of extract(), a lot of the other solutions mention using .decompose(). Not sure if you've looked at that. http://stackoverflow.com/a/19438513
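
A minimal sketch of the decompose() route, which also drops HTML comments explicitly so that markup hidden inside them cannot leak into the output. The function name strip_markup and the sample input are hypothetical, and the choice of "html.parser" is an assumption; the project may use a different parser.

```python
from bs4 import BeautifulSoup, Comment


def strip_markup(html):
    """Return the visible text of an HTML document (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")

    # decompose() removes the tag and destroys it, unlike extract(),
    # which returns the detached tag for possible reuse.
    for tag in soup(["script", "style"]):
        tag.decompose()

    # Comments are a NavigableString subclass; remove them one by one.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    # Join text nodes with spaces and collapse runs of whitespace.
    return " ".join(soup.get_text(separator=" ").split())


sample = "<p>Hi</p><!-- <b>hidden</b> --><script>alert(1)</script><style>p{}</style><p>there</p>"
cleaned = strip_markup(sample)
```

If scripts or comments still come through after this, the input is probably malformed enough to confuse the parser, and trying a more lenient parser such as html5lib may be worth a look.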