collective / transmogrify.htmlcontentextractor

This blueprint extracts the title, description and body from HTML, either via XPath or by automatic cluster analysis
http://pypi.python.org/pypi/transmogrify.

is there a predefined way to drop element attributes? #1

Open simahawk opened 13 years ago

simahawk commented 13 years ago

Hi, I need to drop a lot of hard-coded "style" attributes from my HTML source. Is there a parameter that takes an XPath (or similar) and drops specific attributes before the import?

Thanks

djay commented 13 years ago

The standard way to drop something in htmlcontentextractor is to create a rule for a dummy field which you'll never use. Since every rule cuts the content it selects out of the HTML, this effectively removes that part of the HTML. More details in the docs: https://github.com/djay/transmogrify.htmlcontentextractor/blob/master/transmogrify/htmlcontentextractor/templatefinder.txt

I've never tried it on attributes however. If it doesn't work then there should be a way to make it work :)

Failing that, there is also a regex find-and-replace feature in transmogrify.webcrawler... but regex on HTML is a pain.

simahawk commented 13 years ago

I solved it by using http://lxml.de/lxmlhtml.html#cleaning-up-html in a custom blueprint in a custom package. I think it is probably worth including such a blueprint in transmogrify.htmlcontentextractor and making it configurable with these parameters: http://lxml.de/api/lxml.html.clean.Cleaner-class.html. What do you think?
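
The custom blueprint itself is not shown in this thread. For reference, here is a minimal sketch of the attribute-dropping step only, assuming plain lxml.html with an XPath selector rather than the Cleaner class linked above. The function name drop_attributes and its parameters are hypothetical, chosen for illustration; they are not part of transmogrify.htmlcontentextractor or lxml.

```python
import lxml.html


def drop_attributes(html, xpath="//*[@style]", attrs=("style",)):
    """Return html with the given attributes removed from every element matching xpath.

    Hypothetical helper for illustration; not part of any package in this thread.
    """
    root = lxml.html.fromstring(html)
    for element in root.xpath(xpath):
        for name in attrs:
            # pop() with a default does nothing when the attribute is absent
            element.attrib.pop(name, None)
    return lxml.html.tostring(root, encoding="unicode")


cleaned = drop_attributes(
    '<div style="color:red"><p style="margin:0" class="x">Hi</p></div>'
)
print(cleaned)
```

In a transmogrifier section, something like this would presumably be applied to each item's HTML field before the item is yielded to the next section; that wiring is left out here because it depends on the pipeline.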