buriy / python-readability

fast python port of arc90's readability tool, updated to match latest readability.js!
https://github.com/buriy/python-readability
Apache License 2.0
2.65k stars 348 forks

Pass LXML object straight to readability? #140

Open adbar opened 4 years ago

adbar commented 4 years ago

As of now only strings containing HTML seem to be acceptable input.

Is there a way to pass an object parsed by LXML or lxml.html (types: etree._ElementTree and html.HtmlElement) straight to Document() or should we create one?

buriy commented 4 years ago

The library modifies the lxml document internally, which is why I would avoid accepting one in the public version. Now that you've been warned, you can subclass Document and override the _parse method with the behavior you need, e.g.

    def _parse(self, input):
        return input

or

    def _parse(self, input):
        return convert_and_deepcopy(input)

Then just use doc = Document(your_tree)

You can also make a PR that does this: check the input type and, if it is an lxml document (or an etree document), make a copy of it before processing.
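
For illustration, here is a minimal sketch of what a helper like convert_and_deepcopy might look like. The name comes from the snippet above, but this body is an assumption and not code from the library. It only unwraps an ElementTree-style object to its root element and deep-copies it, so the caller's tree is not modified. The demo uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without lxml installed; lxml's _ElementTree also exposes getroot(), so the same duck-typed check should apply there.

```python
import copy
import xml.etree.ElementTree as ET  # stdlib stand-in so the sketch runs without lxml


def convert_and_deepcopy(tree):
    """Hypothetical helper: return an independent copy of the root element."""
    # ElementTree-style wrappers expose getroot(); bare elements do not.
    root = tree.getroot() if hasattr(tree, "getroot") else tree
    # Deep-copy so that later in-place edits never touch the caller's tree.
    return copy.deepcopy(root)


# Demo: mutate the working copy and check that the original is untouched.
original = ET.fromstring("<html><body><p>keep me</p><div>ad</div></body></html>")
working = convert_and_deepcopy(original)

body = working.find("body")
body.remove(body.find("div"))  # simulates the kind of in-place edit a cleaner makes

remaining_in_original = len(original.find("body"))
remaining_in_copy = len(working.find("body"))

# The same helper also accepts an ElementTree wrapper.
root_from_wrapper = convert_and_deepcopy(ET.ElementTree(original))
```

Returning the input unchanged from _parse avoids the cost of the copy, but any in-place changes then land on the tree you passed in, so the copying variant is the safer default when the tree is reused afterwards.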

adbar commented 4 years ago

Thanks for the answer, I added a bypass to PR #138