REMitchell / python-scraping

Code samples from the book Web Scraping with Python http://shop.oreilly.com/product/0636920034391.do

Update 6 readDocx.py --- an xml parser missing #50

Open ZhanpengZhang opened 7 years ago

ZhanpengZhang commented 7 years ago

In Python 3.5, I tried the commonly used parsers "html.parser" and "lxml", but neither worked: with either one, wordObj.findAll("w:t") always returns an empty list []. With "xml" as the parser, the same call returns what I expect, which is

[<w:t>A Word Document on a Website</w:t>, <w:t>This is a Word document, full of content that you want very much. Unfortunately, it’s difficult to access because I’m putting it on my website as a .</w:t>, <w:t>docx</w:t>, <w:t xml:space="preserve"> file, rather than just publishing it as HTML</w:t>].
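For anyone who wants to cross-check which text runs a .docx actually contains without depending on the BeautifulSoup parser choice, here is a minimal sketch that uses only the standard library (zipfile plus xml.etree.ElementTree). This is not the book's readDocx.py; it is a swapped-in technique for comparison. The helper names extract_text_runs and make_sample_docx, and the tiny in-memory sample document, are my own assumptions for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace that the "w:" prefix maps to in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_text_runs(docx_bytes):
    """Return the text of every <w:t> element in a .docx passed as bytes."""
    # A .docx file is a zip archive; the body text lives in word/document.xml
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        xml_content = archive.read("word/document.xml")
    root = ET.fromstring(xml_content)
    # ElementTree addresses namespaced tags as "{namespace-uri}localname"
    return [node.text or "" for node in root.iter("{%s}t" % W_NS)]


def make_sample_docx():
    """Build a tiny in-memory stand-in for a .docx (hypothetical test data)."""
    document_xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<w:document xmlns:w="%s"><w:body>'
        '<w:p><w:r><w:t>A Word Document on a Website</w:t></w:r></w:p>'
        '<w:p><w:r><w:t xml:space="preserve"> file</w:t></w:r></w:p>'
        '</w:body></w:document>' % W_NS
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document_xml)
    return buffer.getvalue()


if __name__ == "__main__":
    print(extract_text_runs(make_sample_docx()))
```

To run it against a real file, read the file's bytes and pass them to extract_text_runs. Because it matches on the namespace URI rather than the literal "w:t" prefix, the result should not depend on how a given HTML parser treats colons in tag names.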

Looking forward to your reply.
This is a great book, and let's make it even better!