Image support for rtf documents

watercrossing commented 10 years ago

Currently, when opening an rtf document that contains an image, the image is parsed as a Paragraph, with the bytestring of the image as its Text. This makes filtering out images cumbersome - one has to filter out Texts based on their string lengths - and the representation of the document gets messed up too.

This pull request adds basic support for images: a new Image class has been created, basically analogous to the Text class, but it stores all relevant image metadata as defined in the rtf specification. The parser has been extended to fill the Image class appropriately.
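To make the idea concrete, here is a minimal standalone sketch of what an Image type alongside a Text type could look like. It is not the code from this pull request: the `Text` and `Image` dataclasses, the `make_image` and `strip_images` helpers, and the property names are all hypothetical stand-ins. The only thing taken from the rtf specification is that picture data is normally stored as hexadecimal text and that control words such as `\pngblip`, `\picw` and `\pich` describe it.

```python
import binascii
from dataclasses import dataclass, field


@dataclass
class Text:
    """Hypothetical stand-in for a run of text."""
    content: str
    properties: dict = field(default_factory=dict)


@dataclass
class Image:
    """Hypothetical stand-in for an image: raw bytes plus rtf picture metadata."""
    content: bytes
    properties: dict = field(default_factory=dict)


def make_image(hex_data, **properties):
    """Decode hex-encoded picture data (whitespace is ignored) into an Image.

    The keyword arguments are stored as-is, e.g. format="pngblip", picw=10.
    """
    raw = binascii.unhexlify("".join(hex_data.split()))
    return Image(content=raw, properties=properties)


def strip_images(items):
    """Drop images by type instead of guessing from string length."""
    return [item for item in items if not isinstance(item, Image)]
```

With a separate type, filtering becomes an `isinstance` check rather than a length heuristic on Texts.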

I do realise that this project focuses primarily on "marked up text", so an alternative approach would be to drop images entirely, instead of putting them in a new image class.

brendonh commented 10 years ago

This looks pretty great. I'm totally okay with having images in the intermediate documents. I'm not quite so happy about having them contain rtf-specific properties, but I also don't see how it could do much better without a huge amount of work, so it's fine.

Can you add a new sample file with an image in it for demonstration?

watercrossing commented 10 years ago

I agree with you, it would certainly be preferable to drop the rtf-specific instructions - but that would require thinking of something else which could handle the data. I couldn't find a Python library that would abstract the image data away neatly, so I think it's best left at this stage for now.

I have added a sample file, and a small script along the lines of the previous version demonstrating the behaviour.

One other point: many text editors (LibreOffice and MS Word, for example) save each image both in its native format (png, jpg, emf, QuickDraw) and as an uncompressed Windows metafile, because WordPad (and others) can only read Windows metafiles. The test .rtf I added also has both versions. This explains the bloated file size - the original png is 11KB, the uncompressed metafile about 2MB. This pull request just returns both images one after the other, so that the user can choose which one is wanted.
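As a rough illustration of how a user might choose between the two copies, the sketch below picks one image from a group according to a format preference. Everything here is an assumption for illustration: the `Image` dataclass, the `"format"` property key, and the `pick_preferred` helper are hypothetical and not part of this pull request. The format names mirror the rtf picture control words (`\pngblip`, `\jpegblip`, `\emfblip`, `\macpict`, `\wmetafile`).

```python
from dataclasses import dataclass, field


@dataclass
class Image:
    """Hypothetical stand-in for a parsed image."""
    content: bytes
    properties: dict = field(default_factory=dict)


# Assumed ordering: compressed native formats first, Windows metafile last.
PREFERENCE = ("pngblip", "jpegblip", "emfblip", "macpict", "wmetafile")


def pick_preferred(images, preference=PREFERENCE):
    """Return the image whose format ranks highest, or None for an empty list.

    Images with an unknown or missing format rank after all known ones;
    ties keep the earliest image.
    """
    def rank(image):
        fmt = image.properties.get("format")
        return preference.index(fmt) if fmt in preference else len(preference)

    return min(images, key=rank) if images else None
```

Grouping consecutive images that represent the same picture is left to the caller, since the parser simply returns them one after the other.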

brendonh commented 10 years ago

Okay, works for me!

brendonh / pyth

Image support for rtf documents #19