brendonh / pyth

Python text markup and conversion
MIT License
89 stars 79 forks source link

Hex (encoded images) in `\pict` control groups is not removed. #24

Open jtkiley opened 10 years ago

jtkiley commented 10 years ago

When I read an RTF and write it out as plain text (both with pyth), all of the hex for embedded images is included in the document. As expected, the \pict control group itself is gone.

At the moment, I'm preprocessing these files to wipe out the pict group (hex included) before using pyth, but, of course, it would be nice to avoid that. I'm not familiar enough with RTF versions to know if this is part of the 1.5 spec or a later one. However, these files run perfectly otherwise.

I can send you an example, if needed.

brendonh commented 10 years ago

Alright, so this is from this pull request, which I perhaps didn't think hard enough about: https://github.com/brendonh/pyth/pull/19/files

The stated purpose of that PR is to make it easier to filter out image data, by identifying it as such in the document object. But then none of the writers actually filter it out. Sigh.

I figure the right fix is to make Image a top-level type (instead of a Paragraph subclass), and then update the writers to ignore it. Or something.

jtkiley commented 10 years ago

Having looked again, it looks like the images are getting recognized. I see image objects inside of paragraphs, so it may be as easy just having the writers ignore it.

brendonh commented 10 years ago

Right. So there are two bugs:

  1. Image is a Paragraph subclass, leading writers to interpret it as text instead of (correctly) crashing on an unknown type
  2. Writers don't know to skip it.

Someone should fix that! ;-)

watercrossing commented 10 years ago

@brendonh You should have highlighted me, since I wrote that original image support... I am not sure making an Image a top level class is the correct approach, since images actually appear in the flow of the paragraphs, and currently pyth checks religiously that a Document only contains a Paragraph.

Before my patch #19 Image data would just be interpreted as plain text, and so the output of all the writers hasn't changed. I currently think adding functionality to the writers to ignore/handle the images is the best way forward.

brendonh commented 10 years ago

Ugh you are right on both counts. Okay, I'll do something about it.