brendonh / pyth

Python text markup and conversion
MIT License

Excludes `Image` objects when assembling plaintext content to write. #25

Open jtkiley opened 10 years ago

jtkiley commented 10 years ago

Fixes #24.

Obviously, this is the simple fix. When I tried stopping `Image` from inheriting from `Paragraph`, I didn't get errors, but without this change I still got the image hex in files. I'm still a little fuzzy on the finer points of the RTF spec and the reader's logic, so I probably need to clear that up before working on `Image`.
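For readers skimming the thread, the idea of the simple fix can be sketched roughly as below. This is only an illustration under stated assumptions: the `Paragraph` and `Image` classes here are minimal stand-ins written for this example (they are not pyth's actual definitions), and `assemble_plaintext` is a hypothetical helper, not the writer's real method.

```python
# Stand-in classes for illustration only; NOT pyth's real definitions.
class Paragraph:
    def __init__(self, content=None):
        # content is assumed to be a list of text fragments
        self.content = list(content or [])


class Image(Paragraph):
    """Mirrors the situation in this thread: Image inherits from Paragraph."""


def assemble_plaintext(paragraphs):
    """Hypothetical helper: join paragraph text, skipping Image objects."""
    lines = []
    for paragraph in paragraphs:
        # Image is a Paragraph subclass, so it has to be excluded explicitly;
        # otherwise its hex payload is written out like ordinary text.
        if isinstance(paragraph, Image):
            continue
        lines.append("".join(paragraph.content))
    return "\n".join(lines)


doc = [
    Paragraph(["Hello, ", "world"]),
    Image(["89504e47"]),  # hex payload that should not reach the output
    Paragraph(["Bye"]),
]
print(assemble_plaintext(doc))
```

The explicit `isinstance` check is what makes this a narrow fix: it leaves the class hierarchy alone and only changes what gets written.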

watercrossing commented 10 years ago

That will do the trick, even though it's a bit hackish... I don't know if people would like this, but it might be useful to include a snippet in the text file, such as `{Image stripped, 123 bytes}` or some other information, explaining that an image used to be here?
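A placeholder like that could be made opt-in, so the default output gains no extra text. The sketch below is a rough illustration, not a proposal for pyth's actual API: `Paragraph` and `Image` are stand-in classes written for this example, and both `assemble_plaintext` and its `image_placeholder` parameter are hypothetical names. It also assumes the image payload is held as a list of `bytes` chunks.

```python
# Stand-in classes for illustration only; NOT pyth's real definitions.
class Paragraph:
    def __init__(self, content=None):
        self.content = list(content or [])


class Image(Paragraph):
    """Stand-in image; content is assumed to be a list of bytes chunks."""


def assemble_plaintext(paragraphs, image_placeholder=None):
    """Hypothetical helper: skip images, or replace them with a short note.

    image_placeholder is a str.format template with a {size} field,
    or None (the default) to drop images silently.
    """
    lines = []
    for paragraph in paragraphs:
        if isinstance(paragraph, Image):
            if image_placeholder is not None:
                size = sum(len(chunk) for chunk in paragraph.content)
                lines.append(image_placeholder.format(size=size))
            continue
        lines.append("".join(paragraph.content))
    return "\n".join(lines)


doc = [Paragraph(["Before"]), Image([b"\x89PNG"]), Paragraph(["After"])]

# Default: the image is dropped and nothing is added to the text.
print(assemble_plaintext(doc))

# Opt-in: doubled braces produce literal braces in the formatted note.
print(assemble_plaintext(doc, "{{Image stripped, {size} bytes}}"))
```

Keeping the default as `None` means existing plain-text output stays free of inserted text, while callers who want to know an image was present can ask for the note.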

jtkiley commented 10 years ago

I agree that it's a specific and not-at-all pretty fix. I'm just not familiar enough with pyth and the finer points of the RTF format to intelligently make changes to the design.

As for the snippet, I do a lot of content analysis, and I use pyth to process RTFs into plain text. It's probably my specific research use case, but I'm wary of adding text into a document. Also, the images in my documents are an artifact of the data provider (not the original data). It may be a good option, though. If I were looking at documents with "real" embedded images, being able to capture that fact might lead to interesting results. I would guess that a lot of use cases would similarly be interested in at least knowing about images.