CenterForOpenScience / pydocx

An extendable docx file format parser and converter
Other
183 stars 55 forks source link

Why does this function generate duplicate images for the same source? #247

Closed bitscompagnie closed 6 years ago

bitscompagnie commented 6 years ago

Hello Pydocx community,

I am having a problem with the following function and the problem is that it is generating/saving duplicates images to a local folder for the same source (Base64 string).

How can I prevent it from generating duplicate images?

Here is the function:

def get_image_tag(self, image, width=None, height=None, rotate=None):

    img_src = self.get_image_source(image)
    if img_src:
        # Getting images from the source
        attrs = {
            'src': img_src
        }
        # get base64 file extension from bytes
        # https://matthewdaly.co.uk/blog/2015/07/04/handling-images-as-base64-strings-with-django-rest-framework/

        format, img_src2 = img_src.split(';base64,') # format ~= data:image/X,
        ext = format.split('/')[-1] # guess file extension
        # Capture the generated filename with the proper extension to use in img source attribute
        img_src_new = 'img_' + image_name() + '.' + ext
        # Function to convert base64 string to image using urlretireve
        urlretrieve(img_src, 'c:/git/output/' + img_src_new)

        # Set the image source to the newly created filename
        attrs = {
            'src': img_src_new
        }
    if width and height:
        attrs['width'] = width
        attrs['height'] = height
    if rotate:
        attrs['style'] = 'transform: rotate(%sdeg);' % rotate
    return HtmlTag('img', allow_self_closing=True, allow_whitespace=True, **attrs))

Thanks for your help.

jlward commented 6 years ago

If they are duplicate images with different names, then it makes sense that they would require multiple files. And I don't think we would want to change that.

bitscompagnie commented 6 years ago

The images are different and are from two different sources.

Here is a sample document that I tested and it is giving duplicates images.

test_doc.docx

Thanks for your reply.

jlward commented 6 years ago
<?xml version="1.0"?>
...
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId8" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image2.png"/>
<Relationship Id="rId7" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.png"/>
...
</Relationships>

The document you linked has two images displayed in the file and the rels file has paths to two images. I assume two files are being written. They are both different and they have two different names. I don't see how duplicates can be happening. Am I missing something?

ebitsdev commented 6 years ago

4 files are being written and that’s why I posted the question.

The same image is being written twice. Is there something wrong with my override of get_image_tag function?

jlward commented 6 years ago

Probably, I can't say without seeing how image_name is defined. Or without knowing what files are being written, and which ones are duplicates of which.

bitscompagnie commented 6 years ago

Here are the two functions I am using:

# function to generate random number for insertion in filename
def random_key(length):
    key = ''
    for i in range(length):
        key += random.choice(string.digits)
    return key
# Function to generate random image names
def image_name():
    """Docstring here."""
    return '{}'.format(os.path.join(IMAGE_LOCATION, random_key(4)))
jlward commented 6 years ago

Since image_name is being called every single time the parser finds an image tag, it is generating a new file name (with four random letters). I would suggest using the md5 hash of the image content instead of generating random digits. This will remove the duplicates. However, it will do the work of hashing each time you run into the tag. If you want to avoid that, you'll need to build some sort of cache on the parser object that will take the image name and use that as the key. This will also prevent duplicates.

bitscompagnie commented 6 years ago

Thanks for your feedback.

jlward commented 6 years ago

No problem :)