closup / process-xbrl

Image handling #28

Status: Closed (kwheelan closed 1 month ago)

kwheelan commented 2 months ago

Goal

Correctly handle images from Word documents to avoid Arelle validation errors.

Background

Right now, we have very basic functionality to convert Word documents to iXBRL, but we need to improve the conversion. One issue is that the Word-to-HTML conversion package defaults to embedding images directly in the HTML as data URIs, which is not valid XBRL, so the output fails validation. Instead, we want to save the images as separate files and reference their file paths from <img> tags in the XBRL.

Tasks

lucakato commented 1 month ago

@kwheelan Hi, I want to check how this works.

1) Where can I find an example of an object being created from this class? 2) It looks like convert_to_html takes in a docx file, and mammoth.images.img_element calls the function I need to work on, self.convert_image. Do you want this function to take in an 'image' that is a docx file? I don't think I understand exactly what this 'image' refers to in the docx file.

I was reading https://github.com/mwilliamson/mammoth.js?tab=readme-ov-file#image-converters as a reference.

import mammoth

class WordDoc:
    """ Class to represent a Word file """
    def __init__(self, docx_file):
        self.html_content = self.convert_to_html(docx_file)

    def convert_to_html(self, docx_file):
        """ Use mammoth to extract content and images """
        # mammoth calls self.convert_image once per image in the document
        result = mammoth.convert_to_html(docx_file, convert_image = mammoth.images.img_element(self.convert_image))
        return result.value  # the generated HTML string

    def convert_image(self, image):
        """ 
        Save the image; return a dictionary {"src" : <file location>}
        """

kwheelan commented 1 month ago

@lucakato

  1. Assuming you mean the WordDoc class, it's used when creating an ACFR object, which is created when processing the ACFR. More specifically:

In writing this, it's clear I need to improve my docstrings and other documentation. I'll work on this, but please feel free to add clarifying comments to the code as you go along as well.

  2. The 'image' parameter in the convert_image function refers to an individual image element from the Word (.docx) file being processed, not to the entire Word document. The image parameter is a custom object defined by Mammoth. When Mammoth encounters an image in a Word document, it invokes the callback function we specify, in this case self.convert_image.

Here is the rough process that should happen in convert_image:

def convert_image(image):
    """
    Take an image object from the Word document and:
    - Read the underlying data
    - Save it with a unique name in the relevant session folder
    - Return its relative source path
    """
    # example pseudocode for the function
    image_data = image.readAsBase64String()  # or any other method based on your implementation
    image_path = save_to_server(image_data)  # function that saves image data to server and returns the relative path
    return {"src": image_path}

Let me know if you have any other questions!

lucakato commented 1 month ago

@kwheelan Thanks for the clarification, it makes more sense now. I tried calling image.readAsBase64String(), but I get AttributeError: 'Image' object has no attribute 'readAsBase64String'. I'm looking into it, but if you have any ideas please let me know. You can see my code in the branch.

The image's type and content show that it is an Image object created by Mammoth. [image]

kwheelan commented 1 month ago

@lucakato Hmm if that built-in function isn't working, you might have to try something like image.open().read() to get the binary. Let me know if that helps or if you're still stuck. I'll have time on Monday to take a closer look if needed.

lucakato commented 1 month ago

@kwheelan Yep I changed it to

        with image.open() as image_bytes:
            encoded_src = base64.b64encode(image_bytes.read()).decode("ascii")

        return {
            "src": "data:{0};base64,{1}".format(image.content_type, encoded_src)
        }

This is based on https://pypi.org/project/mammoth/#image-converters. I'm still getting an error, and I'm not sure whether it has to do with Mammoth or with other files like table.py.

[image] [image]

kwheelan commented 1 month ago

@lucakato

The code snippet you included encodes the image directly into the converted HTML as a data URI, which is what we're trying to avoid (because it isn't valid inline XBRL). We want to read the image object (probably by reading in the binary as you have, but there might be another way), save the image as a file, and have the HTML reference that file inside an <img> tag.
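
A minimal sketch of that approach, assuming the image object exposes open() and content_type as in the snippet above. The helper name make_image_saver, the src_prefix parameter, and the output folder layout are hypothetical and not taken from the repo:

```python
import mimetypes
import os
import uuid


def make_image_saver(image_dir, src_prefix="images"):
    """Return a callback that writes each image to image_dir and returns its src.

    image_dir and src_prefix are illustrative names, not part of the project.
    """
    os.makedirs(image_dir, exist_ok=True)

    def convert_image(image):
        # Pick a file extension from the MIME type; fall back to .bin if unknown.
        ext = mimetypes.guess_extension(image.content_type or "") or ".bin"
        # A random hex name keeps files from different images from colliding.
        filename = f"{uuid.uuid4().hex}{ext}"
        with image.open() as image_bytes:
            data = image_bytes.read()
        with open(os.path.join(image_dir, filename), "wb") as out:
            out.write(data)
        # Return a relative path for the <img> tag instead of a base64 data URI.
        return {"src": f"{src_prefix}/{filename}"}

    return convert_image


# Possible wiring (requires the mammoth package; the folder name is an assumption):
# saver = make_image_saver("output/images")
# result = mammoth.convert_to_html(
#     docx_file, convert_image=mammoth.images.img_element(saver))
```

The returned dictionary has the same {"src": ...} shape as before, so only the value changes from a data URI to a file path.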

The error does look unrelated, but I can't tell the source from the error log. It might have to do with the format of the sample Excel you uploaded? Which document did you upload?

Keep trying a bit with this image issue, but if you can't figure it out by our meeting on Tuesday, I can assign you something else. I'll also do some research in the meantime. It's definitely not straightforward, so thanks for spending time on this.

lucakato commented 1 month ago

I think the second error was a mistake on my end. I was using basic_test.xlsx, and the error didn't show when I used Clayton.xlsx.

I will see if there are other ways. Thanks!

lucakato commented 1 month ago

@kwheelan Just to check, when you say 'save as a file' does it have to be in a file type like .png etc?

lucakato commented 1 month ago

I think I may have got it working. Could you please check output/9ef... on #38? I tested this on Clayton.xlsx and about 3 docx files. Here are some of the saved images:

[image]

While converting from Excel to inline XBRL, I still saw the error comments appear a few times, although in the end everything ran. It may be this XML file syntax issue (last image pasted).

[image] [image]

kwheelan commented 1 month ago

Great! Saving these as a png is perfect. I think the errors might be because the documents don't have any comments, but we can debug it when we work on the comment parser (not a top priority for the demo).

I'll take a look at your code tomorrow morning and merge if it's ready. In the meantime, feel free to return to #7 to figure out deleting sessions folders and/or start on #37 to improve the file uploads.

kwheelan commented 1 month ago

@lucakato This code runs without errors, but I can't get it to actually render the images in the output file. Were you able to see the images in the output XBRL viewer when you ran it?

lucakato commented 1 month ago

@kwheelan Do you mean the inline viewer? I couldn't find this bar chart, but I was able to view the black lines that are saved as images. [image]

kwheelan commented 1 month ago

Hmm, I'll look again for those, but I don't see any <img> tags at all in the output HTML. I did move the save location for the images, so I might have broken something on my end.
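
One quick way to check this is to list the <img> tags in the converted HTML and flag any that are still inline data URIs. This is a rough sketch using only Python's standard html.parser. The names ImgSrcCollector, find_img_sources, and embedded_sources are hypothetical and not from the repo:

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in an HTML string."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are routed here as well.
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))


def find_img_sources(html):
    """Return the src values of all <img> tags, in document order."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


def embedded_sources(sources):
    """Return the sources that are still inline data URIs."""
    return [s for s in sources if s and s.startswith("data:")]
```

An empty list from find_img_sources would suggest that the images are being dropped during conversion rather than saved to the wrong location.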

lucakato commented 1 month ago

@kwheelan Hi, do you know whether graphs/charts are certain to be represented with <img> tags in HTML? I tried an online converter to see what CA-Clayton-2022-Statistical-Section would look like as HTML. I emailed you the file just now because I can't paste it in here. I inspected the source of the charts in the HTML output, and I don't see <img> tags for the graphs. This online converter may not be reliable, since it may work completely differently from Mammoth. I also see in this comment that charts are not images: https://github.com/mwilliamson/mammoth.js/issues/147#issuecomment-362697430

Wondering if this could be why we don't see any of those images once converted.

I also see this error about v:line, and I'm thinking it could be related too, but I'm not sure exactly what it's referencing.

[image]

kwheelan commented 1 month ago

@lucakato I took a quick look at the converted HTML, and it looks like the online converter maybe just used a bunch of CSS styling on spans to represent the graphs. We definitely want to use <img> tags instead (even if that requires the user to represent charts as images instead of Word charts in the original Word docs). Don't worry about the images for now -- I think I have a workable images solution for the demo (on branch 30-support-for-new-tables), but it doesn't include any of the sessions edits.

If you can close up #7 by figuring out how to delete sessions folders after use and editing any relevant download links, I'll merge it with my images solution after the demo. Post demo, we can spend some time going over the whole code base, and I'll spend time documenting stuff better for you.

lucakato commented 1 month ago

@kwheelan Sounds good, I'll work on it. Thank you.

kwheelan commented 1 month ago

Closed in #51