christopher-ramirez / secretary

Take the power of Jinja2 templates to OpenOffice and LibreOffice.

can not open odt file generated. #7

Closed ghiewa closed 10 years ago

ghiewa commented 10 years ago

I wrote a very simple test. It successfully produces an odt file, but I cannot open it and get the error shown below.

#-*- coding:utf-8 -*-
from secretary import Render

engine = Render('simple_template.odt')

countries = [
    {
        'country': 'China',
        'capital': 'Beijing',
        'cities':['Suzhou', 'Wuhan', 'Hefei', 'Jiangying'],
    }
]

#Configure custom application filters
#engine.environment.filters['custom_filer'] = filter_function
result = engine.render(countries=countries)

output = open('rendered_document.odt', 'w')
output.write(result)

ERROR

The file 'rendered_document.odt' is corrupt and therefore cannot be opened. LibreOffice can try to repair the file.

The corruption could be the result of document manipulation or of structural document damage due to data transmission.

We recommend that you do not trust the content of the repaired document. Execution of macros is disabled for this document.

Should LibreOffice repair the file?

Test environment

Python 2.7.6, Windows 7

christopher-ramirez commented 10 years ago

Hello! Could you provide me with the template file to check it out?

ghiewa commented 10 years ago

It is just the template file from your repository:

https://github.com/christopher-ramirez/secretary/blob/master/simple_template.odt

christopher-ramirez commented 10 years ago

Thanks for reporting. Actually, the control flow is not the origin of the issue. It is the markdown filter (see page 2 in simple_template.odt). Remove this field from the template and you should be able to create a rendered document. This happens because the variables queried in the markdown filter don't exist in the template data. Anyway, this is unexpected behaviour. I have to debug this.

christopher-ramirez commented 10 years ago

After switching branches, I could not reproduce the error again. This is kind of weird.

ghiewa commented 10 years ago

I tried to open 'rendered.odt' with a zip tool, but I cannot. I am not sure whether that is the root cause.
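A quick way to test whether a generated file is a readable zip container, without relying on a desktop zip tool, is Python's standard `zipfile` module. The sketch below is written for Python 3 (the thread itself uses Python 2.7); `check_odt` is a hypothetical helper name, and the tiny in-memory archive only stands in for a real rendered document.

```python
import io
import zipfile


def check_odt(data):
    """Return a list of problems found in ODT bytes; an empty list means the zip container looks fine."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return ["not a zip archive"]
    problems = []
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # testzip() returns the name of the first corrupt member, or None
            bad_member = archive.testzip()
            if bad_member is not None:
                problems.append("corrupt member: " + bad_member)
            if "content.xml" not in archive.namelist():
                problems.append("content.xml is missing")
    except zipfile.BadZipFile as exc:
        problems.append("unreadable archive: %s" % exc)
    return problems


# Build a tiny stand-in archive in memory (not a complete ODT document)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/vnd.oasis.opendocument.text")
    archive.writestr("content.xml", "<office:document-content/>")

print(check_odt(buffer.getvalue()))
print(check_odt(b"this is not a zip file"))
```

To check a file on disk, read it with `open(path, 'rb').read()` and pass the bytes to the helper.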

ghiewa commented 10 years ago

Meanwhile, I removed the 'title' and 'length' filters, but the error is still there.

ghiewa commented 10 years ago

I got some inspiration from https://pypi.python.org/pypi/py3o.template . The only difference is that it uses Genshi syntax:

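# Standard-library imports used by the excerpt below (Python 2)
import decimal
import os
import zipfile
from StringIO import StringIO

import lxml.etree
from genshi.template import MarkupTemplate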

class Template(object):
    templated_files = ['content.xml', 'styles.xml', 'META-INF/manifest.xml']

    def __init__(self, template, outfile):
        """A template object exposes the API to render it to an OpenOffice
        document.

        @param template: a py3o template file. ie: a OpenDocument with the
        proper py3o markups
        @type template: a string representing the full path name to a py3o
        template file.

        @param outfile: the desired file name for the resulting ODT document
        @type outfile: a string representing the full filename for output
        """
        self.template = template
        self.outputfilename = outfile
        self.infile = zipfile.ZipFile(self.template, 'r')

        self.content_trees = [
            lxml.etree.parse(StringIO(self.infile.read(filename)))
            for filename in self.templated_files
        ]
        self.tree_roots = [tree.getroot() for tree in self.content_trees]

#        self.py3ocontent = lxml.etree.parse(
#            StringIO(self.infile.read("content.xml")))
#        self.py3oroot = self.py3ocontent.getroot()
        self.__prepare_namespaces()

        self.images = {}

    def render_flow(self, data):
        """render the OpenDocument with the user data

        @param data: the input stream of userdata. This should be a
        dictionnary mapping, keys being the values accessible to your
        report.
        @type data: dictionnary
        """

        newdata = dict(
            decimal=decimal,
            format_float=(lambda val: (
                isinstance(val, decimal.Decimal)
                or isinstance(val, float)
            ) and str(val).replace('.', ',') or val),
            format_percentage=(lambda val:
                ("%0.2f %%" % val).replace('.', ',')
            )
        )

        # first we need to transform the py3o template into a valid
        # Genshi template.
        starting_tags, closing_tags = self.__handle_instructions()
        for content_tree, link, py3o_base in starting_tags:
            self.__handle_link(
                content_tree,
                link,
                py3o_base,
                closing_tags[id(link)][1]
            )

        self.__prepare_userfield_decl()
        self.__prepare_usertexts()

        self.__replace_image_links()
        self.__add_images_to_manifest()

        # out = open("content.xml", "w+")
        # out.write(lxml.etree.tostring(self.py3ocontent.getroot()))
        # out.close()
        self.output_streams = list()
        for fnum, content_tree in enumerate(self.content_trees):
            template = MarkupTemplate(
                lxml.etree.tostring(content_tree.getroot())
            )
            # then we need to render the genshi template itself by
            # providing the data to genshi

            self.output_streams.append((
                self.templated_files[fnum],
                template.generate(**dict(data.items() + newdata.items())))
            )

        # then reconstruct a new ODT document with the generated content
        for status in self.__save_output():
            yield status

    def render(self, data):
        """render the OpenDocument with the user data

        @param data: the input stream of userdata. This should be a
        dictionnary mapping, keys being the values accessible to your
        report.
        @type data: dictionnary
        """
        for status in self.render_flow(data):
            if not status:
                raise ValueError("unknown error")

    def __save_output(self):
        """Saves the output into a native OOo document format.
        """
        out = zipfile.ZipFile(self.outputfilename, 'w')

        for info_zip in self.infile.infolist():

            if info_zip.filename in self.templated_files:
                # Template file - we have edited these.

                # get a temp file
                streamout = open(get_secure_filename(), "w+b")
                fname, output_stream = self.output_streams[
                    self.templated_files.index(info_zip.filename)
                ]

                # write the whole stream to it
                for chunk in output_stream.serialize():
                    streamout.write(chunk.encode('utf-8'))
                    yield True

                # close the temp file to flush all data and make sure we get
                # it back when writing to the zip archive.
                streamout.close()

                # write the full file to archive
                out.write(streamout.name, fname)

                # remove tempfile
                os.unlink(streamout.name)

            else:
                # Copy other files straight from the source archive.
                out.writestr(info_zip, self.infile.read(info_zip.filename))

        # Save images in the "Pictures" sub-directory of the archive.
        for identifier, data in self.images.iteritems():
            out.writestr(PY3O_IMAGE_PREFIX + identifier, data)

        # close the zipfile before leaving
        out.close()
        yield True

christopher-ramirez commented 10 years ago

@ghiewa, are you still having this issue?

ghiewa commented 10 years ago

Yes, I am stuck. I found that I cannot open the generated odt file with a zip tool. Is this the root cause?

christopher-ramirez commented 10 years ago

Did you change the generated document's extension to .zip?

ghiewa commented 10 years ago

Yes, I used the command 'copy rendered.odt rendered.zip', but I failed to open the new file with a zip tool.

christopher-ramirez commented 10 years ago

Could you provide me with a copy of rendered.odt?

ghiewa commented 10 years ago

I have sent it to you by email, from ghiewa [at] 126.com to chris.ramirezg [at] gmail [dot] com.

ghiewa commented 10 years ago

@christopher-ramirez, the system tells me it is a broken zip file when I use a zip tool to open the odt file your script generated. If I change

with zipfile.ZipFile(self.rendered, 'w') as packed_template:

to

with zipfile.ZipFile('out.odt', 'w') as packed_template:

I get the file I wanted. I do not know why.

ghiewa commented 10 years ago

Here is another version I revised based on your code. It works well for me.

#!/usr/bin/python
# -*- encoding: utf-8 -*-

"""
Secretary
Take the power of Jinja2 templates to OpenOffice and LibreOffice.

This file implements Render. Render provides an interface to render
Open Document Format (ODF) documents to be used as templates using
the jinja2 template engine. To render a template:
    engine = Render(template_file)
    result = engine.render(template_var1=...)
"""
from __future__ import unicode_literals, print_function

import re
import sys
import zipfile
import io
import os
import tempfile
from cStringIO import StringIO
import lxml.etree
from xml.dom.minidom import parseString
from jinja2 import Environment, Undefined

import logging
logging.basicConfig(filename='log.log', level=logging.INFO)

def get_secure_filename():
    """creates a tempfile in the most secure manner possible,
    make sure is it closed and return the filename for
    easy usage.
    """

    file_handle, filename = tempfile.mkstemp()
    tmpfile = os.fdopen(file_handle, "r")
    tmpfile.close()
    return filename

# ---- Exceptions
class SecretaryError(Exception):
    pass

class UndefinedSilently(Undefined):
    # Silently undefined,
    # see http://stackoverflow.com/questions/6182498/jinja2-how-to-make-it-fail-silently-like-djangotemplate
    def silently_undefined(*args, **kwargs):
        return ''

    return_new = lambda *args, **kwargs: UndefinedSilently()

    __unicode__ = silently_undefined
    __str__ = silently_undefined
    __call__ = return_new
    __getattr__ = return_new

# ************************************************
#
#           SECRETARY FILTERS
#
# ************************************************

def pad_string(value, length=5):
    value = str(value)
    return value.zfill(length)

class Render(object):
    """
        Main engine to convert and ODT document into a jinja
        compatible template.

        Basic use example:
            engine = Render('template')
            result = engine.render()

        Render provides an enviroment variable which can be used
        to provide custom filters to the ODF render.

            engine = Render('template.odt')
            engine.environment.filters['custom_filer'] = filter_function
            result = engine.render()
    """

    templated_files = ['content.xml', 'styles.xml', 'META-INF/manifest.xml']

    def __init__(self, template, outfile, **kwargs):
        """
        Builds a Render instance and initializes the internal environment.
        Params:
            template: Either the path to the file, or a file-like object.
                      If it is a path, the file will be opened in read mode 'r'.
            outfile: Path of the rendered ODT document to write.
        """

        self.template = template
        self.outputfilename = outfile

        self.environment = Environment(undefined=UndefinedSilently, autoescape=True)

        # Register provided filters
        self.environment.filters['pad'] = pad_string
        self.environment.filters['markdown'] = self.markdown_filter

    def unpack_template(self):
        """
            Loads the template into a ZIP file, allowing to make
            CRUD operations into the ZIP archive.
        """
        self.infile = zipfile.ZipFile(self.template, 'r')
        self.content_trees = [parseString(self.infile.read(filename)) for filename in self.templated_files]

        self.content = parseString(self.infile.read('content.xml'))

    def pack_document(self):
        # Save rendered content and headers
        out = zipfile.ZipFile(self.outputfilename, 'w')
        for info_zip in self.infile.infolist():
            if info_zip.filename in self.templated_files:
                streamout = open(get_secure_filename(), "w+b")
                fname, output_stream = self.output_streams[
                    self.templated_files.index(info_zip.filename)
                ]
                streamout.write(output_stream.encode('utf-8'))
                streamout.close()
                out.write(streamout.name, fname)
                os.unlink(streamout.name)
            else:
                # Copy other files straight from the source archive.
                out.writestr(info_zip, self.infile.read(info_zip.filename))
        out.close()

    def render(self, **kwargs):
        """
            Unpack and render the internal template, then write
            the rendered ODF document to self.outputfilename.
        """

        self.unpack_template()

        self.output_streams = list()
        for fnum, content_tree in enumerate(self.content_trees):
            self.prepare_template_tags(content_tree)
            template = self.environment.from_string(content_tree.toxml())
            result = template.render(**kwargs)
            self.output_streams.append((
                self.templated_files[fnum],
                result)
            )

        self.pack_document()

    def node_parents(self, node, parent_type):
        """
            Returns the first node's parent with name  of parent_type
            If parent "text:p" is not found, returns None.
        """

        if hasattr(node, 'parentNode'):
            if node.parentNode.nodeName.lower() == parent_type:
                return node.parentNode
            else:
                return self.node_parents(node.parentNode, parent_type)
        else:
            return None

    def create_text_span_node(self, xml_document, content):
        span = xml_document.createElement('text:span')
        text_node = self.create_text_node(xml_document, content)
        span.appendChild(text_node)

        return span

    def create_text_node(self, xml_document, text):
        """
        Creates a text node
        """
        return xml_document.createTextNode(text)

    def prepare_template_tags(self, xml_document):
        """
            Search every field node in the inner template and
            replace them with a <text:span> field. Flow tags are
            replaced with a blank node and moved into the ancestor
            tag defined in description field attribute.
        """
        fields = xml_document.getElementsByTagName('text:text-input')

        for field in fields:
            if field.hasChildNodes():
                field_content = field.childNodes[0].data.replace('\n', '')

                jinja_tags = re.findall(r'(\{.*?\}*})', field_content)
                if not jinja_tags:
                    # Field does not contains jinja template tags
                    continue

                field_description = field.getAttribute('text:description')

                if re.findall(r'\|markdown', field_content):
                    # a markdown should take the whole paragraph
                    field_description = 'text:p'

                if not field_description:
                    new_node = self.create_text_span_node(xml_document, field_content)
                else:
                    if field_description in \
                        ['text:p', 'table:table-row', 'table:table-cell']:
                        field = self.node_parents(field, field_description)

                    new_node = self.create_text_node(xml_document, field_content)

                parent = field.parentNode
                parent.insertBefore(new_node, field)
                parent.removeChild(field)

    def get_style_by_name(self, style_name):
        """
            Search in <office:automatic-styles> for style_name.
            Return None if style_name is not found. Otherwise
            return the style node
        """

        auto_styles = self.content.getElementsByTagName('office:automatic-styles')[0]

        if not auto_styles.hasChildNodes():
            return None

        for style_node in auto_styles.childNodes:
            if style_node.hasAttribute('style:name') and \
               (style_node.getAttribute('style:name') == style_name):
               return style_node

        return None

    def insert_style_in_content(self, style_name, attributes=None,
        **style_properties):
        """
            Insert a new style into content.xml's <office:automatic-styles> node.
            Returns a reference to the newly created node
        """

        auto_styles = self.content.getElementsByTagName('office:automatic-styles')[0]
        style_node = self.content.createElement('style:style')

        style_node.setAttribute('style:name', style_name)
        style_node.setAttribute('style:family', 'text')
        style_node.setAttribute('style:parent-style-name', 'Standard')

        if attributes:
            for k, v in attributes.iteritems():
                style_node.setAttribute('style:%s' % k, v)

        if style_properties:
            style_prop = self.content.createElement('style:text-properties')
            for k, v in style_properties.iteritems():
                style_prop.setAttribute('%s' % k, v)

            style_node.appendChild(style_prop)

        return auto_styles.appendChild(style_node)

    def markdown_filter(self, markdown_text):
        """
            Convert a markdown text into a ODT formated text
        """

        if not isinstance(markdown_text, basestring):
            return ''

        from xml.dom import Node
        from markdown_map import transform_map

        try:
            from markdown2 import markdown
        except ImportError:
            raise SecretaryError('Could not import markdown2 library. Install it using "pip install markdown2"')

        styles_cache = {}   # cache styles searching
        html_text = markdown(markdown_text)
        xml_object = parseString('<html>%s</html>' % html_text)

        # Transform HTML tags as specified in transform_map
        # Some tags may require extra attributes in ODT.
        # Additional attributes are indicated in the 'attributes' property

        for tag in transform_map:
            html_nodes = xml_object.getElementsByTagName(tag)
            for html_node in html_nodes:
                odt_node = xml_object.createElement(transform_map[tag]['replace_with'])

                # Transfer child nodes
                if html_node.hasChildNodes():
                    for child_node in html_node.childNodes:
                        odt_node.appendChild(child_node.cloneNode(True))

                # Add style-attributes defined in transform_map
                if 'style_attributes' in transform_map[tag]:
                    for k, v in transform_map[tag]['style_attributes'].iteritems():
                        odt_node.setAttribute('text:%s' % k, v)

                # Add defined attributes
                if 'attributes' in transform_map[tag]:
                    for k, v in transform_map[tag]['attributes'].iteritems():
                        odt_node.setAttribute(k, v)

                    # copy original href attribute in <a> tag
                    if tag == 'a':
                        if html_node.hasAttribute('href'):
                            odt_node.setAttribute('xlink:href',
                                html_node.getAttribute('href'))

                # Does the node need to create an style?
                if 'style' in transform_map[tag]:
                    name = transform_map[tag]['style']['name']
                    if not name in styles_cache:
                        style_node = self.get_style_by_name(name)

                        if style_node is None:
                            # Create and cache the style node
                            style_node = self.insert_style_in_content(
                                name, transform_map[tag]['style'].get('attributes', None),
                                **transform_map[tag]['style']['properties'])
                            styles_cache[name] = style_node

                html_node.parentNode.replaceChild(odt_node, html_node)

        def node_to_string(node):
            result = node.toxml()

            # linebreaks in preformated nodes should be converted to <text:line-break/>
            if (node.__class__.__name__ != 'Text') and \
                (node.getAttribute('text:style-name') == 'Preformatted_20_Text'):
                result = result.replace('\n', '<text:line-break/>')

            # All double linebreak should be replaced with an empty paragraph
            return result.replace('\n\n', '<text:p text:style-name="Standard"/>')

        return ''.join(node_as_str for node_as_str in map(node_to_string,
                xml_object.getElementsByTagName('html')[0].childNodes))

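def render_template(template, outfile, **kwargs):
    """
        Render an ODF template file and write the result to outfile
    """

    # Render requires both the template and the output path
    engine = Render(template, outfile)
    return engine.render(**kwargs)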

if __name__ == "__main__":
    import os
    from datetime import datetime

    def read(fname):
        return open(os.path.join(os.path.dirname(__file__), fname)).read()

    document = {
        'datetime': datetime.now(),
        'md_sample': read('README.md')
    }

    countries = [
        {'country': 'United States', 'capital': 'Washington', 'cities': ['miami', 'new york', 'california', 'texas', 'atlanta']},
        {'country': 'England', 'capital': 'London', 'cities': ['gales']},
        {'country': 'Japan', 'capital': 'Tokio', 'cities': ['hiroshima', 'nagazaki']},
        {'country': 'Nicaragua', 'capital': 'Managua', 'cities': ['león', 'granada', 'masaya']},
        {'country': 'Argentina', 'capital': 'Buenos aires'},
        {'country': 'Chile', 'capital': 'Santiago'},
        {'country': 'Mexico', 'capital': 'MExico City', 'cities': ['puebla', 'cancun']},
    ]

    render = Render('simple_template.odt', 'simple_template_out.odt')
    result = render.render(countries=countries, document=document)
    print("Template rendering finished! Check simple_template_out.odt file.")

christopher-ramirez commented 10 years ago

I will take a look at that.

christopher-ramirez commented 10 years ago

I had to install a test environment using Windows 7. The cause of the error is the open mode in output = open('rendered_document.odt', 'w'). It should be changed to output = open('rendered_document.odt', 'wb').

The official open documentation states:

The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading. Thus, when opening a binary file, you should append 'b' to the mode value to open the file in binary mode, which will improve portability.

On Windows, text mode must be translating every \n byte (decimal 10) in the zip data into \r\n (decimal 13, 10), thus corrupting the final ODT.
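The effect can be simulated on any platform. The sketch below is only an illustration, written for Python 3: it builds a small stand-in archive in memory, then applies the same \n to \r\n substitution that Windows text mode performs on write, and checks whether the result can still be read. The helper names `build_archive` and `is_intact` are hypothetical and not part of secretary.

```python
import io
import zipfile


def build_archive():
    """Create a small uncompressed zip whose payload contains newline bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        archive.writestr(
            "content.xml",
            "<p>line one</p>\n<p>line two</p>\n<p>line three</p>\n",
        )
    return buffer.getvalue()


def is_intact(data):
    """Return True only if the bytes open as a zip and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return archive.testzip() is None
    except Exception:
        # Any parsing failure is treated as a corrupt archive
        return False


original = build_archive()
# Same substitution Windows text mode applies when writing
translated = original.replace(b"\n", b"\r\n")

print(len(original), len(translated))
print(is_intact(original), is_intact(translated))
```

The extra bytes shift every offset recorded in the zip headers, which is why the archive can no longer be read even though the payload text looks almost unchanged.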

I will update the sample code to force a binary write format.
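A minimal sketch of the binary write, assuming the rendered document is available as a byte string. It is written for Python 3; `save_document` is a hypothetical helper, and the payload below is placeholder data rather than real `engine.render(...)` output.

```python
import os
import tempfile


def save_document(data, path):
    """Write rendered bytes to path in binary mode so no newline translation occurs."""
    with open(path, "wb") as output:
        output.write(data)


# Placeholder bytes standing in for the rendered ODT content
payload = b"PK\x03\x04 line one\n line two\n"
target = os.path.join(tempfile.mkdtemp(), "rendered_document.odt")
save_document(payload, target)

# Read the file back in binary mode to confirm the bytes are unchanged
with open(target, "rb") as saved:
    round_trip = saved.read()

print(round_trip == payload)
```

Using a `with` block also closes the file, which the original test script never did explicitly.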

Thanks for reporting.