CenterForOpenScience / pydocx

An extendable docx file format parser and converter

Question: How to install a specific PR, export embedded images? #242

Closed bitscompagnie closed 6 years ago

bitscompagnie commented 6 years ago

Hello,

Thanks for this great piece of software. It seems very flexible in handling MS Word documents conversion to HTML. I just discovered it yesterday and have gone through the documentation to run a few test conversions. They all seem to work without any problem on both Mac OS and Windows.

Two things that I need help with are:

  1. How do we install a specific pull request with fixes that we want implemented? Or what is the best way to merge the fixes in a pull request/commit with the stable version?
  2. Any way to export the embedded images to external files?

Thanks for your help.

jlward commented 6 years ago

Hello,

I'm glad you found our project and like it.

  1. https://stackoverflow.com/questions/20101834/pip-install-from-git-repo-branch Installing from a github branch is a common enough task that pip has made it easy to do.
  2. It is possible to have images exported to external files. Our implementation actually takes the images and uploads them to Amazon S3. To do so, you'll need to extend the exporter and override get_image_tag. The default implementation can be found in pydocx.export.html. You should be able to use that as an example of what needs to be done and make whatever changes to image handling you need.
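
As a rough illustration of point 2, here is a minimal sketch of an exporter subclass that writes embedded images to disk instead of inlining them. It is not our S3 implementation. The names `save_image`, `ExternalImageExporter` and `image_dir` are hypothetical, and the `get_image_tag` signature, the `image.stream` / `image.uri` attributes and the `HtmlTag` arguments are assumptions based on the default exporter in pydocx.export.html, so check them against the version you have installed.

```python
import os
import posixpath

try:
    # Assumption: both names are importable from pydocx.export.html.
    from pydocx.export.html import HtmlTag, PyDocXHTMLExporter
except ImportError:  # pydocx is not installed
    HtmlTag = PyDocXHTMLExporter = None


def save_image(data, uri, out_dir):
    """Write raw image bytes into out_dir and return the file path."""
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    # Package URIs use forward slashes, e.g. /word/media/image1.png
    filename = posixpath.basename(uri) or "image"
    path = os.path.join(out_dir, filename)
    with open(path, "wb") as handle:
        handle.write(data)
    return path


if PyDocXHTMLExporter is not None:

    class ExternalImageExporter(PyDocXHTMLExporter):
        # Hypothetical setting: folder that receives the extracted images.
        image_dir = "images"

        def get_image_tag(self, image, width=None, height=None, rotate=None):
            # Assumption: embedded images expose .stream and .uri.
            stream = getattr(image, "stream", None)
            if stream is None:
                # Nothing to extract, so keep the default behaviour.
                return super(ExternalImageExporter, self).get_image_tag(
                    image, width=width, height=height, rotate=rotate)
            stream.seek(0)
            path = save_image(stream.read(), image.uri, self.image_dir)
            attrs = {"src": path.replace(os.sep, "/")}
            if width and height:
                attrs["width"] = width
                attrs["height"] = height
            # Rotation is ignored in this sketch.
            return HtmlTag("img", allow_self_closing=True, **attrs)
```

If those assumptions hold, `ExternalImageExporter("testdocument.docx").export()` should return HTML whose img tags point at files under images/ rather than at inline base64 data.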

Please let us know if you have any more questions.

bitscompagnie commented 6 years ago

Thanks for replying to my question.

I have managed to install the commit from issue #231, as I need to retain highlighted text in a table. But when I try to convert a document, I get the following error. Any suggestions on how to address the error below?

Sorry for the dump.

c:\Python27\Scripts>pydocx.exe --html testdocument.docx testdocout.html
Traceback (most recent call last):
  File "c:\Python27\Scripts\pydocx-script.py", line 11, in <module>
    load_entry_point('PyDocX==0.9.10', 'console_scripts', 'pydocx')()
  File "c:\python27\lib\site-packages\pydocx\__main__.py", line 49, in cli
    sys.exit(main(args=sys.argv[1:]) or 0)
  File "c:\python27\lib\site-packages\pydocx\__main__.py", line 44, in main
    return convert(output_type, docx_path, output_path)
  File "c:\python27\lib\site-packages\pydocx\__main__.py", line 15, in convert
    output = PyDocX.to_html(docx_path)
  File "c:\python27\lib\site-packages\pydocx\pydocx.py", line 13, in to_html
    return PyDocXHTMLExporter(path_or_stream).export()
  File "c:\python27\lib\site-packages\pydocx\export\html.py", line 211, in export
    for result in super(PyDocXHTMLExporter, self).export()
  File "c:\python27\lib\site-packages\pydocx\export\html.py", line 209, in <genexpr>
    result.to_html() if isinstance(result, HtmlTag)
  File "c:\python27\lib\site-packages\pydocx\export\base.py", line 123, in export
    for result in self.export_node(document):
  File "c:\python27\lib\site-packages\pydocx\export\base.py", line 218, in export_node
    for result in results:
  File "c:\python27\lib\site-packages\pydocx\export\html.py", line 127, in apply
    for result in results:
  File "c:\python27\lib\site-packages\pydocx\export\base.py", line 218, in export_node
    for result in results:
  File "c:\python27\lib\site-packages\pydocx\export\html.py", line 127, in apply
    for result in results:
  File "c:\python27\lib\site-packages\pydocx\export\base.py", line 251, in yield_nested
    for item in iterable:
  File "c:\python27\lib\site-packages\pydocx\export\base.py", line 290, in yield_numbering_spans
    numbering_spans = builder.get_numbering_spans()
  File "c:\python27\lib\site-packages\pydocx\export\numbering_span.py", line 703, in get_numbering_spans
    new_items.extend(self.process_component(index, component))
  File "c:\python27\lib\site-packages\pydocx\export\numbering_span.py", line 687, in process_component
    for new_component in self.handle_paragraph(index, component):
  File "c:\python27\lib\site-packages\pydocx\export\numbering_span.py", line 677, in handle_paragraph
    for item in self.handle_start_new_item(index, paragraph):
  File "c:\python27\lib\site-packages\pydocx\export\numbering_span.py", line 606, in handle_start_new_item
    self.add_item_to_span(index)
  File "c:\python27\lib\site-packages\pydocx\export\numbering_span.py", line 487, in add_item_to_span
    self.current_span.append_child(self.current_item)
AttributeError: 'NoneType' object has no attribute 'append_child'

jlward commented 6 years ago

Does testdocument.docx convert successfully with the version of PyDocX on PyPI, or does it only fail when installing from #231? If it only fails with the code from #231, I would have to look through that PR to see if there is a bug related to the new features. Additionally, if #231 is causing this issue, I would suggest that the owner of that PR find a way to add a test case covering this use case.

bitscompagnie commented 6 years ago

Yes, it does get converted.

Now, when I install and use v0.9.10 from GitHub, it gives the above error too. But when I use v0.9.9 from GitHub, the document gets converted. So it seems that I am having trouble only with v0.9.10.
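
For anyone repeating this comparison, below is a minimal sketch of how such installs can be pinned to a tag, branch, or pull request. `git_requirement` and `pip_install` are hypothetical helper names. The `git+<url>@<ref>` form and GitHub's `refs/pull/<number>/head` ref are what pip's VCS support accepts as far as I know, and the tag name `v0.9.9` is an assumption about how the releases are tagged in the repository.

```python
import subprocess
import sys

REPO = "https://github.com/CenterForOpenScience/pydocx.git"


def git_requirement(repo_url, ref=None):
    """Build a pip VCS requirement such as git+https://host/repo.git@ref."""
    requirement = "git+" + repo_url
    if ref:
        requirement += "@" + ref
    return requirement


def pip_install(requirement):
    """Install a requirement with the pip of the running interpreter."""
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", requirement])


# Pull request #231, addressed through GitHub's pull request ref.
PR_231 = git_requirement(REPO, "refs/pull/231/head")

# A release tag; the exact tag name is an assumption.
TAG_0_9_9 = git_requirement(REPO, "v0.9.9")

# Nothing is installed on import. Call for example:
#   pip_install(PR_231)
```

Installing each candidate into its own virtualenv keeps the PyPI release, the tag, and the PR from overwriting one another while testing the same document.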

Thanks for your help.