TeamMsgExtractor / msg-extractor

Extracts emails and attachments saved in Microsoft Outlook's .msg files
GNU General Public License v3.0

Saving to PDF 'Message' object has no attribute 'listdir' #291

Closed jordiyeh closed 2 years ago

jordiyeh commented 2 years ago

Bug Metadata

Describe the bug: I am following https://github.com/TeamMsgExtractor/msg-extractor/issues/102 and it results in the error below.

msg.save(pdf = True, wkOptions = ['-O', 'Portrait'])

'Message' object has no attribute 'listdir'

[ If applicable ] What code did you use or can we use to reproduce this error?

extract_msg --pdf test.msg 

Is there a message.msg file you want to share to help us reproduce this?

Traceback

Error with file "test.msg": Traceback (most recent call last):
  File "/.venv/lib/python3.10/site-packages/extract_msg/message_base.py", line 863, in save
    f.write(self.getSavePdfBody(**kwargs))
  File "/.venv/lib/python3.10/site-packages/extract_msg/message_base.py", line 402, in getSavePdfBody
    raise WKError(output[1].decode('utf-8'))
extract_msg.exceptions.WKError: You need to specify at least one input file, and exactly one output file
Use - for stdin or stdout

Name:
  wkhtmltopdf 0.12.6 (with patched qt)

Synopsis:
  wkhtmltopdf [GLOBAL OPTION]... [OBJECT]... <output file>

Document objects:
  wkhtmltopdf is able to put several objects into the output file, an object is
  either a single webpage, a cover webpage or a table of contents.  The objects
  are put into the output document in the order they are specified on the
  command line, options can be specified on a per object basis or in the global
  options area. Options from the Global Options section can only be placed in
  the global options area.

  A page objects puts the content of a single webpage into the output document.

  (page)? <input url/file name> [PAGE OPTION]...
  Options for the page object can be placed in the global options and the page
  options areas. The applicable options can be found in the Page Options and 
  Headers And Footer Options sections.

  A cover objects puts the content of a single webpage into the output document,
  the page does not appear in the table of contents, and does not have headers
  and footers.

  cover <input url/file name> [PAGE OPTION]...
  All options that can be specified for a page object can also be specified for
  a cover.

  A table of contents object inserts a table of contents into the output
  document.

  toc [TOC OPTION]...
  All options that can be specified for a page object can also be specified for
  a toc, further more the options from the TOC Options section can also be
  applied. The table of contents is generated via XSLT which means that it can
  be styled to look however you want it to look. To get an idea of how to do
  this you can dump the default xslt document by supplying the
  --dump-default-toc-xsl, and the outline it works on by supplying
  --dump-outline, see the Outline Options section.

Description:
  Converts one or more HTML pages into a PDF document, using wkhtmltopdf patched
  qt.

Global Options:
      --collate                       Collate when printing multiple copies (default)
      --no-collate                    Do not collate when printing multiple copies
      --copies <number>               Number of copies to print into the pdf file (default 1)
  -H, --extended-help                 Display more extensive help, detailing less common command switches
  -g, --grayscale                     PDF will be generated in grayscale
  -h, --help                          Display help
      --license                       Output license information and exit
      --log-level <level>             Set log level to: none, error, warn or info (default info)
  -l, --lowquality                    Generates lower quality pdf/ps. Useful to shrink the result document space
  -O, --orientation <orientation>     Set orientation to Landscape or Portrait (default Portrait)
  -s, --page-size <Size>              Set paper size to: A4, Letter, etc. (default A4)
  -q, --quiet                         Be less verbose, maintained for backwards compatibility; Same as using --log-level none
      --read-args-from-stdin          Read command line arguments from stdin
      --title <text>                  The title of the generated pdf file (The title of the first document is used if not specified)
  -V, --version                       Output version information and exit

Page Options:
      --print-media-type              Use print media-type instead of screen
      --no-print-media-type           Do not use print media-type instead of screen (default)

Contact:
  If you experience bugs or want to request new features please visit 
  <https://wkhtmltopdf.org/support.html>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/.venv/lib/python3.10/site-packages/extract_msg/__main__.py", line 112, in main
    msg.save(**kwargs)
  File "/.venv/lib/python3.10/site-packages/extract_msg/message_base.py", line 871, in save
    self.saveRaw(path)
  File "/.venv/lib/python3.10/site-packages/extract_msg/msg.py", line 526, in saveRaw
    for dir_ in self.listdir():
AttributeError: 'Message' object has no attribute 'listdir'

TheElementalOfDestruction commented 2 years ago

Alright, so this is actually 2 errors. One was a typo (the one that happened during handling) and the other was wkhtmltopdf saying that it needs an input file for some reason. I'll fix the typo one first then see if I can't track down why that wk issue happened.

(Edit: You are also a version behind, although that's not the cause of either issue)

TheElementalOfDestruction commented 2 years ago

To clarify, you have confirmed that extract_msg --pdf test.msg is enough to get it to give that error? Does it happen on specific files or all of them? What operating system are you using and does the issue happen on a different operating system (if you can test that)?

So far I have not been able to reproduce the wk error myself, despite using the exact same code you have mentioned, and using the same version of wkhtmltopdf listed in the traceback.

Edit: Here is a list of all of the things I have tried in order to get it to fail, using version 0.36.3 of extract-msg (wkPath was sometimes omitted which used the version on the path which was older, but had the same result of no error):

import extract_msg

with extract_msg.openMsg('test.msg') as msg:
    msg.save(pdf = True, wkOptions = ['-O', 'Portrait'], wkPath = r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe')
with extract_msg.openMsg('test.msg') as msg:
    msg.save(pdf = True, wkPath = r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe')

extract_msg --wk-path "C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe" --pdf test.msg
extract_msg --wk-path "C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe" --pdf --wk-options "+O Portrait" test.msg

jordiyeh commented 2 years ago

It happens on a couple of different .msg files. I am using macOS. The problem in my case seems to be in the following lines:

process = subprocess.Popen([wkPath, *parsedWkOptions, '-', '-'], shell = True, stdin = subprocess.PIPE, stdout = subprocess.PIPE, stderr = subprocess.PIPE)
# Give the program the data and wait for the program to
# finish.
output = process.communicate(self.getSaveHtmlBody(**kwargs))

wkPath is resolving correctly to '/usr/local/bin/wkhtmltopdf' and parsedWkOptions = ['+O', 'Portrait']

_input has the HTML of the msg

<bound method Popen._check_timeout of <Popen: returncode: 1 args: ['/usr/local/bin/wkhtmltopdf', '+O', 'Portrait',...>>

I am looking into what could cause the return code of 1.

TheElementalOfDestruction commented 2 years ago

parsedWkOptions = ['+O', 'Portrait']

That would cause a problem; it should be -O, not +O. The command line for extract_msg substitutes + with - for wkOptions because of issues with argparse (or it should, but something may be going wrong with your copy).

But basically the error is saying something is wrong with the listing for the input and output, and I'm guessing the +O might be the reason why.
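
As an illustration of the substitution described above, here is a minimal sketch of turning a --wk-options string such as "+O Portrait" into wkhtmltopdf arguments. The helper name parse_wk_options and its exact rules are assumptions made for this example; this is not the actual extract_msg implementation.

```python
import shlex


def parse_wk_options(raw: str) -> list:
    """Hypothetical helper: split an option string and turn leading '+' into '-'."""
    parsed = []
    for token in shlex.split(raw):
        if token.startswith('++'):
            # Long option, e.g. '++page-size' becomes '--page-size'.
            token = '--' + token[2:]
        elif token.startswith('+'):
            # Short option, e.g. '+O' becomes '-O'.
            token = '-' + token[1:]
        parsed.append(token)
    return parsed


print(parse_wk_options('+O Portrait'))
```

If a '+O' still shows up in the final argument list, a step like this was skipped somewhere, which is the situation suspected above.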

jordiyeh commented 2 years ago

I used -O, but changed it to +O.

The following change fixed the issue for me.

Instead of this line in message_base.py (line 397):

process = subprocess.Popen([wkPath, *parsedWkOptions, '-', '-'], shell = True, stdin = subprocess.PIPE, stdout = subprocess.PIPE, stderr = subprocess.PIPE)

I used

process = subprocess.Popen(' '.join([wkPath, *parsedWkOptions, '-', '-']), shell = True, stdin = subprocess.PIPE, stdout = subprocess.PIPE, stderr = subprocess.PIPE)

Does it work for you?

TheElementalOfDestruction commented 2 years ago

I'll have to test it, but I believe this will cause errors if the path has spaces in it because of how Popen works.

If you try with no options, does the original code work?

jordiyeh commented 2 years ago

If I remove the options, I get the same issue.

What about adding quotes on wkPath?

wkPath = '"' + findWk(kwargs.get('wkPath')) + '"'

The only thing I found on differences between string and list in popen is:

"Note If the cmd argument to popen2 functions is a string, the command is executed through /bin/sh. If it is a list, the command is directly executed." Ref: https://docs.python.org/3/library/subprocess.html#replacing-os-popen-os-popen2-os-popen3
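
For reference, the Python documentation for subprocess.Popen also notes that on POSIX with shell=True, when args is a sequence, the first item is the command string and any remaining items are passed as arguments to the shell itself. If that applies here, wkhtmltopdf would be started with no arguments on macOS, which would be consistent with the "at least one input file" message. A minimal sketch using echo as a stand-in (the helper name run_with_shell is made up for illustration):

```python
import subprocess
import sys


def run_with_shell(args):
    """Run args with shell=True and return what the command wrote to stdout."""
    result = subprocess.run(args, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return result.stdout.decode('utf-8')


if sys.platform != 'win32':
    # List form: only 'echo' is the command; 'hello' goes to the shell itself.
    print(repr(run_with_shell(['echo', 'hello'])))
    # String form: the whole string is handed to the shell as one command line.
    print(repr(run_with_shell('echo hello')))
```

On a POSIX system the list form prints only a newline, while the string form prints hello. On Windows the list is converted to a single command string first, so the two forms behave alike there.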

TheElementalOfDestruction commented 2 years ago

Try this for the line and see how it goes. It tested working on a Windows system, but I still need to test it in a Linux environment to ensure I won't break that (given it's the environment most commonly used for extract-msg). I'm also going to have to adjust things to either disallow bytes in the options or decode them. The list format allows a mix of bytes and strings while join does not.

process = subprocess.Popen(' '.join(f'"{x}"' if ' ' in x and x[0] != '"' else x for x in [wkPath, *parsedWkOptions, '-', '-']), shell = True, stdin = subprocess.PIPE, stdout = subprocess.PIPE, stderr = subprocess.PIPE)

This handles the case where the path may need to be quoted, and where other options may also need to be quoted.
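
To illustrate the bytes-versus-strings point, here is a minimal sketch of one way to decode the options before joining. It uses shlex.join for quoting, which follows POSIX shell rules and is a swapped-in technique, not the quoting shown in the line above. The helper name build_command_line and the path with a space in it are hypothetical.

```python
import shlex


def build_command_line(parts) -> str:
    """Hypothetical helper: decode bytes items, then quote and join for a POSIX shell."""
    decoded = [p.decode('utf-8') if isinstance(p, bytes) else p for p in parts]
    return shlex.join(decoded)


try:
    ' '.join(['wkhtmltopdf', b'-O', 'Portrait'])
except TypeError as exc:
    # str.join refuses bytes items outright.
    print('join failed:', type(exc).__name__)

# The path below is made up to show quoting of a space.
print(build_command_line(['/opt/my tools/wkhtmltopdf', b'-O', 'Portrait', '-', '-']))
```

Since shlex quoting targets POSIX shells, a Windows code path would still need its own handling.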

jordiyeh commented 2 years ago

Thanks for the clarification on why join will not handle a mix of bytes and strings.

The code handling the quoted path works for me now in a macOS environment.

TheElementalOfDestruction commented 2 years ago

Since it works on mac, I'll start doing full testing and see if I can find how the command deals with the mixed bytes and strings. Better would be if I could find why I can't manage to replicate the issue.

The issue seems to have something to do with your environment specifically, as with 0 arguments I have been completely unable to replicate the issue on Windows or Linux. Can you tell me if this gives you loading followed by a bunch of gibberish (that means it did not error) or whether it will give the wkhtmltopdf help (looks like what was in the beginning of your traceback, with all the options stuff):

wkhtmltopdf - - < /dev/null

This command should simulate what the code is doing when you give it no options.

Also just to be 100% clear, it gives the exact same traceback with the original line if you call the function like this?

msg.save(pdf = True)

TheElementalOfDestruction commented 2 years ago

Given the information you have given me, it's possible your environment may have issues with a lot of modules, since list arguments to Popen are standard, so you should probably test to make sure they work at all. The best way I can think of to do this is to make a Python file, then try to use Popen to run it as a subprocess and check the output. So write a script that is like this:

print('Hello world!')

And assuming that is in the current working directory that your interpreter is running in, the output of the following code should either be "Hello world!" or, as I am guessing it may end up doing, the start of the python interpreter's output when given no arguments.


import subprocess
import sys

from subprocess import PIPE

# Assuming your small test script is "my_file.py"
a = subprocess.Popen([sys.executable, 'my_file.py'], stdout = PIPE, stdin = PIPE, stderr = PIPE)
# The pipes are in binary mode, so send empty bytes and decode what comes back.
print(''.join(x.decode('utf-8') if isinstance(x, bytes) else x for x in a.communicate(b'')))

If your output is the interpreter's no-argument output described above instead of "Hello world!", that means that list arguments are completely failing on your system, something that should not be happening. Frankly, the fact that I can't replicate this, nor find anyone else having this kind of problem, suggests that it is not a bug in my code but rather your environment 🤷‍♀️

Edit: Also, looking at what you mentioned about the difference between string and list, it looks like that is for the function os.popen2 and not for subprocess.Popen. Looking at the code for Popen, it looks like it does not change the behavior except that it converts the list to a single string internally. And after checking the docs, it is the shell argument, which is set to True, that controls whether /bin/sh is used. Given that the code that didn't work and the code that did work both use it, I suspect that is not what is causing the problem. You can test whether setting shell = False on the original code allows it to work or not.

TheElementalOfDestruction commented 2 years ago

I've posted some new code (had to adjust the subprocess code because it apparently had a security vulnerability) to next-release. If you could use this to install from that branch and see if that code works for you or fails, that would be great. If it still fails, I'll just swap to the string parsing once I finish that (confirmed that it works) and then call it a day at this point.

pip install "git+https://github.com/TeamMsgExtractor/msg-extractor@next-release"
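
The exact change on next-release is not shown in this thread. As a rough sketch of the general technique of running a converter without a shell, the code below passes a list with the default shell=False and feeds the input on stdin. This is an assumption about the approach, not a description of the actual fix; run_filter is a hypothetical name, and a small Python child process stands in for wkhtmltopdf.

```python
import subprocess
import sys


def run_filter(command: list, data: bytes) -> bytes:
    """Run command without a shell, feed data on stdin, and return its stdout."""
    process = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate(data)
    if process.returncode != 0:
        # Surface the tool's own error text, much like WKError in the traceback above.
        raise RuntimeError(stderr.decode('utf-8', errors='replace'))
    return stdout


# Stand-in for wkhtmltopdf: a tiny child process that upper-cases its input.
stand_in = [sys.executable, '-c', 'import sys; sys.stdout.buffer.write(sys.stdin.buffer.read().upper())']
print(run_filter(stand_in, b'<p>hello</p>'))
```

With a list and no shell, each item reaches the program as one argument, so paths with spaces need no extra quoting.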

jordiyeh commented 2 years ago

Thanks. 0.36.4 solved the issue for me!