utf-8 decoding crash - Githubissues

disconnect3d commented 4 days ago

Hi, the deadcode tool crashes when it encounters a file that is not valid UTF-8.

TL;DR:

(.venv) root@pwndbg:~/pwndbg# deadcode .
Traceback (most recent call last):
  File "/root/pwndbg/.venv/bin/deadcode", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/pwndbg/.venv/lib/python3.11/site-packages/deadcode/cli.py", line 20, in main
    unused_names = find_unused_names(filenames=filenames, args=args)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/pwndbg/.venv/lib/python3.11/site-packages/deadcode/actions/find_unused_names.py", line 13, in find_unused_names
    dead_code_visitor.visit_abstract_syntax_trees()
  File "/root/pwndbg/.venv/lib/python3.11/site-packages/deadcode/visitor/dead_code_visitor.py", line 101, in visit_abstract_syntax_trees
    file_content = f.read()
                   ^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 81: invalid start byte

This occurred when it tried to parse the following file: .venv/lib64/python3.11/site-packages/IPython/core/tests/nonascii.py which can be found here: https://github.com/ipython/ipython/blob/main/IPython/core/tests/nonascii.py
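The failure can be reproduced without deadcode. A minimal sketch, assuming the bytes below match the file (they are the content quoted later in this thread); `NONASCII_SOURCE` and `first_utf8_error` are illustrative names, not part of deadcode:

```python
# Raw bytes of nonascii.py, as quoted in this thread.
NONASCII_SOURCE = (
    b'# coding: iso-8859-5\n'
    b'# (Unlikely to be the default encoding for most testers.)\n'
    b'# \xb1\xb6\xff\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef'
    b' <- Cyrillic characters\n'
    b'u = "\xae\xe2\xf0\xc4"\n'
)


def first_utf8_error(data: bytes):
    """Return (offset, byte value) of the first UTF-8 decoding error, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start]
    return None


error = first_utf8_error(NONASCII_SOURCE)
print(error)
```

The offset and byte reported here should line up with the "byte 0xb1 in position 81" message in the traceback.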

matthewdeanmartin commented 3 days ago

The open() call needs to set the encoding, use a better detection strategy, support utf-8, or let the user set it.

Here are the workarounds for when open() guesses the encoding from the environment: https://stackoverflow.com/questions/36303919/what-encoding-does-open-use-by-default

I'd recommend setting the encoding explicitly as most linters recommend.

    def visit_abstract_syntax_trees(self) -> None:
        for file_path in self.filenames:
            # Explicit encoding instead of the locale-dependent default.
            with open(file_path, encoding="utf-8") as f:
                file_content = f.read()

disconnect3d commented 1 day ago

> The open() call needs to set the encoding, use a better detection strategy, support utf-8, or let the user set it.

The error explicitly says "utf-8 codec can't decode byte ..." which means it attempted to read the file in utf-8 and failed. I doubt you can automagically detect encoding for an arbitrary file.

The best course of action may be just reading the file in binary form and operating on that?
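A minimal sketch of that idea, assuming the raw bytes are handed straight to Python's `ast.parse`, which accepts `bytes` and honors a PEP 263 coding declaration in the first two lines. This is not necessarily how deadcode does it; `parse_file_as_bytes` is a hypothetical helper:

```python
import ast
import os
import tempfile


def parse_file_as_bytes(path: str) -> ast.Module:
    """Read a source file in binary mode and let the parser decode it."""
    with open(path, "rb") as f:
        source = f.read()
    # ast.parse accepts bytes and applies the file's own coding declaration.
    return ast.parse(source, filename=path)


# Demo on a temporary file that declares a non-UTF-8 encoding.
sample = b'# coding: iso-8859-5\nu = "\xae\xe2\xf0\xc4"\n'
with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as tmp:
    tmp.write(sample)
try:
    tree = parse_file_as_bytes(tmp.name)
finally:
    os.unlink(tmp.name)

# The string literal assigned to `u`, decoded by the parser.
assigned_value = tree.body[0].value.value
```

With this approach the tool never has to pick an encoding itself; a file with no declaration is treated as UTF-8 by the parser.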

Fwiw:

File content:

b'# coding: iso-8859-5\n# (Unlikely to be the default encoding for most testers.)\n# \xb1\xb6\xff\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef <- Cyrillic characters\nu = "\xae\xe2\xf0\xc4"\n'

EDIT: Huh, in this case the file specifies its own encoding... :)
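When a file declares its encoding like this, the standard library can recover it: `tokenize.detect_encoding` reads the coding cookie (or a BOM) from the first two lines and falls back to UTF-8 otherwise. A short sketch, with `decode_python_source` as a hypothetical helper name:

```python
import io
import tokenize


def decode_python_source(data: bytes) -> str:
    """Decode Python source bytes using the declared encoding (UTF-8 if none)."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(data).readline)
    return data.decode(encoding)


sample = b'# coding: iso-8859-5\nu = "\xae\xe2\xf0\xc4"\n'
detected, _ = tokenize.detect_encoding(io.BytesIO(sample).readline)
text = decode_python_source(sample)
print(detected)
```

For files on disk, `tokenize.open(path)` wraps the same detection and returns a text-mode file object.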

albertas commented 1 day ago

@disconnect3d Thank you for your suggestion. I completely agree that files should be analyzed in binary form. I have implemented this change in 4ab07e9 and released it in version 2.3.1.

disconnect3d commented 1 day ago

@albertas Awesome, thanks!

albertas / deadcode

utf-8 decoding crash #10