18F / domain-scan

A lightweight pipeline, locally or in Lambda, for scanning things like HTTPS, third party service use, and web accessibility.
Other
370 stars 137 forks source link

Path.resolve() throws exception when results file doesn't exist #261

Closed christhompson closed 6 years ago

christhompson commented 6 years ago

From a fresh clone, running the pshtt scanner with Python 3.5.3:

scan domains.csv --scan=pshtt --workers=36

Traceback (most recent call last):
  File "./scan", line 560, in <module>
    run(options, unknown, cache_dir, results_dir)
  File "./scan", line 120, in run
    scan_domains(scans, domains, options)
  File "./scan", line 150, in scan_domains
    scanner, options, (PREFIX_HEADERS, LOCAL_HEADERS, LAMBDA_HEADERS))
  File "cthomp/HTTP-Bad-Search-Console-Message/domain-scan/utils/scan_utils.py", line 572, in begin_csv_writing
    scanner_csv_path = Path(results_dir, "%s.csv" % name).resolve()
  File "/usr/lib/python3.5/pathlib.py", line 1109, in resolve
    s = self._flavour.resolve(self)
  File "/usr/lib/python3.5/pathlib.py", line 330, in resolve
    return _resolve(base, str(path)) or sep
  File "/usr/lib/python3.5/pathlib.py", line 315, in _resolve
    target = accessor.readlink(newpath)
  File "/usr/lib/python3.5/pathlib.py", line 422, in readlink
    return os.readlink(path)
FileNotFoundError: [Errno 2] No such file or directory: 'cthomp/HTTP-Bad-Search-Console-Message/domain-scan/results/pshtt.csv'

Looking at begin_csv_writing: https://github.com/18F/domain-scan/blob/ad6ea2fadc79500eb2060f4f281550f32127e47a/utils/scan_utils.py#L572

The Path.resolve() call can't succeed since it throws FileNotFoundError if the file doesn't exist (https://docs.python.org/3.5/library/pathlib.html#pathlib.Path.resolve), and scan tries to delete the file if it already exists before this.

Removing the call to resolve() allows the scan to work. (Or, if results/ is guaranteed to exist, then this could be Path("results/").resolve().joinpath("pshtt.csv").)

If results/pshtt.csv already exists, however, the following error occurs:

Traceback (most recent call last):
  File "./scan", line 560, in <module>
    run(options, unknown, cache_dir, results_dir)
  File "./scan", line 120, in run
    scan_domains(scans, domains, options)
  File "./scan", line 134, in scan_domains
    os.remove(result)
TypeError: remove: illegal type for path parameter

Which implicates: https://github.com/18F/domain-scan/blob/ad6ea2fadc79500eb2060f4f281550f32127e47a/scan#L131-L134

Inspecting, it appears that type(result) is PosixPath, but I think os.remove() expects a string, so this should maybe be os.remove(str(result)) (https://docs.python.org/3.5/library/os.html#os.remove).

christhompson commented 6 years ago

Quick followup: This works on 3.6rc1, as they added support for Path-like objects:

Changed in version 3.6: Accepts a path-like object.

(https://docs.python.org/3/library/os.html#os.remove)

But that doesn't work in 3.5.

Similarly for Path.resolve(strict=false) in 3.6rc1: https://docs.python.org/3/library/pathlib.html#pathlib.Path.resolve

konklone commented 6 years ago

It looks like https://github.com/18F/domain-scan/commit/ccf5e040a99ab11d86e799970eb25087ff230b70 made this change without realizing the impact to Python 3.5. Thanks for reporting it.

I'm up for fixing this, but is 3.5 support important for you? I could also see us just raising the expected Python version to 3.6, since it's become relatively easy/clean to upgrade Python versions.

christhompson commented 6 years ago

Yeah, I'd say that Python 3.5 support is not that important (it's the default python3 on my workstation but I can just opt-in to Python 3.6 for now), so the easier solution is probably to just bump the minimum version requirement to Python 3.6 in the readme.

konklone commented 6 years ago

:+1: Filed https://github.com/18F/domain-scan/pull/262 to do this.