Path.resolve() throws exception when results file doesn't exist

christhompson commented 6 years ago

From a fresh clone, running the pshtt scanner with Python 3.5.3:

scan domains.csv --scan=pshtt --workers=36

Traceback (most recent call last):
  File "./scan", line 560, in <module>
    run(options, unknown, cache_dir, results_dir)
  File "./scan", line 120, in run
    scan_domains(scans, domains, options)
  File "./scan", line 150, in scan_domains
    scanner, options, (PREFIX_HEADERS, LOCAL_HEADERS, LAMBDA_HEADERS))
  File "cthomp/HTTP-Bad-Search-Console-Message/domain-scan/utils/scan_utils.py", line 572, in begin_csv_writing
    scanner_csv_path = Path(results_dir, "%s.csv" % name).resolve()
  File "/usr/lib/python3.5/pathlib.py", line 1109, in resolve
    s = self._flavour.resolve(self)
  File "/usr/lib/python3.5/pathlib.py", line 330, in resolve
    return _resolve(base, str(path)) or sep
  File "/usr/lib/python3.5/pathlib.py", line 315, in _resolve
    target = accessor.readlink(newpath)
  File "/usr/lib/python3.5/pathlib.py", line 422, in readlink
    return os.readlink(path)
FileNotFoundError: [Errno 2] No such file or directory: 'cthomp/HTTP-Bad-Search-Console-Message/domain-scan/results/pshtt.csv'

Looking at begin_csv_writing: https://github.com/18F/domain-scan/blob/ad6ea2fadc79500eb2060f4f281550f32127e47a/utils/scan_utils.py#L572

The Path.resolve() call can't succeed since it throws FileNotFoundError if the file doesn't exist (https://docs.python.org/3.5/library/pathlib.html#pathlib.Path.resolve), and scan tries to delete the file if it already exists before this.

Removing the call to resolve() allows the scan to work. (Or, if results/ is guaranteed to exist, then this could be Path("results/").resolve().joinpath("pshtt.csv").)

If results/pshtt.csv already exists, however, the following error occurs:

Traceback (most recent call last):
  File "./scan", line 560, in <module>
    run(options, unknown, cache_dir, results_dir)
  File "./scan", line 120, in run
    scan_domains(scans, domains, options)
  File "./scan", line 134, in scan_domains
    os.remove(result)
TypeError: remove: illegal type for path parameter

Which implicates: https://github.com/18F/domain-scan/blob/ad6ea2fadc79500eb2060f4f281550f32127e47a/scan#L131-L134

Inspecting, it appears that type(result) is PosixPath, but I think os.remove() expects a string, so this should maybe be os.remove(str(result)) (https://docs.python.org/3.5/library/os.html#os.remove).

christhompson commented 6 years ago

Quick followup: This works on 3.6rc1, as they added support for Path-like objects:

Changed in version 3.6: Accepts a path-like object.

(https://docs.python.org/3/library/os.html#os.remove)

But that doesn't work in 3.5.

Similarly for Path.resolve(strict=false) in 3.6rc1: https://docs.python.org/3/library/pathlib.html#pathlib.Path.resolve

konklone commented 6 years ago

It looks like https://github.com/18F/domain-scan/commit/ccf5e040a99ab11d86e799970eb25087ff230b70 made this change without realizing the impact to Python 3.5. Thanks for reporting it.

I'm up for fixing this, but is 3.5 support important for you? I could also see us just raising the expected Python version to 3.6, since it's become relatively easy/clean to upgrade Python versions.

christhompson commented 6 years ago

Yeah, I'd say that Python 3.5 support is not that important (it's the default python3 on my workstation but I can just opt-in to Python 3.6 for now), so the easier solution is probably to just bump the minimum version requirement to Python 3.6 in the readme.

konklone commented 6 years ago

:+1: Filed https://github.com/18F/domain-scan/pull/262 to do this.

18F / domain-scan

Path.resolve() throws exception when results file doesn't exist #261