18F / domain-scan

A lightweight pipeline, locally or in Lambda, for scanning things like HTTPS, third party service use, and web accessibility.

a11y scanner freezes on certain domains #110

Open gbinal opened 7 years ago

gbinal commented 7 years ago

The following domains break the a11y scan such that I have to stop it, remove the domain, and restart the scan all over again.

Two problems result:

afadvantage.gov
ama.gov
banknet.gov
biomassboard.gov
broadband.gov
dea.gov
disasterhousing.gov
export.gov
flightschoolcandidates.gov
grantsolutions.gov
gsaadvantage.gov
gsaauctions.gov
hrsa.gov
hydrogen.gov
idmanagement.gov
invasivespecies.gov
myfdicinsurance.gov
nationalbank.gov
nationalbanknet.gov
nationalhousing.gov
nationalhousinglocator.gov
nhl.gov
nls.gov
onhir.gov
pay.gov
realestatesales.gov
safetyact.gov
sciencebase.gov
segurosocial.gov
selectusa.gov
stopfakes.gov
tvaoig.gov
usdebitcard.gov

konklone commented 7 years ago

Could you paste the command line results of one of the errors?

In general, domain-scan scanners should fail gracefully: they should print out an error, or note in the saved data that an error occurred, but they should never crash the scan itself.
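
A minimal sketch of what "fail gracefully" could look like, assuming each scanner is a plain function that takes a domain and returns a dict. The names `safe_scan` and `flaky_scanner` are hypothetical and are not domain-scan's actual API:

```python
import traceback


def safe_scan(scanner, domain):
    """Run one scanner on one domain; never let an exception escape."""
    try:
        return {"domain": domain, "error": None, "data": scanner(domain)}
    except Exception as exc:
        # Print the error for the operator and record it in the saved row.
        print("[%s] scan failed: %r" % (domain, exc))
        message = "".join(traceback.format_exception_only(type(exc), exc))
        return {"domain": domain, "error": message.strip(), "data": None}


def flaky_scanner(domain):
    # Hypothetical scanner that fails on one domain.
    if domain == "broken.example":
        raise ValueError("could not parse page")
    return {"issues": 0}


rows = [safe_scan(flaky_scanner, d) for d in ["ok.example", "broken.example"]]
```

This only covers scanners that raise. A scanner that never returns would also need some kind of timeout.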

gbinal commented 7 years ago

Note that some of these (possibly all) are because they use meta-redirects (here's a partial list of domains that do so).
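
For anyone who wants to check a domain for this, here is a rough sketch that looks for a `<meta http-equiv="refresh">` tag using only the standard library. `MetaRefreshFinder` and `find_meta_refresh` are hypothetical names, and the sample HTML is made up for illustration:

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Collect the content attribute of any <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Attribute names arrive lowercased; values keep their original case.
        if (attrs.get("http-equiv") or "").lower() == "refresh":
            self.targets.append(attrs.get("content") or "")


def find_meta_refresh(html):
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.targets


sample = (
    '<html><head><meta http-equiv="Refresh" '
    'content="0; url=https://www.example.gov/"></head></html>'
)
print(find_meta_refresh(sample))
```

A non-empty result means the page redirects from within the HTML rather than with an HTTP 3xx response, which a scanner that only follows HTTP redirects would not see.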

gbinal commented 7 years ago

https://github.com/18F/domain-scan/pull/114 is taking a crack at this.

gbinal commented 7 years ago

Alas, it's still happening. Here's what was in the terminal after I let afadvantage.gov run for 9 hours and finally had to skip it with Ctrl+C.

[afadvantage.gov][a11y]
the_domain_is_cached: False
the_cache_is_not_forced: True
    Not cached.
[afadvantage.gov][a11y]
^CTraceback (most recent call last):
  File "./scan", line 178, in <module>
    run(options)
  File "./scan", line 84, in run
    scan_domains(scans, domains)
  File "./scan", line 142, in scan_domains
    executor.map(process_scan, tasks)
  File "/usr/lib/python3.4/concurrent/futures/_base.py", line 574, in __exit__
    self.shutdown(wait=True)
  File "/usr/lib/python3.4/concurrent/futures/thread.py", line 131, in shutdown
    t.join()
  File "/usr/lib/python3.4/threading.py", line 1060, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.4/threading.py", line 1076, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
Writing to cache: afadvantage.gov
Writing data for afadvantage.gov

Here's the same but for banknet.gov:

[bankhelp.gov][a11y]
the_domain_is_cached: False
the_cache_is_not_forced: True
    Not cached.
[bankhelp.gov][a11y]
Writing to cache: bankhelp.gov
Writing data for bankhelp.gov
[banknet.gov][a11y]
the_domain_is_cached: False
the_cache_is_not_forced: True
    Not cached.
[banknet.gov][a11y]
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python3.4/concurrent/futures/thread.py", line 38, in _python_exit
    t.join()
  File "/usr/lib/python3.4/threading.py", line 1060, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.4/threading.py", line 1076, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
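
Both tracebacks show the main thread blocked in `t.join()`, waiting on a worker thread that never finishes, so the hang appears to be inside the scan itself. One possible guard, assuming the a11y scanner shells out to an external command (an assumption; nothing in this thread shows how it is invoked), is to give that call a timeout. `run_with_timeout` is a hypothetical helper, and the sleeping child process below only stands in for a scan that hangs:

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd and return (output, error). A hung child is killed, not waited on."""
    try:
        output = subprocess.check_output(cmd, timeout=timeout_seconds)
        return output.decode("utf-8", "replace"), None
    except subprocess.TimeoutExpired:
        return None, "timed out after %s seconds" % timeout_seconds
    except subprocess.CalledProcessError as exc:
        return None, "exited with status %d" % exc.returncode


# Stand-in for a scan that hangs: a child process that sleeps for 30 seconds.
hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
output, error = run_with_timeout(hanging, 1)
print(error)
```

The `timeout` argument to `check_output` is available on the Python 3.4 shown in the tracebacks. If the command spawns child processes of its own, those may need separate cleanup.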

gbinal commented 7 years ago

A number of these are no longer factors because the DAP exclusion list removes them at an earlier stage. These still remain, though:

afadvantage.gov
banknet.gov
biomassboard.gov
dea.gov
export.gov
flightschoolcandidates.gov
grantsolutions.gov
gsaadvantage.gov
gsaauctions.gov
hrsa.gov
hydrogen.gov
idmanagement.gov
nationalbank.gov
pay.gov
realestatesales.gov
safetyact.gov
sciencebase.gov
selectusa.gov
stopfakes.gov
usdebitcard.gov

konklone commented 7 years ago

@gbinal @micahsaul - I'm seeing some hangs during a11y scans too, though it's not necessarily the same as this list. Do you still see issues with these domains?

micahsaul commented 7 years ago

Augh! No, I hadn't been, I'll take a look this week.