pa11y-crawl does not work with subdomains

18F / pa11y-crawl

Crawl a site, run pa11y on every HTML page, and get the results

pa11y-crawl does not work with subdomains #3

Open gemfarmer opened 8 years ago

gemfarmer commented 8 years ago

I've noticed that pa11y-crawl gives the following error when attempting to crawl a URL with a subdomain.

. is not an html document, skipping

For example, https://login.gov crawls successfully, but https://useiti.doi.gov or https://18f.gsa.gov cannot find valid html to scan.

If the same projects are crawled on localhost, they crawl properly.

This is a problem on Federalist URLs, because we end up seeing the following:

 ||   . is not an html document, skipping
 ||   federalist.18f.gov is not an html document, skipping
 ||   federalist.18f.gov/preview is not an html document, skipping
 ||   federalist.18f.gov/preview/18F is not an html document, skipping
 ||   federalist.18f.gov/preview/18F/18f.gsa.gov is not an html document, skipping
 ||   federalist.18f.gov/preview/18F/18f.gsa.gov/master is not an html document, skipping
 ||   federalist.18f.gov/preview/18F/18f.gsa.gov/master/index.html is not an html document, skipping

gemfarmer commented 8 years ago

After reviewing this with @waldoj, it looks like this is not related to subdomains (that was a coincidence), but is likely related to how pa11y-crawl opts to use a sitemap if one is available. This isn't a problem when the project is run over localhost.

This is the likely offending line. It is possible that $TEMP_DIR is saving the sitemap URLs in a strange manner.
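To illustrate one way the log above could arise, here is a minimal sketch in Python. It is not pa11y-crawl's code, and it assumes (without confirmation from the source) that the tool walks a temporary directory of downloaded pages and tests every entry it finds. The function names `list_entries`, `looks_like_html`, `classify`, and `build_sample_mirror` are hypothetical. Under that assumption, each directory level shows up as its own entry and fails an HTML check, which would match lines such as `.` and `federalist.18f.gov/preview`. It does not explain why `index.html` itself was skipped in the log.

```python
import os
import tempfile


def list_entries(root):
    """List every path under root, directories and files alike, relative to root."""
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        rel_dir = os.path.relpath(dirpath, root)  # "." for the root itself
        entries.append(rel_dir.replace(os.sep, "/"))
        for name in sorted(filenames):
            rel_file = os.path.normpath(os.path.join(rel_dir, name))
            entries.append(rel_file.replace(os.sep, "/"))
    return entries


def looks_like_html(path):
    """Crude content sniff: a regular file whose first bytes look like HTML."""
    if not os.path.isfile(path):
        return False  # a directory can never be an HTML document
    with open(path, "rb") as handle:
        head = handle.read(512).lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))


def classify(root):
    """Map each entry to True (HTML) or False (would be reported as skipped)."""
    return {
        entry: looks_like_html(os.path.join(root, entry))
        for entry in list_entries(root)
    }


def build_sample_mirror(root):
    """Create a tiny tree shaped like the paths in the log (hypothetical data)."""
    page_dir = os.path.join(root, "federalist.18f.gov", "preview")
    os.makedirs(page_dir)
    with open(os.path.join(page_dir, "index.html"), "w", encoding="utf-8") as handle:
        handle.write("<!DOCTYPE html><html><body>hi</body></html>")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        build_sample_mirror(tmp)
        for entry, is_html in classify(tmp).items():
            status = "html" if is_html else "is not an html document, skipping"
            print(entry, status)
```

If the real script behaves this way, restricting the check to regular files would at least remove the directory lines from the output.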

cc @stvnrlly

syndy1989 commented 7 years ago

Hi, I'm new to pa11y accessibility testing. I'm trying to use pa11y-crawl [URL] to find all HTML pages and run pa11y on each one, but I'm getting the error below. Am I missing anything? Any advice would be helpful. Thanks in advance.

C:\Windows\system32>pa11y-crawl nature.com
'bash' is not recognized as an internal or external command, operable program or batch file.

stvnrlly commented 7 years ago

@syndy1989 Hi there!

As an initial matter, you should know that pa11y-crawl is both experimental and unsupported, which makes it pretty fragile. You may have better success with one of the more official pa11y options, such as the "webservice".

Regarding the error that you're seeing: it looks like you're running on Windows, while this currently works on macOS. I'm not that familiar with the Windows command line, but I don't believe it supports bash natively. If you're on Windows 10, there's now a way to create an Ubuntu Linux environment and use bash. That may allow you to use this tool (though, because it's unsupported, you may still have issues).
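For anyone hitting the same error, a quick way to confirm whether a `bash` executable is reachable is to look it up on the PATH. The following is a minimal sketch in Python using the standard library's `shutil.which`. The helper names `find_shell` and `describe_shell` are hypothetical, and finding bash this way does not guarantee that pa11y-crawl will run.

```python
import shutil


def find_shell(name="bash"):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


def describe_shell(name="bash"):
    """Build a short human-readable report about whether the shell was found."""
    path = find_shell(name)
    if path is None:
        return f"{name} was not found on PATH"
    return f"{name} found at {path}"


if __name__ == "__main__":
    print(describe_shell("bash"))
```

If the lookup reports that bash is missing, the `'bash' is not recognized` message is expected, and the tool cannot start until a bash environment is installed and on the PATH.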

syndy1989 commented 7 years ago

@stvnrlly Hi there, I'm actually using Windows Server 2012. I tried installing Cygwin on Windows to run bash commands. I've noticed that pa11y-crawl gives the following error when attempting to crawl a URL with a subdomain.

. is not an html document, skipping

Any advice on this would be helpful. Thanks in advance.

stvnrlly commented 7 years ago

I'm afraid that I won't be able to help troubleshoot that issue. If we're able to spend time working on this project in the future, we may be able to fix the problem that caused this issue to be opened in the first place, which may help with what you're seeing.