bartdag / pylinkvalidator

pylinkvalidator is a standalone and pure python link validator and crawler that traverses a web site and reports errors (e.g., 500 and 404 errors) encountered.

Ignore Telephone Links #16

Closed jjtroberts closed 7 years ago

jjtroberts commented 8 years ago

Is there a way to enable the linkchecker to ignore telephone links? For a site with the following link:

<a href="tel:18002524793"><span>Assisted Living<br>Sales Office</span>1-800-252-4793</a>

The linkchecker attempts to crawl http://www.theosborn.org/tel:18006732926, which returns 404. The sites my company runs have multiple telephone links. This site in particular has 6 telephone links in a sidebar that renders on every single page, which results in quite a few false positives:

ERROR Crawled 1049 urls with 504 error(s) in 126.18 seconds

bartdag commented 8 years ago

There is an internal check that restricts crawling to http and https links, but tel: (and probably mailto:) links are not identified correctly, so the current domain is prepended to the tel link and pylinkvalidator then treats it as a normal http link.

Shouldn't be too hard to fix. Thanks for reporting this bug!
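
The kind of check described above can be sketched as follows. This is only an illustration, not pylinkvalidator's actual code: `explicit_scheme` and `should_crawl` are hypothetical helpers, and the scheme is read with a regular expression modeled on the RFC 3986 scheme grammar instead of a URL parser.

```python
import re

# Scheme grammar modeled on RFC 3986: a letter, then letters, digits, "+", "-" or ".".
SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):")

# Hypothetical allow-list: only these schemes are worth fetching.
CRAWLABLE_SCHEMES = {"http", "https"}


def explicit_scheme(href):
    """Return the lower-cased scheme of href, or None for a relative link."""
    match = SCHEME_RE.match(href.strip())
    return match.group(1).lower() if match else None


def should_crawl(href):
    """Crawl relative links and http(s) links; skip tel:, mailto:, and so on."""
    scheme = explicit_scheme(href)
    return scheme is None or scheme in CRAWLABLE_SCHEMES


for link in ("tel:18002524793", "mailto:foo@bar.com", "/about/"):
    print(link, should_crawl(link))
```

One trade-off of this approach: a scheme-less shorthand such as `localhost:8080/page` would be read as having the scheme `localhost` and skipped, so a real crawler would need to decide how to treat that ambiguity.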

bartdag commented 8 years ago

I take back what I said, this is already supported in pylinkvalidator: tel and mailto links should not be crawled. Which version of pylinkvalidator did you try? There is even a unit test with a mailto and tel link. I saw you also opened an issue on pylinkchecker, but the two codebases are now quite different.

jjtroberts commented 8 years ago

I was using pylinkchecker until I opened the issue there and started reading some of the comments on other issues. That's how I found your fork. I'm using version 0.3 of pylinkvalidate:

[root@plgdinfra02 scripts]# pylinkvalidator/pylinkvalidator/bin/pylinkvalidate.py --version
pylinkvalidate.py 0.3

Here's what it outputs:

pylinkvalidate.py -P -N -w 5 http://www.theosborn.org
Starting crawl...
200 - http://www.theosborn.org (1 of 88 - 1%)
404 - http://www.theosborn.org/tel:18006732926 (2 of 88 - 2%)
404 - http://www.theosborn.org/tel:19149258000 (3 of 88 - 3%)
404 - http://www.theosborn.org/tel:18007216695 (4 of 88 - 5%)
404 - http://www.theosborn.org/tel:18002524793 (5 of 88 - 6%)
404 - http://www.theosborn.org/tel:18005108895 (6 of 88 - 7%)
404 - http://www.theosborn.org/tel:12032921546 (7 of 88 - 8%)
200 - http://www.theosborn.org/about/ (8 of 88 - 9%)
200 - http://www.theosborn.org/news/ (9 of 88 - 10%)
200 - http://www.theosborn.org/events/ (10 of 88 - 11%)
200 - http://www.theosborn.org/careers/ (11 of 88 - 12%)
200 - http://www.theosborn.org/giving/ (12 of 88 - 14%)
200 - http://www.theosborn.org/westchester-county-retirement-community/ (13 of 88 - 15%)
200 - http://www.theosborn.org/location/ (14 of 88 - 16%)
200 - http://www.theosborn.org/activities/ (15 of 88 - 17%)
200 - http://www.theosborn.org/dining/ (16 of 88 - 18%)
200 - http://www.theosborn.org/faq/ (17 of 88 - 19%)
200 - http://www.theosborn.org/map/ (18 of 88 - 20%)
200 - http://www.theosborn.org/testimonials/ (19 of 88 - 22%)
200 - http://www.theosborn.org/miriams-attic/ (20 of 88 - 23%)
200 - http://www.theosborn.org/westchester-county-senior-apartments/ (21 of 88 - 24%)
200 - http://www.theosborn.org/westchester-county-independent-living/ (22 of 88 - 25%)
200 - http://www.theosborn.org/westchester-county-senior-housing/ (23 of 88 - 26%)
200 - http://www.theosborn.org/westchester-county-senior-care/ (24 of 88 - 27%)
200 - http://www.theosborn.org/westchester-county-assisted-living/ (25 of 88 - 28%)
200 - http://www.theosborn.org/westchester-county-memory-care/ (26 of 88 - 30%)
200 - http://www.theosborn.org/westchester-county-skilled-nursing/ (27 of 88 - 31%)
200 - http://www.theosborn.org/westchester-county-rehabilitation/ (28 of 88 - 32%)
200 - http://www.theosborn.org/westchester-county-hospice/ (29 of 88 - 33%)
200 - http://www.theosborn.org/home-care/ (30 of 88 - 34%)
200 - http://www.theosborn.org/westchester-county-respite/ (31 of 88 - 35%)
200 - http://www.theosborn.org/home-care-westchester-county-ny/ (32 of 88 - 36%)
200 - http://www.theosborn.org/home-care-fairfield-county-ct/ (33 of 88 - 38%)
200 - http://www.theosborn.org/leadership/ (34 of 88 - 39%)
200 - http://www.theosborn.org/resources/ (35 of 88 - 40%)
200 - http://www.theosborn.org/ohc-faq/ (36 of 88 - 41%)
200 - http://www.theosborn.org/gallery/ (37 of 88 - 42%)
200 - http://www.theosborn.org/gallery/photo-gallery/ (38 of 88 - 43%)
200 - http://www.theosborn.org/gallery/virtual-tour/ (39 of 88 - 44%)
200 - http://www.theosborn.org/contact-us/ (40 of 88 - 45%)
200 - http://www.theosborn.org/directions/ (41 of 88 - 47%)
200 - http://www.theosborn.org/privacy-policy/ (42 of 88 - 48%)
200 - http://www.theosborn.org/ (43 of 88 - 49%)
200 - http://www.theosborn.org/westchester-county-assisted-living-apartments/ (44 of 88 - 50%)
200 - http://www.theosborn.org/accreditation/ (45 of 88 - 51%)
200 - http://www.theosborn.org/history/ (46 of 88 - 52%)
200 - http://www.theosborn.org/event/summer-outdoor-concert-series-tuesdays-7-pm/ (47 of 88 - 53%)
200 - http://www.theosborn.org/contact-us-download/ (48 of 88 - 55%)
200 - http://www.theosborn.org/gallery/photo-gallery/ (48 of 87 - 55%)
200 - http://www.theosborn.org/additional-services/ (49 of 87 - 56%)
200 - http://www.theosborn.org/scholarship/ (50 of 87 - 57%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/images/the-osborn.jpg (51 of 87 - 59%)
200 - http://www.theosborn.org/wp-content/uploads/2013/01/osb-home_new.jpg (52 of 87 - 60%)
200 - http://www.theosborn.org/wp-content/uploads/2013/01/osb-gallerycallout_new.jpg (53 of 87 - 61%)
200 - http://www.theosborn.org/wp-content/uploads/2013/01/broshures.jpg (54 of 87 - 62%)
404 - http://www.theosborn.org/tel:9149258000 (55 of 87 - 63%)
200 - http://www.theosborn.org/event/families-managing-dementia-related-decline/ (56 of 87 - 64%)
200 - http://www.theosborn.org/sitemap/ (57 of 87 - 66%)
200 - http://www.theosborn.org/wp-content/uploads/2014/10/FB-f-Logo__white_50.png (58 of 87 - 67%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/images/ada.png (59 of 87 - 68%)
200 - http://www.theosborn.org/privacy-policy/ (59 of 86 - 69%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/images/eho.png (60 of 86 - 70%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/images/carf-ccac.png (61 of 86 - 71%)
200 - http://www.theosborn.org/wp-content/plugins/gd-worker/public/js/gd-worker-public.js?ver=1.0.0 (62 of 86 - 72%)
200 - http://www.theosborn.org/wp-content/plugins/gravity-forms-auto-placeholders/modernizr.placeholder.min.js?ver=1.2 (63 of 86 - 73%)
200 - http://www.theosborn.org/wp-content/plugins/gravity-forms-auto-placeholders/scripts.js?ver=1.2 (64 of 86 - 74%)
200 - http://www.theosborn.org/wp-content/plugins/slide-in/js/wdsi.js?ver=1.2 (65 of 86 - 76%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/scripts/jquery.matchHeight-min.js?ver=1.0 (66 of 86 - 77%)
200 - http://www.theosborn.org/wp-includes/js/jquery/jquery-migrate.min.js?ver=1.2.1 (67 of 86 - 78%)
200 - http://www.theosborn.org/wp-includes/js/jquery/ui/core.min.js?ver=1.11.4 (68 of 86 - 79%)
200 - http://www.theosborn.org/wp-includes/js/jquery/ui/widget.min.js?ver=1.11.4 (69 of 86 - 80%)
200 - http://www.theosborn.org/wp-includes/js/jquery/ui/accordion.min.js?ver=1.11.4 (70 of 86 - 81%)
200 - http://www.theosborn.org/wp-includes/js/wp-embed.min.js?ver=ffa3821e3d07f071d4c8934f4e0a1c62 (71 of 86 - 83%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/scripts/jquery.flexslider-min.js (72 of 86 - 84%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/scripts/jquery.magnific-popup.min.js (73 of 86 - 85%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/scripts/osborn.js (74 of 86 - 86%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/favicon.ico (75 of 86 - 87%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/styles/magnific-popup.css (76 of 86 - 88%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/style.css?v=2.0 (77 of 86 - 90%)
200 - http://www.theosborn.org/wp-content/plugins/gd-worker/public/css/gd-worker-public.css?ver=1.0.0 (78 of 86 - 91%)
200 - http://www.theosborn.org/wp-content/plugins/slide-in/css/wdsi.css?ver=1.2 (79 of 86 - 92%)
200 - http://www.theosborn.org/wp-content/plugins/ultimate-branding/ultimate-branding-files/modules/custom-admin-bar-files/css/general.css?ver=1.0 (80 of 86 - 93%)
200 - http://www.theosborn.org/wp-content/plugins/ultimate-branding/ultimate-branding-files/modules/favicons/css/admin.css?ver=1.0.0 (81 of 86 - 94%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/styles/flexslider.css (82 of 86 - 95%)
200 - http://www.theosborn.org/wp-includes/wlwmanifest.xml (83 of 86 - 97%)
200 - http://www.theosborn.org/wp-content/uploads/2015/10/door-knob_O_32x32.png?87261ec6721344e609568fab5cba4fbd (84 of 86 - 98%)
200 - http://www.theosborn.org/wp-json/ (85 of 86 - 99%)
200 - http://www.theosborn.org/xmlrpc.php?rsd (86 of 86 - 100%)
Crawling Done...

ERROR Crawled 86 urls with 7 error(s) in 14.37 seconds
  average response time: 0.69 seconds
  average process time: 0.29 seconds

  Start URL(s): http://www.theosborn.org

  not found (404): http://www.theosborn.org/tel:18007216695
    from http://www.theosborn.org
    from http://www.theosborn.org

  not found (404): http://www.theosborn.org/tel:18006732926
    from http://www.theosborn.org
    from http://www.theosborn.org

  not found (404): http://www.theosborn.org/tel:19149258000
    from http://www.theosborn.org

  not found (404): http://www.theosborn.org/tel:9149258000
    from http://www.theosborn.org

  not found (404): http://www.theosborn.org/tel:18005108895
    from http://www.theosborn.org
    from http://www.theosborn.org

  not found (404): http://www.theosborn.org/tel:18002524793
    from http://www.theosborn.org
    from http://www.theosborn.org
    from http://www.theosborn.org
    from http://www.theosborn.org

  not found (404): http://www.theosborn.org/tel:12032921546
    from http://www.theosborn.org
    from http://www.theosborn.org

bartdag commented 8 years ago

Thanks a lot for the detailed bug report. I see the problem now. The unit test uses an unrepresentative tel link (tel:foo@bar.com instead of tel:1234567890), and Python's urlsplit does not correctly parse the real tel link.

bartdag commented 8 years ago

After further research, tel:1203292154 is not a valid tel URI. It should be tel:+1203292154 (and in that case, it would correctly be parsed by Python and ignored by pylinkvalidator).

Browsers usually interpret these URIs as tel URIs even though they are malformed. I could thus add an option to try to detect them.
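
A detection heuristic of this kind might look like the sketch below. It is an assumption-laden illustration rather than the code behind any pylinkvalidator option: `is_bad_tel_url` is a hypothetical name, and the rule it applies (a tel: number needs either a leading "+" or a `;phone-context=` parameter) is a simplified reading of RFC 3966.

```python
def is_bad_tel_url(href):
    """Return True for tel: links with neither a leading "+" nor a phone-context.

    Simplified reading of RFC 3966: global numbers start with "+", and
    local numbers are expected to carry a ";phone-context=" parameter.
    """
    value = href.strip()
    if not value.lower().startswith("tel:"):
        return False  # not a tel link at all
    number = value[len("tel:"):]
    if number.startswith("+"):
        return False  # global number
    return ";phone-context=" not in number.lower()


for link in ("tel:1203292154", "tel:+1203292154"):
    print(link, is_bad_tel_url(link))
```

A crawler could run such a check on the raw href before resolving it against the page URL, so that malformed tel links are skipped instead of being treated as relative paths.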

jjtroberts commented 7 years ago

Adding an option to ignore tel: and mailto: would be helpful. I doubt I could convince my producers to go back through all of our client sites (200+) and add a "+1" to each tel: value.

bartdag commented 7 years ago

@gd-jroberts can you try the latest commit to see if it fixes your issue? I added an option, -b (or --ignore-bad-tel-urls), that ignores badly formed tel URLs. It works in the unit tests, but a real-world test would be even better.

Interestingly, it seems that the Python 2.6 urlparse function recognized all types of tel: URLs (e.g., tel:1234567890 and tel:+1234567890), but it was "fixed" in Python 2.7.
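
To see how a given interpreter splits the two forms, a small probe like the one below can help. It is written for Python 3, where `urlsplit` lives in `urllib.parse` (in Python 2 it is in the `urlparse` module); `describe` is a hypothetical helper. The result for the digits-only form depends on the Python version, so no expected output is shown for it.

```python
from urllib.parse import urlsplit


def describe(href):
    """Return the (scheme, path) pair that urlsplit reports for href."""
    parts = urlsplit(href)
    return parts.scheme, parts.path


# The "+" form is split into scheme "tel" and path "+1234567890".
# The digits-only form is the one whose handling has varied between versions.
for href in ("tel:+1234567890", "tel:1234567890"):
    print(href, describe(href))
```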

jjtroberts commented 7 years ago

Sorry for missing your last update:

I cloned master, ran setup.py install and executed the same command as last time with the following results:

`$ ./pylinkvalidate.py -P -N -w 5 http://www.theosborn.org
Starting crawl...
200 - http://www.theosborn.org (1 of 91 - 1%)
404 - http://www.theosborn.org/tel:19149258000 (2 of 91 - 2%)
404 - http://www.theosborn.org/tel:18002524793 (3 of 91 - 3%)
404 - http://www.theosborn.org/tel:18005108895 (4 of 91 - 4%)
404 - http://www.theosborn.org/tel:18007216695 (5 of 91 - 5%)
404 - http://www.theosborn.org/tel:18006732926 (6 of 91 - 7%)
404 - http://www.theosborn.org/tel:12032921546 (7 of 91 - 8%)
404 - http://www.theosborn.org/tel:18008500196 (8 of 91 - 9%)
200 - http://www.theosborn.org/about/ (9 of 91 - 10%)
200 - http://www.theosborn.org/events/ (10 of 91 - 11%)
200 - http://www.theosborn.org/news/ (11 of 91 - 12%)
200 - http://www.theosborn.org/giving/ (12 of 91 - 13%)
200 - http://www.theosborn.org/careers/ (13 of 91 - 14%)
200 - http://www.theosborn.org/activities/ (14 of 91 - 15%)
200 - http://www.theosborn.org/location/ (15 of 91 - 16%)
200 - http://www.theosborn.org/westchester-county-retirement-community/ (16 of 91 - 18%)
200 - http://www.theosborn.org/dining/ (17 of 91 - 19%)
200 - http://www.theosborn.org/miriams-attic/ (18 of 91 - 20%)
200 - http://www.theosborn.org/faq/ (19 of 91 - 21%)
200 - http://www.theosborn.org/testimonials/ (20 of 91 - 22%)
200 - http://www.theosborn.org/map/ (21 of 91 - 23%)
200 - http://www.theosborn.org/westchester-county-senior-apartments/ (22 of 91 - 24%)
200 - http://www.theosborn.org/westchester-county-senior-housing/ (23 of 91 - 25%)
200 - http://www.theosborn.org/westchester-county-independent-living/ (24 of 91 - 26%)
200 - http://www.theosborn.org/westchester-county-senior-care/ (25 of 91 - 27%)
200 - http://www.theosborn.org/westchester-county-memory-care/ (26 of 91 - 29%)
200 - http://www.theosborn.org/westchester-county-skilled-nursing/ (27 of 91 - 30%)
200 - http://www.theosborn.org/westchester-county-assisted-living/ (28 of 91 - 31%)
200 - http://www.theosborn.org/westchester-county-senior-care/westchester-county-rehabilitation/ (29 of 91 - 32%)
200 - http://www.theosborn.org/westchester-county-hospice/ (30 of 91 - 33%)
200 - http://www.theosborn.org/westchester-county-respite/ (31 of 91 - 34%)
200 - http://www.theosborn.org/home-care/ (32 of 91 - 35%)
200 - http://www.theosborn.org/home-care-westchester-county-ny/ (33 of 91 - 36%)
error - http://www.theosborn.org/home-care-fairfield-county-ct/ (34 of 91 - 37%)
error - http://www.theosborn.org/leadership/ (35 of 91 - 38%)
error - http://www.theosborn.org/resources/ (36 of 91 - 40%)
error - http://www.theosborn.org/ohc-faq/ (37 of 91 - 41%)
error - http://www.theosborn.org/gallery/ (38 of 91 - 42%)
error - http://www.theosborn.org/gallery/photo-gallery/ (39 of 91 - 43%)
error - http://www.theosborn.org/gallery/virtual-tour/ (40 of 91 - 44%)
error - http://www.theosborn.org/contact-us/ (41 of 91 - 45%)
error - http://www.theosborn.org/directions/ (42 of 91 - 46%)
error - http://www.theosborn.org/privacy-policy/ (43 of 91 - 47%)
error - http://www.theosborn.org/ (44 of 91 - 48%)
error - http://www.theosborn.org/westchester-county-assisted-living-apartments/ (45 of 91 - 49%)
error - http://www.theosborn.org/history/ (46 of 91 - 51%)
error - http://www.theosborn.org/accreditation/ (47 of 91 - 52%)
error - http://www.theosborn.org/scholarship/ (48 of 91 - 53%)
error - http://www.theosborn.org/2016/10/19/matt-anderson-presents-wellspring-osborn/ (49 of 91 - 54%)
error - http://www.theosborn.org/event/alexandra-zapruder-author-twenty-six-seconds-personal-history-zapruder-film/ (50 of 91 - 55%)
error - http://www.theosborn.org/photo-gallery/ (51 of 91 - 56%)
error - http://www.theosborn.org/contact-us-download/ (52 of 91 - 57%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/images/the-osborn.jpg (53 of 91 - 58%)
error - http://www.theosborn.org/additional-services/ (54 of 91 - 59%)
200 - http://www.theosborn.org/wp-content/uploads/2013/01/Homepage3.jpg (55 of 91 - 60%)
200 - http://www.theosborn.org/wp-content/uploads/2013/01/osb-gallerycallout_new.jpg (56 of 91 - 62%)
200 - http://www.theosborn.org/wp-content/uploads/2013/01/broshures.jpg (57 of 91 - 63%)
200 - http://www.theosborn.org/wp-content/uploads/2014/10/FB-f-Logo__white_50.png (58 of 91 - 64%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/images/ada.png (59 of 91 - 65%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/images/eho.png (60 of 91 - 66%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/images/carf-ccac.png (61 of 91 - 67%)
200 - http://www.theosborn.org/wp-content/plugins/gd-worker/public/js/gd-worker-public.js?ver=1.0.0 (62 of 91 - 68%)
200 - http://www.theosborn.org/wp-content/plugins/gravity-forms-auto-placeholders/modernizr.placeholder.min.js?ver=1.2 (63 of 91 - 69%)
200 - http://www.theosborn.org/wp-content/plugins/gravity-forms-auto-placeholders/scripts.js?ver=1.2 (64 of 91 - 70%)
200 - http://www.theosborn.org/wp-content/plugins/slide-in/js/wdsi.js?ver=1.2 (65 of 91 - 71%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/scripts/jquery.matchHeight-min.js?ver=1.0 (66 of 91 - 73%)
200 - http://www.theosborn.org/wp-includes/js/jquery/jquery-migrate.min.js?ver=1.4.1 (67 of 91 - 74%)
200 - http://www.theosborn.org/wp-includes/js/jquery/ui/core.min.js?ver=1.11.4 (68 of 91 - 75%)
200 - http://www.theosborn.org/wp-includes/js/jquery/ui/widget.min.js?ver=1.11.4 (69 of 91 - 76%)
200 - http://www.theosborn.org/wp-includes/js/jquery/ui/accordion.min.js?ver=1.11.4 (70 of 91 - 77%)
200 - http://www.theosborn.org/wp-includes/js/wp-embed.min.js?ver=c39570c078c67f50cfcafeebaf91152d (71 of 91 - 78%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/scripts/jquery.flexslider-min.js (72 of 91 - 79%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/scripts/jquery.magnific-popup.min.js (73 of 91 - 80%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/scripts/osborn.js (74 of 91 - 81%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/favicon.ico (75 of 91 - 82%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/styles/flexslider.css (76 of 91 - 84%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/styles/magnific-popup.css (77 of 91 - 85%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/style.css?v=2.0 (78 of 91 - 86%)
200 - http://www.theosborn.org/wp-content/plugins/gd-worker/public/css/gd-worker-public.css?ver=1.0.0 (79 of 91 - 87%)
200 - http://www.theosborn.org/wp-content/plugins/slide-in/css/wdsi.css?ver=1.2 (80 of 91 - 88%)
200 - http://www.theosborn.org/wp-content/plugins/ultimate-branding/ultimate-branding-files/modules/custom-admin-bar-files/css/general.css?ver=1.0 (81 of 91 - 89%)
200 - http://www.theosborn.org/wp-content/plugins/ultimate-branding/ultimate-branding-files/modules/favicons/css/admin.css?ver=1.0.0 (82 of 91 - 90%)
error - http://www.theosborn.org/tel:9149258000 (83 of 91 - 91%)
200 - http://www.theosborn.org/wp-includes/wlwmanifest.xml (84 of 91 - 92%)
error - http://www.theosborn.org/sitemap/ (85 of 91 - 93%)
error - http://www.theosborn.org/contact-us/privacy-policy/ (86 of 91 - 95%)
200 - http://www.theosborn.org/wp-content/uploads/2015/10/door-knob_O_32x32.png (87 of 91 - 96%)
error - http://www.theosborn.org/wp-json/ (88 of 91 - 97%)
error - http://www.theosborn.org/xmlrpc.php?rsd (89 of 91 - 98%)
error - http://www.theosborn.org/wp-json/oembed/1.0/embed?url=http%3A%2F%2Fwww.theosborn.org%2F (90 of 91 - 99%)
200 - http://www.theosborn.org/wp-json/oembed/1.0/embed?url=http%3A%2F%2Fwww.theosborn.org%2F&format=xml (91 of 91 - 100%)
Crawling Done...

ERROR Crawled 91 urls with 33 error(s) in 70.52 seconds

Start URL(s): http://www.theosborn.org

  not found (404): http://www.theosborn.org/tel:19149258000
    from http://www.theosborn.org
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/photo-gallery/
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/event/alexandra-zapruder-author-twenty-six-seconds-personal-history-zapruder-film/
    from http://www.theosborn.org
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/contact-us/
    from http://www.theosborn.org
    from http://www.theosborn.org
    from http://www.theosborn.org

  not found (404): http://www.theosborn.org/tel:18005108895
    from http://www.theosborn.org
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/
    from http://www.theosborn.org
    from http://www.theosborn.org
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/ohc-faq/
    from http://www.theosborn.org
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/leadership/
    from http://www.theosborn.org
    from http://www.theosborn.org

  not found (404): http://www.theosborn.org/tel:18007216695
    from http://www.theosborn.org
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/westchester-county-assisted-living-apartments/
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/xmlrpc.php?rsd
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/resources/
    from http://www.theosborn.org
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/wp-json/
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/additional-services/
    from http://www.theosborn.org
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/sitemap/
    from http://www.theosborn.org

  not found (404): http://www.theosborn.org/tel:18002524793
    from http://www.theosborn.org
    from http://www.theosborn.org
    from http://www.theosborn.org
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/contact-us/privacy-policy/
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/home-care-fairfield-county-ct/
    from http://www.theosborn.org
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/history/
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/tel:9149258000
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/gallery/
    from http://www.theosborn.org
    from http://www.theosborn.org
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/gallery/photo-gallery/
    from http://www.theosborn.org
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/privacy-policy/
    from http://www.theosborn.org
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/gallery/virtual-tour/
    from http://www.theosborn.org
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/scholarship/
    from http://www.theosborn.org

  not found (404): http://www.theosborn.org/tel:12032921546
    from http://www.theosborn.org
    from http://www.theosborn.org

  not found (404): http://www.theosborn.org/tel:18006732926
    from http://www.theosborn.org
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/contact-us-download/
    from http://www.theosborn.org

  not found (404): http://www.theosborn.org/tel:18008500196
    from http://www.theosborn.org
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/2016/10/19/matt-anderson-presents-wellspring-osborn/
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/accreditation/
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/wp-json/oembed/1.0/embed?url=http%3A%2F%2Fwww.theosborn.org%2F
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/directions/
    from http://www.theosborn.org
    from http://www.theosborn.org`

bartdag commented 7 years ago

@gd-jroberts I believe you are missing the --ignore-bad-tel-urls flag

jjtroberts commented 7 years ago

facepalm +1 for following directions.

./pylinkvalidate.py --ignore-bad-tel-urls -P -N -w 5 http://www.theosborn.org
Usage: pylinkvalidate.py [options] URL ...

pylinkvalidate.py: error: no such option: --ignore-bad-tel-urls

Did I not build it correctly?

$ git log
commit c8a0c6efbe6c351795ad24ed04a0eb6c9bf1e387
Author: Barthelemy Dagenais barthelemy@infobart.com
Date: Fri Jun 23 10:34:44 2017 -0400

    fixes #16 - added --ignore-bad-tel-urls option

bartdag commented 7 years ago

@gd-jroberts Here is a quick way to test without installing the package:

git clone https://github.com/bartdag/pylinkvalidator.git pylinkvalidator-new
cd pylinkvalidator-new
export PYTHONPATH=.
./pylinkvalidator/bin/pylinkvalidate.py -P -N -w 5 --ignore-bad-tel-urls http://www.theosborn.org

I think you are invoking the script from @master, but it uses the modules that are already installed (and not the ones in @master).
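
One way to check which copy of a package the interpreter would import is to ask the import system for its location. The sketch below uses only the standard library; `module_origin` is a hypothetical helper name. Running it with and without `PYTHONPATH=.` from the fresh clone should show whether the installed copy or the checked-out one is picked up.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module or package would be loaded from.

    Returns None when the module cannot be found (or has no single origin).
    """
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


# Prints the path of the pylinkvalidator package that would be imported,
# or None if it is not importable in the current environment.
print(module_origin("pylinkvalidator"))
```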

jjtroberts commented 7 years ago

`$ ./pylinkvalidator/bin/pylinkvalidate.py -P -N -w 5 --ignore-bad-tel-urls http://www.theosborn.org
Starting crawl...
200 - http://www.theosborn.org (1 of 83 - 1%)
200 - http://www.theosborn.org/events/ (2 of 83 - 2%)
200 - http://www.theosborn.org/news/ (3 of 83 - 4%)
200 - http://www.theosborn.org/giving/ (4 of 83 - 5%)
200 - http://www.theosborn.org/about/ (5 of 83 - 6%)
200 - http://www.theosborn.org/careers/ (6 of 83 - 7%)
200 - http://www.theosborn.org/westchester-county-retirement-community/ (7 of 83 - 8%)
200 - http://www.theosborn.org/location/ (8 of 83 - 10%)
200 - http://www.theosborn.org/activities/ (9 of 83 - 11%)
200 - http://www.theosborn.org/dining/ (10 of 83 - 12%)
200 - http://www.theosborn.org/map/ (11 of 83 - 13%)
200 - http://www.theosborn.org/faq/ (12 of 83 - 14%)
200 - http://www.theosborn.org/miriams-attic/ (13 of 83 - 16%)
200 - http://www.theosborn.org/westchester-county-independent-living/ (14 of 83 - 17%)
200 - http://www.theosborn.org/testimonials/ (15 of 83 - 18%)
200 - http://www.theosborn.org/westchester-county-senior-apartments/ (16 of 83 - 19%)
200 - http://www.theosborn.org/westchester-county-senior-housing/ (17 of 83 - 20%)
200 - http://www.theosborn.org/westchester-county-senior-care/ (18 of 83 - 22%)
200 - http://www.theosborn.org/westchester-county-assisted-living/ (19 of 83 - 23%)
200 - http://www.theosborn.org/westchester-county-memory-care/ (20 of 83 - 24%)
200 - http://www.theosborn.org/westchester-county-skilled-nursing/ (21 of 83 - 25%)
200 - http://www.theosborn.org/westchester-county-senior-care/westchester-county-rehabilitation/ (22 of 83 - 27%)
200 - http://www.theosborn.org/westchester-county-hospice/ (23 of 83 - 28%)
200 - http://www.theosborn.org/westchester-county-respite/ (24 of 83 - 29%)
200 - http://www.theosborn.org/home-care/ (25 of 83 - 30%)
200 - http://www.theosborn.org/leadership/ (26 of 83 - 31%)
200 - http://www.theosborn.org/home-care-westchester-county-ny/ (27 of 83 - 33%)
200 - http://www.theosborn.org/home-care-fairfield-county-ct/ (28 of 83 - 34%)
200 - http://www.theosborn.org/resources/ (29 of 83 - 35%)
200 - http://www.theosborn.org/gallery/photo-gallery/ (30 of 83 - 36%)
200 - http://www.theosborn.org/ohc-faq/ (31 of 83 - 37%)
200 - http://www.theosborn.org/gallery/virtual-tour/ (32 of 83 - 39%)
200 - http://www.theosborn.org/gallery/ (33 of 83 - 40%)
200 - http://www.theosborn.org/contact-us/ (34 of 83 - 41%)
200 - http://www.theosborn.org/privacy-policy/ (35 of 83 - 42%)
200 - http://www.theosborn.org/directions/ (36 of 83 - 43%)
error - http://www.theosborn.org/ (37 of 83 - 45%)
error - http://www.theosborn.org/westchester-county-assisted-living-apartments/ (38 of 83 - 46%)
error - http://www.theosborn.org/history/ (39 of 83 - 47%)
error - http://www.theosborn.org/accreditation/ (40 of 83 - 48%)
error - http://www.theosborn.org/scholarship/ (41 of 83 - 49%)
error - http://www.theosborn.org/2016/10/19/matt-anderson-presents-wellspring-osborn/ (42 of 83 - 51%)
error - http://www.theosborn.org/event/alexandra-zapruder-author-twenty-six-seconds-personal-history-zapruder-film/ (43 of 83 - 52%)
error - http://www.theosborn.org/photo-gallery/ (44 of 83 - 53%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/images/the-osborn.jpg (45 of 83 - 54%)
200 - http://www.theosborn.org/wp-content/uploads/2013/01/Homepage3.jpg (46 of 83 - 55%)
error - http://www.theosborn.org/contact-us-download/ (47 of 83 - 57%)
error - http://www.theosborn.org/additional-services/ (48 of 83 - 58%)
200 - http://www.theosborn.org/wp-content/uploads/2013/01/osb-gallerycallout_new.jpg (49 of 83 - 59%)
200 - http://www.theosborn.org/wp-content/uploads/2013/01/broshures.jpg (50 of 83 - 60%)
200 - http://www.theosborn.org/wp-content/uploads/2014/10/FB-f-Logo__white_50.png (51 of 83 - 61%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/images/ada.png (52 of 83 - 63%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/images/eho.png (53 of 83 - 64%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/images/carf-ccac.png (54 of 83 - 65%)
200 - http://www.theosborn.org/wp-content/plugins/gd-worker/public/js/gd-worker-public.js?ver=1.0.0 (55 of 83 - 66%)
200 - http://www.theosborn.org/wp-content/plugins/gravity-forms-auto-placeholders/modernizr.placeholder.min.js?ver=1.2 (56 of 83 - 67%)
200 - http://www.theosborn.org/wp-content/plugins/slide-in/js/wdsi.js?ver=1.2 (57 of 83 - 69%)
200 - http://www.theosborn.org/wp-content/plugins/gravity-forms-auto-placeholders/scripts.js?ver=1.2 (58 of 83 - 70%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/scripts/jquery.matchHeight-min.js?ver=1.0 (59 of 83 - 71%)
200 - http://www.theosborn.org/wp-includes/js/jquery/ui/core.min.js?ver=1.11.4 (60 of 83 - 72%)
200 - http://www.theosborn.org/wp-includes/js/jquery/jquery-migrate.min.js?ver=1.4.1 (61 of 83 - 73%)
200 - http://www.theosborn.org/wp-includes/js/jquery/ui/widget.min.js?ver=1.11.4 (62 of 83 - 75%)
200 - http://www.theosborn.org/wp-includes/js/jquery/ui/accordion.min.js?ver=1.11.4 (63 of 83 - 76%)
200 - http://www.theosborn.org/wp-includes/js/wp-embed.min.js?ver=c39570c078c67f50cfcafeebaf91152d (64 of 83 - 77%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/scripts/jquery.flexslider-min.js (65 of 83 - 78%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/scripts/jquery.magnific-popup.min.js (66 of 83 - 80%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/scripts/osborn.js (67 of 83 - 81%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/favicon.ico (68 of 83 - 82%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/styles/flexslider.css (69 of 83 - 83%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/styles/magnific-popup.css (70 of 83 - 84%)
200 - http://www.theosborn.org/wp-content/themes/the-osborn/style.css?v=2.0 (71 of 83 - 86%)
200 - http://www.theosborn.org/wp-content/plugins/gd-worker/public/css/gd-worker-public.css?ver=1.0.0 (72 of 83 - 87%)
200 - http://www.theosborn.org/wp-content/plugins/slide-in/css/wdsi.css?ver=1.2 (73 of 83 - 88%)
200 - http://www.theosborn.org/wp-content/plugins/ultimate-branding/ultimate-branding-files/modules/custom-admin-bar-files/css/general.css?ver=1.0 (74 of 83 - 89%)
200 - http://www.theosborn.org/wp-content/plugins/ultimate-branding/ultimate-branding-files/modules/favicons/css/admin.css?ver=1.0.0 (75 of 83 - 90%)
200 - http://www.theosborn.org/wp-includes/wlwmanifest.xml (76 of 83 - 92%)
error - http://www.theosborn.org/sitemap/ (77 of 83 - 93%)
error - http://www.theosborn.org/contact-us/privacy-policy/ (78 of 83 - 94%)
200 - http://www.theosborn.org/wp-content/uploads/2015/10/door-knob_O_32x32.png (79 of 83 - 95%)
error - http://www.theosborn.org/wp-json/ (80 of 83 - 96%)
error - http://www.theosborn.org/xmlrpc.php?rsd (81 of 83 - 98%)
error - http://www.theosborn.org/wp-json/oembed/1.0/embed?url=http%3A%2F%2Fwww.theosborn.org%2F (82 of 83 - 99%)
error - http://www.theosborn.org/wp-json/oembed/1.0/embed?url=http%3A%2F%2Fwww.theosborn.org%2F&format=xml (83 of 83 - 100%)
Crawling Done...

ERROR Crawled 83 urls with 16 error(s) in 52.14 seconds
  average response time: 0.89 seconds
  average process time: 0.01 seconds

Start URL(s): http://www.theosborn.org

  error (timeout): http://www.theosborn.org/photo-gallery/
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/event/alexandra-zapruder-author-twenty-six-seconds-personal-history-zapruder-film/
    from http://www.theosborn.org
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/history/
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/westchester-county-assisted-living-apartments/
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/wp-json/oembed/1.0/embed?url=http%3A%2F%2Fwww.theosborn.org%2F&format=xml
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/xmlrpc.php?rsd
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/contact-us/privacy-policy/
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/
    from http://www.theosborn.org
    from http://www.theosborn.org
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/2016/10/19/matt-anderson-presents-wellspring-osborn/
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/additional-services/
    from http://www.theosborn.org
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/contact-us-download/
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/sitemap/
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/scholarship/
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/wp-json/oembed/1.0/embed?url=http%3A%2F%2Fwww.theosborn.org%2F
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/accreditation/
    from http://www.theosborn.org

  error (timeout): http://www.theosborn.org/wp-json/
    from http://www.theosborn.org`

bartdag commented 7 years ago

So you are no longer crawling bad phone numbers (which was the expected behavior). Yay!

I'm getting these results (on Python 2.7):

SUCCESS Crawled 81 urls in 14.91 seconds
  average response time: 0.78 seconds
  average process time: 0.34 seconds

You may want to increase the timeout with --timeout=20. The errors you see mean that pylinkvalidator did not get a response within 10 seconds.

jjtroberts commented 7 years ago

Understood. Thanks for adding this feature!