bartdag / pylinkvalidator

pylinkvalidator is a standalone and pure python link validator and crawler that traverses a web site and reports errors (e.g., 500 and 404 errors) encountered.

Invalid IPv6 URL #14

Open jimpriest opened 9 years ago

jimpriest commented 9 years ago

When checking some URLs I get the following error:

error (<type 'exceptions.ValueError'>): Invalid IPv6 URL: 

This happens even though the URL is not formatted unusually.

Scan http://verticalindustriesblog.redhat.com/ with depth=1 for some examples.

I may modify my fork to just ignore this error, but I'm not sure there is a correct way to 'fix' it. From Googling, it seems to be an issue with Python 2.7.x.

I see it both on 2.7.5 and 2.7.10.

bartdag commented 9 years ago

Hi Jim, just to be sure, are we talking about URLs such as http://[http//w.on24.com/r.htm?e=991027&s=1&k=DBEA8D7CD7CF38AE3A007AB5432DAC2B&partnerref=sapredhat found on this page: http://verticalindustriesblog.redhat.com/tune-in-red-hat-sap-and-tabb-group-discuss-high-performance-computing-its-growth-in-financial-services-and-its-shrinking-cost/

Trying this link in Firefox raises a Server Not Found error (not even a 404). I guess pylinkvalidator should report a sensible error when the URL is not parsable. Just want to make sure I'm not missing other cases.

jimpriest commented 9 years ago

I think what is happening is that the WYSIWYG editor is trying to fix these links by adding an extra http// in the mix. Not sure where the bracket is coming from, but they aren't IPv6 URLs.

So yes, maybe a more generic 'unparsable url found' error message would be more useful?
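
For illustration, here is a minimal sketch of what such a generic error could look like. It assumes the ValueError comes from the standard library's URL splitting, which rejects a host containing an unmatched '['. The helper name split_or_report and the message wording are hypothetical and not part of pylinkvalidator; the bad URL is a shortened version of the one quoted above.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2.7


def split_or_report(url):
    """Return (parts, None) on success, or (None, message) if parsing fails.

    Hypothetical helper for illustration only; not part of pylinkvalidator.
    """
    try:
        return urlsplit(url), None
    except ValueError as exc:
        # An unmatched '[' in the host part makes urlsplit raise
        # ValueError("Invalid IPv6 URL"); report it generically instead.
        return None, "unparsable url found: %s (%s)" % (url, exc)


# Shortened form of the malformed link from the thread.
bad = "http://[http//w.on24.com/r.htm?e=991027&s=1"
good = "http://verticalindustriesblog.redhat.com/"

print(split_or_report(bad)[1])
print(split_or_report(good)[0].netloc)
```

Returning the message instead of raising would let the crawler record the link as broken and keep going, rather than surfacing a confusing IPv6 error for a URL that was never meant to be IPv6.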