cjdhein / perseus

Graphical Web Crawler

Invalid URL starting with "http://://" #22

Closed. yoavgil closed this issue 5 years ago

yoavgil commented 6 years ago

The data crawler reports an "Invalid URL" error in both DFS and BFS modes. The invalid URLs appear to begin with "http://://". The error occurs for many of the pages the crawler visits, but not for the starting page.

The crawler still appears to visit these pages correctly. However, the graph displays "Invalid, broken, or otherwise unreachable URL" for them, even when the visited page has child pages and therefore cannot be a broken link.

Example error messages from starting web page "wikipedia.org":

Error <class 'requests.exceptions.InvalidURL'>: Invalid URL u'http://://he.wikipedia.org': No host supplied

Error <class 'requests.exceptions.InvalidURL'>: Invalid URL u'http://://commons.wikimedia.org': No host supplied
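
For anyone trying to reproduce this without running the crawler, the sketch below shows why a URL of this shape has no host. It is only an illustration: it uses Python 3's standard `urllib.parse` rather than the crawler's own code (the `u'...'` prefix in the messages suggests the crawler runs on Python 2), and `has_host` is a hypothetical helper name, not something from the repository.

```python
from urllib.parse import urlsplit


def has_host(url):
    """Return True if the URL parses with a non-empty host component."""
    # hostname is None when the authority part carries no host name.
    return urlsplit(url).hostname is not None


bad = "http://://he.wikipedia.org"
good = "http://he.wikipedia.org"

# In the malformed URL the authority section (between "//" and the next "/")
# is just ":", so there is no host, and the real address ends up in the path.
parts = urlsplit(bad)
print(repr(parts.netloc), repr(parts.path))
print(has_host(bad), has_host(good))
```

A check like this could run before each request so that malformed links are logged with their source page rather than surfacing only as a request error.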

cjdhein commented 6 years ago

I believe I have this fixed in the crawler3 branch. I was prepending 'http://' to URLs that began with '://', when I should have prepended just 'http'.
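
A minimal sketch of the kind of normalization described above, for illustration only. This is not the code from the crawler3 branch; `normalize_url` and `default_scheme` are hypothetical names, and the handling of `//host` and bare-host links is an assumption about what a crawler would typically want.

```python
import re

# Matches a leading URL scheme such as "http://" or "https://".
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def normalize_url(url, default_scheme="http"):
    """Return url with a scheme, without doubling the '://' separator."""
    url = url.strip()
    if url.startswith("://"):
        # The separator is already present: add only the scheme name.
        return default_scheme + url
    if url.startswith("//"):
        # Protocol-relative link: add the scheme name and the colon.
        return default_scheme + ":" + url
    if not _SCHEME_RE.match(url):
        # Bare host or path: add the scheme and the full separator.
        return default_scheme + "://" + url
    return url


print(normalize_url("://he.wikipedia.org"))
print(normalize_url("wikipedia.org"))
```

The point of the ordering is that the `'://'` case is checked before the generic "no scheme" case, so a link that already carries the separator never receives a second one.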

cjdhein commented 5 years ago

Fixed.