MayankPandey01 / BrokenLinkHijacker

A Fast Broken Link Hijacker Tool written in Python
MIT License

Needs work to actually find all broken links #2

Closed frederickjh closed 3 years ago

frederickjh commented 3 years ago

I have been using another broken link tester michaeltelford/broken_link_finder, but am currently looking for one that is portable.

I set up a test page with 19 broken links to test website broken-link finders against the following test cases and combinations thereof:

I ran BrokenLinkHijacker against my test page.

Not sure where to start. A page with 19 broken links and it does not even find one.

MayankPandey01 commented 3 years ago

There are a few reasons for the problem you are facing. It would help if you could provide a link to the web page you are testing.

It found only 10 links

  • It currently only searches for links in "a href" and "img" tags (more tags will be added).

It searched 4 broken links, saying that it could not connect, then reported

  • These were the links that returned a status code other than 404. This is useful for spotting dead social media links, which either redirect or show some other error when a non-existent username is given.

NO BROKEN LINKS FOUND

  • The broken link section shows only links that return a 404 error. Other dead links are shown under "UNABLE TO CONNECT".

As this is the first release, it has a few bugs. PRs are always welcome and appreciated.

frederickjh commented 3 years ago

The test page is now on GitHub here https://frederickjh.github.io/broken-link-test-website/ (The final slash is required to actually hit the page and not get GitHub's 404)

The 4 broken links it searched were not 404 but not found. These are links on a non-existent domain name.

Sorry, I think your logic is flawed about what constitutes a broken link. You seem to think that only a 404 is a broken link. There are a number of other HTTP error codes that should also be considered a broken link. Especially in the 400 and 500 range. In the 300 range if a link redirects too many times this should also be considered broken as a web browser will reject it. Safari seems to have the lowest at 16 redirects.

A broken link is a link that is unable to provide the resource it should link to.
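The definition above can be written down as a small predicate. This is only a sketch of that definition, not code from BrokenLinkHijacker: the name `is_broken`, its parameters, and the `MAX_REDIRECTS` constant are hypothetical, and the limit of 16 is simply the Safari figure quoted above.

```python
from typing import Optional

# Assumption: 16 is the Safari redirect limit mentioned in the comment above.
MAX_REDIRECTS = 16


def is_broken(status: Optional[int], redirects: int = 0) -> bool:
    """Return True if a link cannot provide the resource it points to.

    status    -- final HTTP status code, or None if no response was
                 received at all (DNS failure, refused connection, timeout)
    redirects -- number of redirects followed before the final response
    """
    if status is None:
        # Non-existent domain or unreachable host: nothing was delivered.
        return True
    if redirects > MAX_REDIRECTS:
        # A browser would give up on a redirect chain this long.
        return True
    # Any client (4xx) or server (5xx) error counts, not just 404.
    return 400 <= status <= 599
```

Under this definition a link on a non-existent domain (`status=None`) and a link returning 500 are both broken, exactly like a 404.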

MayankPandey01 commented 3 years ago

First of all, I suggest you read the "readme.md"; it will give you an idea of how this tool works. As this tool was built for bug bounty hunters, it was made to work fast and reliably. It is not an SEO tool that scans for links on your domain. Its main purpose is to find dead OUTBOUND links. Now, coming to your test website, I will try to explain how the tool behaved on it.

There are 20 links in total on your website. Below is the output I got:

[!] UNABLE TO CONNECT: https://thisdomaindoesnotexist-thouthou.com/nonexistenimage.png
[!] UNABLE TO CONNECT: https://thisdomaindoesnotexist-thouthou.com/badpage.html
[!] UNABLE TO CONNECT: https://example.com/images/non-existing_logo.png
[!] UNABLE TO CONNECT: https://example.com/brokenlink
[*] NO BROKEN LINKS FOUND

[+] Total Inbound links: 6
[+] Total Outbound links: 5
[+] Total URLs: 11

Why were only 11 links detected?

  • Because only 11 links were proper links, i.e. 5 are relative links (this tool does not scan relative URLs). The other 4 links had query strings and were treated as the same link even though their queries differ. This happens because URLParser parses URLs based on their directory, not on their queries.
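To illustrate the two effects described above (relative links skipped, links differing only in their query collapsed into one), here is a minimal sketch using the standard library's `urllib.parse`. It is not the tool's actual code; the function name `collect_absolute` is hypothetical, and the sample URLs in the usage note are made up for illustration.

```python
from urllib.parse import urlsplit


def collect_absolute(hrefs):
    """Keep absolute URLs only, treating URLs that differ just by
    their query string as one and the same link."""
    seen = set()
    kept = []
    for href in hrefs:
        parts = urlsplit(href)
        if not parts.scheme or not parts.netloc:
            # Relative URL (no scheme or no host): skipped entirely.
            continue
        # The query is deliberately left out of the deduplication key.
        key = (parts.scheme, parts.netloc, parts.path)
        if key in seen:
            continue
        seen.add(key)
        kept.append(href)
    return kept
```

For example, given `"/a.html"`, `"https://example.com/p?x=1"` and `"https://example.com/p?x=2"`, only `"https://example.com/p?x=1"` is kept: the first is relative, and the third has the same scheme, host and path as the second.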

11 links were found and they were all dead, so why are only 4 links displayed in the output?

  • This is because only the outbound links are checked for being dead. If we scanned all the links (INBOUND + OUTBOUND) on a real target, it would take an enormous amount of time. Usually, we have no control over links hosted on the same domain, so they are unlikely to be a significant security issue. So we only focus on the OUTBOUND links.
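The inbound/outbound split can be sketched roughly as follows. This assumes a link is "inbound" when its hostname exactly matches the hostname of the scanned page; the tool may compare domains differently, and the name `split_links` and the `missing.html` path in the usage note are hypothetical.

```python
from urllib.parse import urlsplit


def split_links(page_url, links):
    """Partition absolute links into (inbound, outbound) by comparing
    each link's hostname with the hostname of the scanned page."""
    page_host = urlsplit(page_url).hostname
    inbound, outbound = [], []
    for link in links:
        if urlsplit(link).hostname == page_host:
            inbound.append(link)   # same host: not checked for deadness
        else:
            outbound.append(link)  # other host: candidate for checking
    return inbound, outbound
```

Called with the test page `https://frederickjh.github.io/broken-link-test-website/` as `page_url`, a link such as `https://frederickjh.github.io/broken-link-test-website/missing.html` lands in the inbound list, while `https://example.com/brokenlink` lands in the outbound list and would be the only one checked.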

So why are there 4 in the output when there are 5 outbound links?

  • Because you used "example.com" in one of the links. It is an outbound link but not a dead one (it is an actual website).

No, I don't think this program has flawed logic. A 404 is not the only thing considered a broken link. The output is shown in 3 sections.

You can combine all these sections to get an idea of what you are dealing with. If you think a feature should be added, feel free to open a PR.