autogram-is / spidergram

Structural analysis tools for complex web sites
GNU General Public License v3.0
113 stars 4 forks source link

Detect infinite baseUrl explosions #42

Open eaton opened 1 year ago

eaton commented 1 year ago

In some situations where dynamic pages are generated based on the incoming URL, and links can be entered without a proper protocol, it's easy for some CMSs to generate infinite exploding URL trees. For example:

  1. http://example.com/event/{event-id} is used as a URL route
  2. Editors may create arbitrary sub-pages for the event, so any number of sub-urls are theoretically valid.
  3. Event ID parsing is permissive, discarding anything after the ID but not redirecting to the shorter, valid URL. For example, '1234' and '1234/foo' both render the page for '1234' but no redirection is performed.
  4. IF a malformed URL with no protocol is added to that page, browsers will treat the current page's URL as the base. For example, 'www.cnn.com' becomes 'http://example.com/event/1234/www.cnn.com';
  5. Following the malformed link, with the permissive ID parsing in effect, generates a new page that stacks the URL: 'http://example.com/event/1234/www.cnn.com/www.cnn.com'.
  6. Particularly when multiple protocol-less links are present, the size of the crawl explodes geometrically, quickly skewing page counts and bloating the dataset.

We need to figure out if there's a good way to detect these scenarios; even a brute force check is probably preferable to a hung crawl or (worse) a dataset that's trashed and has to be re-crawled.

eaton commented 1 year ago

This was encountered a second time with an endlessly repeating /~/~/~/~/~/~/~ URL; there are some expensive generalized solutions, and some error-prone heuristics; we might have to choose between them.

One potential solution is to check the referer's path, then the current URL's path. If the difference between them repeats 1-n times, consider it a dead-end looping URL. That won't help prevent the "first-tier" explosion of bad URLs, but can help us avoid following them infinitely-deep.

eaton commented 1 year ago

Simple checking of repeated URL elements is now in place and will ship with the 0.9.19 release; it catches the https://example.com/~/~/~ URLs, but isn't yet smart enough to identify the "missing protocol resulted in a full URL being appended to the baseURL" problem in the very first example that started this issue.

eaton commented 1 year ago

Fun twist: https://blog.education.nationalgeographic.org/2013/09/24/jerusalem-the-movie-explores-tolerance-themes shows off a special hellscape that generates infinitely expanding paths with an extra suffix. i.e., visiting:

/2013/09/24/jerusalem-the-movie-explores-tolerance-themes

Results in one of the links including the following URL:

/2013/09/24/jerusalem-the-movie-explores-tolerance-themes/NatGeoEd.org/Jerusalem

And visiting it results in one of the links including:

/2013/09/24/jerusalem-the-movie-explores-tolerance-themes/NatGeoEd.org/NatGeoEd.org/Jerusalem

Right now our recursion detector assumes that the duplication will only occur at the end of the URL; this needs to be reassessed.