eaton opened this issue 1 year ago
This was encountered a second time with an endlessly repeating /~/~/~/~/~/~/~ URL. There are some expensive generalized solutions and some error-prone heuristics; we may have to choose between them.
One potential solution is to compare the referer's path to the current URL's path. If the difference between them consists of the same segment sequence repeated one or more times, consider it a dead-end looping URL. That won't prevent the "first-tier" explosion of bad URLs, but it can keep us from following them infinitely deep.
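A minimal sketch of that heuristic, assuming WHATWG `URL` parsing; the function name `isLoopingUrl` and the requirement of at least two repetitions are illustrative choices, not the crawler's actual API:

```typescript
// Flag a URL as a likely dead-end loop when the path segments it adds to
// its referer's path are the same segment sequence repeated 2+ times.
function isLoopingUrl(referer: string, current: string): boolean {
  const refSegments = new URL(referer).pathname.split('/').filter(Boolean);
  const curSegments = new URL(current).pathname.split('/').filter(Boolean);

  // Only consider URLs that strictly extend the referer's path.
  if (curSegments.length <= refSegments.length) return false;
  for (let i = 0; i < refSegments.length; i++) {
    if (curSegments[i] !== refSegments[i]) return false;
  }

  // Check whether the added portion is one unit repeated 2+ times.
  const added = curSegments.slice(refSegments.length);
  for (let runLen = 1; runLen <= added.length / 2; runLen++) {
    if (added.length % runLen !== 0) continue;
    const unit = added.slice(0, runLen).join('/');
    let allMatch = true;
    for (let i = 0; i < added.length; i += runLen) {
      if (added.slice(i, i + runLen).join('/') !== unit) {
        allMatch = false;
        break;
      }
    }
    if (allMatch) return true;
  }
  return false;
}
```

Requiring at least two repetitions of the added unit keeps ordinary one-level-deeper links (a normal crawl step) from being flagged.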
Simple checking of repeated URL elements is now in place and will ship with the 0.9.19 release; it catches https://example.com/~/~/~-style URLs, but isn't yet smart enough to identify the "missing protocol resulted in a full URL being appended to the base URL" problem from the very first example that started this issue.
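The shipped check works roughly along these lines; this is a hedged sketch, not the released code, and `hasRepeatedTail` plus the threshold of 3 are assumptions:

```typescript
// Flag URLs whose trailing path segments are the same segment repeated
// at least `threshold` times in a row, e.g. https://example.com/~/~/~.
function hasRepeatedTail(url: string, threshold = 3): boolean {
  const segments = new URL(url).pathname.split('/').filter(Boolean);
  let run = 1;
  for (let i = segments.length - 1; i > 0; i--) {
    if (segments[i] === segments[i - 1]) run++;
    else break;
  }
  return run >= threshold;
}
```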
Fun twist: https://blog.education.nationalgeographic.org/2013/09/24/jerusalem-the-movie-explores-tolerance-themes
shows off a special hellscape that generates infinitely expanding paths with an extra suffix. i.e., visiting:
/2013/09/24/jerusalem-the-movie-explores-tolerance-themes
Results in one of the links including the following URL:
/2013/09/24/jerusalem-the-movie-explores-tolerance-themes/NatGeoEd.org/Jerusalem
And visiting it results in one of the links including:
/2013/09/24/jerusalem-the-movie-explores-tolerance-themes/NatGeoEd.org/NatGeoEd.org/Jerusalem
Right now our recursion detector assumes that the duplication will only occur at the end of the URL; this needs to be reassessed.
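One way to reassess it is to scan for duplicated adjacent segments anywhere in the path rather than only at the tail; a sketch, with `hasRepeatedSegments` and its threshold being hypothetical names and defaults:

```typescript
// Flag URLs containing any run of `threshold` or more identical adjacent
// path segments, anywhere in the path, so that paths like
// .../themes/NatGeoEd.org/NatGeoEd.org/Jerusalem are caught even though
// the duplication is not at the end of the URL.
function hasRepeatedSegments(url: string, threshold = 2): boolean {
  const segments = new URL(url).pathname.split('/').filter(Boolean);
  let run = 1;
  for (let i = 1; i < segments.length; i++) {
    run = segments[i] === segments[i - 1] ? run + 1 : 1;
    if (run >= threshold) return true;
  }
  return false;
}
```

A threshold of 2 catches the National Geographic case on its second expansion, at the cost of flagging any legitimate URL that happens to repeat a segment back-to-back.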
In some situations where dynamic pages are generated based on the incoming URL, and links can be entered without a proper protocol, it's easy for some CMSs to generate infinite exploding URL trees, as in the examples above.
We need to figure out whether there's a good way to detect these scenarios; even a brute-force check is probably preferable to a hung crawl or (worse) a dataset that's trashed and has to be re-crawled.
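A brute-force check could be as blunt as counting how often each segment occurs anywhere in the path, adjacent or not, and refusing to enqueue URLs past a cap; `exceedsSegmentLimit` and the default limit are assumptions, and this is exactly the kind of error-prone heuristic mentioned above, since a legitimate URL can repeat a segment:

```typescript
// Refuse URLs where any single path segment occurs more than `limit`
// times anywhere in the path, regardless of position or adjacency.
function exceedsSegmentLimit(url: string, limit = 3): boolean {
  const counts = new Map<string, number>();
  for (const seg of new URL(url).pathname.split('/').filter(Boolean)) {
    const n = (counts.get(seg) ?? 0) + 1;
    counts.set(seg, n);
    if (n > limit) return true;
  }
  return false;
}
```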