EXT was created to list extensions that should not be searched for links. The problem was that files like movies and images were being loaded when page discovery was searching for links. So, if a link's extension is in the EXT tuple, that link is noted but never analyzed for deeper links, because it points to a document rather than a page.
However, I see that .php, .html, etc. are now in there, which causes the program to skip looking for deeper links: it just notes all of the first-level links without looking any further.
My suggestion: if we want the pageguessing algorithm to use EXT but also extensions like .html and .php, create a second tuple, and in the pageguessing use (EXT + pageEXT) instead of EXT, where pageEXT = ('.html', '.htm', '.php', ...)
This concatenates the two for pageguessing but still keeps them separate, so I can keep using EXT on its own as the unwanted extensions list.
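
Here is a rough sketch of what I mean. The extension values and the helper names (link_ext, should_follow, guess_extensions) are made up for illustration and are not the real names in the code; only EXT and pageEXT come from the suggestion above.

```python
import os
from urllib.parse import urlparse

# Assumed sample values; the real EXT tuple in the project will differ.
# EXT holds only document/media extensions that should never be crawled.
EXT = ('.jpg', '.png', '.gif', '.avi', '.mp4', '.pdf')

# Page extensions live in their own tuple so they never block crawling.
pageEXT = ('.html', '.htm', '.php')


def link_ext(url):
    """Return the lowercased file extension of the URL path ('' if none)."""
    path = urlparse(url).path
    return os.path.splitext(path)[1].lower()


def should_follow(url):
    """Hypothetical crawler check: follow a link unless it is a document."""
    return link_ext(url) not in EXT


def guess_extensions():
    """Hypothetical pageguessing helper: use both tuples together.

    Tuple concatenation builds a new tuple, so EXT itself is unchanged.
    """
    return EXT + pageEXT
```

With this split, should_follow() still lets .php and .html links through for deeper searching, while the pageguessing side sees every extension from both tuples.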