Open ishandutta2007 opened 4 years ago
Hi,
It's a really nice idea. The major drawback I can see is that we could miss new pages linked from previously crawled pages.
But with some work (such as preloading all previously crawled links to avoid refetching them), we can implement that kind of feature.
The issue with this tool is that once it halts, you have to start all over again from scratch. With large sites, this is a very common scenario. Since we already have the partially generated XML, it would be nice to continue from where it was interrupted. Let me know your thoughts on this and how to achieve it. I am willing to send a pull request once I have a better understanding of the code.
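To make the resume idea concrete, here is a minimal sketch in Python of preloading already-crawled URLs from a partially written sitemap. It is not taken from this project's code: the function names `load_crawled_urls` and `links_to_fetch` are hypothetical, and it assumes the partial file contains standard `<loc>` entries and may be cut off mid-entry, which is why it uses a regular expression rather than a strict XML parser.

```python
import re
from html import unescape

# Matches only complete <loc>...</loc> entries, so a file that was
# truncated in the middle of an entry is still usable.
LOC_RE = re.compile(r"<loc>\s*(.*?)\s*</loc>", re.DOTALL)


def load_crawled_urls(partial_xml_text):
    """Return the set of URLs found in a (possibly truncated) sitemap.

    Hypothetical helper: assumes URLs are stored in <loc> elements with
    XML entities such as &amp; escaped.
    """
    return {unescape(url) for url in LOC_RE.findall(partial_xml_text)}


def links_to_fetch(discovered, already_crawled):
    """Return discovered links that were not crawled yet, in order.

    Duplicates within `discovered` are dropped as well.
    """
    seen = set(already_crawled)
    todo = []
    for url in discovered:
        if url not in seen:
            seen.add(url)
            todo.append(url)
    return todo


if __name__ == "__main__":
    # A sitemap that was interrupted while writing its third entry.
    partial = (
        "<urlset>\n"
        "<url><loc>https://example.com/</loc></url>\n"
        "<url><loc>https://example.com/a?x=1&amp;y=2</loc></url>\n"
        "<url><loc>https://exa"
    )
    crawled = load_crawled_urls(partial)
    print(links_to_fetch(
        ["https://example.com/", "https://example.com/b"], crawled
    ))
```

This sketch only covers skipping known URLs. Because preloaded pages are not refetched, any links that exist only on those pages are not rediscovered, which is the drawback mentioned above. A fuller resume would also need to persist the queue of discovered but not yet fetched links.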