Unfortunately, each "crawler" you create will run until it is done, and it is only when you run it again that your sites will be crawled again. In other words, crawlers won't recrawl a website in the same crawling session.
Is your primary concern not recrawling the same pages over and over? There are a few options for this, for example:

- A metadataFetcher and a metadataChecksummer to rely on HTTP header information (e.g. the Last-Modified date) so a page is only downloaded again if it has changed. This is often unreliable though, as many websites always return the current date.
- A recrawlableResolver to decide yourself which types of pages you want to crawl more frequently than others.

In any case, unmodified documents should not be sent to your Committer on subsequent crawls (default behavior).
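To illustrate, here is a minimal sketch of what those options can look like in a crawler config (class names and options are from memory for HTTP Collector 2.x, so verify them against the documentation of your version; the regex patterns are purely illustrative):

```xml
<crawler id="site1">

  <!-- Fetch HTTP headers separately before downloading the document. -->
  <metadataFetcher
      class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher" />

  <!-- Only re-download a page when its headers (e.g. Last-Modified) suggest
       it changed. Unreliable on sites that always return the current date. -->
  <metadataChecksummer
      class="com.norconex.collector.http.checksum.impl.LastModifiedMetadataChecksummer" />

  <!-- Control how often pages become eligible for recrawling. -->
  <recrawlableResolver
      class="com.norconex.collector.http.recrawl.impl.GenericRecrawlableResolver">
    <minFrequency applyTo="reference" value="daily">.*/news/.*</minFrequency>
    <minFrequency applyTo="reference" value="monthly">.*/archive/.*</minFrequency>
  </recrawlableResolver>

</crawler>
```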
For scheduling, you can schedule a restart of a collector at any interval you like using your OS scheduler (e.g. every hour). If it is already running, the new instance will get an error and abort while the original one will simply keep running until completion. That is a way to simulate "continuous" running. You could also check yourself if it is still running before starting a new instance.
There is something you could try: specify a maxDocuments value that is not too large. That way, it will force the crawler to stop earlier when that number is reached, and on the next run it will start from the beginning again, but it should not count unmodified pages (to confirm). So every site in your crawler could be crawled regularly, but it should go a bit further each time on larger sites.
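As a quick illustration (the value is arbitrary), it is just a per-crawler setting:

```xml
<crawler id="site1">
  <!-- Stop the crawl session once roughly this many documents have been
       processed; the next run starts over from the start URLs. -->
  <maxDocuments>1000</maxDocuments>
</crawler>
```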
Does that help?
@essiembre Running through crawler ids seems to help, but only partially in my case. My goal here is to create a continuous crawl process instead of adding a scheduler. As an interim solution, I'm writing a shell script piggybacking on bash mkfifo to create queue-like work sequences. For this, I centralized the default config, split out custom config files, and put them in their own folders.
It seems the hurdle is committing new or updated pages to the search engine when the crawled data comes from a few big sites with a very long crawl cycle. I can divide those sites into several crawl processes, but that may not be feasible since the URLs of those sites can be cross-referenced (which could make the crawl take longer to finish). That is the reason I am focusing on checking the HTTP header (Last-Modified). I think it basically becomes the same question I asked in my previous ticket (https://github.com/Norconex/collector-http/issues/459).
Any thoughts on this?
If you are breaking it down into multiple instances, you can put the execution of each in a loop that waits for the current execution to finish before starting another one.
You could also script the creation and execution of a config for each site, as opposed to putting many together in a single crawler. It means several collectors will run simultaneously, so it may take more hardware resources (especially if you have tons of sites).
For big sites, you can divide them as you suggest, with different URL filtering patterns to avoid overlap. But unless you are running into hardware performance issues and want to use different servers, it would likely just cause more overhead compared to using more threads in a single collector for that big site.
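If you go that route, a rough sketch of non-overlapping patterns could look like this (the filter class name is from memory and the site/regexes are made up, so adjust to your case):

```xml
<!-- Crawler A: only the /products/ section of the hypothetical site. -->
<crawler id="bigsite-products">
  <referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="include">https://www\.example\.com/products/.*</filter>
  </referenceFilters>
</crawler>

<!-- Crawler B: the rest of the same site, excluding what crawler A covers. -->
<crawler id="bigsite-rest">
  <referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="exclude">https://www\.example\.com/products/.*</filter>
  </referenceFilters>
</crawler>
```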
I decided to create a daemon process (with a PHP daemon library) that passes config parameters defining which site the shell script runs against. The shell script contains the Norconex Java commands. This way, I can limit the number of processes to a maximum. I also added a job scheduler (another PHP library) since each website has a different crawl frequency. It is at an initial POC stage and it seems to be working fine.
As I mentioned earlier, I have to run multiple sites that have a bigger volume. As you suggested, I may have to divide them up. I am not too worried about resource overhead for now since the daemon process can control it (more or less). However, it seems it will add some complication to creating a site config. I eventually have to hand over the final work to the Ops team, so I have to keep the configs as plain as possible. In this respect, I doubt that it would be a good idea; I know that needs a different discussion, though.
Thanks again @essiembre for your insightful opinion.
Glad you seem to have a solution. Keep in mind that with the Collector you can have not only shareable configuration fragments, but also variables defined in an external file (e.g. a .properties file). That way, you could make it so your Ops team only has to update the variable file(s) to match the targeted environment(s).
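For example (names and values made up), a variables file sitting next to a config (e.g. site1.variables next to site1.xml) can hold the environment-specific bits, and the XML only references them:

```xml
<!-- site1.xml: ${...} placeholders are resolved from the variables file,
     e.g. site1.variables containing lines such as
       startUrl = https://www.example.com/
       maxDepth = 10
-->
<crawler id="site1">
  <startURLs>
    <url>${startUrl}</url>
  </startURLs>
  <maxDepth>${maxDepth}</maxDepth>
</crawler>
```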
Are you good with your original question? Can we close?
I've divided the shareable portions out of the config files. In order to do that, I put three files under a "shared-configs" folder: default.xml, default-importer.xml, default-committer.xml. Then I override those configs in each site's file located under the collectors folder.
collectors
-- site1
---- site1.xml
---- site1.variables
shared-configs
-- default.xml
-- default-committer.xml
-- default-importer.xml
I think it is similar to what you suggest in the documentation.
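Roughly, each site config pulls in the shared pieces with #parse, something like this (a simplified sketch rather than my exact files; the root element, relative paths, and #parse behavior should be checked against the configuration docs):

```xml
<!-- collectors/site1/site1.xml (sketch) -->
<httpcollector id="site1-collector">
  #parse("../../shared-configs/default.xml")
  <crawlers>
    <crawler id="site1">
      #parse("../../shared-configs/default-importer.xml")
      #parse("../../shared-configs/default-committer.xml")
      <!-- site-specific overrides go here -->
    </crawler>
  </crawlers>
</httpcollector>
```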
I have a question regarding continuous crawling (or scheduling, for that matter). I've read your post on similar topics here: https://github.com/Norconex/collector-http/issues/93. But it doesn't seem to provide a clear answer for the following scenario.
Let's say there are more than 100 sites to crawl. Some are deep, counting more than 3,000 pages, and others not so much. I can separate them into several config files and let them run simultaneously on a schedule. But what if a crawl takes a couple of days, maybe more? It is hardly predictable if the site count keeps increasing and there are more sites to crawl. Because of this, I'd like to add a queuing system and let it run continuously, just as GSA does.
Is it possible to collect the URLs to crawl and insert the list into DynamoDB (or MongoDB), or maybe just the site URLs (starting points)? By reading these URLs from a DB along with checksum values, can I avoid recrawling the same pages repeatedly? I don't know exactly what GSA does in this regard, though; setting up a crawling plan with config files alone doesn't seem to be an option for me.