Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Java configuration and events #90

Closed yvesnyc closed 9 years ago

yvesnyc commented 9 years ago

Hi,

I am trying out collector-http-2.1.0 and committer-elasticsearch-2.0.1.

The command line works fine! But I am unable to use the Java API for my needs.

I would like to load the XML configuration and, before running, update some fields such as startURLs, and set the cluster.name (Elasticsearch). Do you have an example that covers this? The first problem is that the Java example does not work:

/* XML configuration: */
HttpCollectorConfig config = new CollectorConfigLoader(HttpCollectorConfig.class)
        .loadCollectorConfig(myXMLFile, myVariableFile);

Type HttpCollectorConfig is not compatible with the loadCollectorConfig result.

If I set the committer class to elasticsearch in the XML file, how do I update the cluster name in code? Better yet, can I use an environment variable reference in the .xml or .variables files? I still need to set the startURLs setting.

Finally, I would like to capture the CRAWLEND event in Java as well. How do I set up a listener?

Thanks,

essiembre commented 9 years ago

You have a few questions mixed in there!

About configuring in the code, the documentation is missing a cast. Try this:

HttpCollectorConfig collectorConfig = (HttpCollectorConfig) new CollectorConfigLoader(
        HttpCollectorConfig.class).loadCollectorConfig(myXMLFile, myVariableFile);

// assuming you have exactly one crawler defined
HttpCrawlerConfig crawlerConfig = (HttpCrawlerConfig) collectorConfig.getCrawlerConfigs()[0];

// set the start urls (string array)
crawlerConfig.setStartURLs(startURLs);

// get the elasticsearch committer you created in your XML config
ElasticsearchCommitter committer = (ElasticsearchCommitter) crawlerConfig.getCommitter();

Once you have a config object, you should be able to get and set whatever you like.

About environment variables: they cannot be used in the XML or .variables files loaded from disk (that can be a feature request to add if you like). You can have multiple .variables files for your different sites, though. Programmatically, you can also set such values directly on the config object you get with the above sample.
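
For example, something along these lines should work (an untested sketch: it assumes ElasticsearchCommitter has a setClusterName(String) setter matching the <clusterName> XML option, and ES_CLUSTER_NAME is just a placeholder variable name):

// Read the cluster name from an environment variable at runtime.
// ES_CLUSTER_NAME is a placeholder; use whatever variable you define.
String clusterName = System.getenv("ES_CLUSTER_NAME");
if (clusterName != null) {
    // Assumes a setClusterName(String) setter mirroring the <clusterName> XML option.
    committer.setClusterName(clusterName);
}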

About adding a listener, here is an example using an anonymous inner class:

crawlerConfig.setCrawlerListeners(new ICrawlerEventListener[] {
        new ICrawlerEventListener() {
            @Override
            public void crawlerEvent(ICrawler crawler, CrawlerEvent event) {
                if (CrawlerEvent.CRAWLER_FINISHED.equals(event.getEventType())) {
                    // Done crawling... celebrate!
                }
            }
        }
});

Finally, when you are done configuring and are ready to launch...

HttpCollector collector = new HttpCollector(collectorConfig);
collector.start(false); // set true to "resume" a previously aborted/stopped job

Let me know if these answers work for you.

essiembre commented 9 years ago

I fixed the online code samples under the Getting Started pages (for both HTTP and Filesystem collectors). Thanks for reporting those documentation errors!

yvesnyc commented 9 years ago

Thanks @essiembre,

You answered all my questions. Sorry for so many at once.

I had looked over the API and saw everything I was looking for but lacked clarifying examples (which you provided).

I’m trying it out now.

One more question/clarification on HttpCollector’s behavior. It refreshes links over time and adds each new crawler’s startURLs to the permanent pool (in the “workdir”). Does the cron parameter allow refresh control?

Thanks again.

essiembre commented 9 years ago

Can you clarify what you mean by the "cron" parameter? What type of "refresh control" are you looking for? Can you give an example?

yvesnyc commented 9 years ago

I meant the “delay” configuration. It has a “schedule” feature (cron-like).

Let's say HttpCollector is run like a service and receives URLs, starting a crawler for each URL. As each crawler finishes, its URLs are added to the datastore in the workdir. My assumption is that HttpCollector automatically schedules crawlers to recheck links in the datastore. The frequency of this rechecking (if it happens) is what I am asking about.

essiembre commented 9 years ago

The delay schedules are not like a cron. They do not "trigger" anything, so they do not "schedule" crawlers. You decide yourself when you want to re-run the HTTP Collector (manually, via your own cron job, a Windows scheduled task, or otherwise). They only dictate the delay between each "hit" (a web page being accessed) during each specified period, for an already running HTTP Collector.

An example could be: you decide to have a 3-second delay between each hit during business hours (one delay schedule) and a 0.5-second delay during off hours (another delay schedule) to minimize the impact on your sites.
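
You can also set the delay in code instead of XML. A rough sketch (assuming GenericDelayResolver from com.norconex.collector.http.delay.impl offers a setDefaultDelay(long) setter and HttpCrawlerConfig a setDelayResolver(...) setter mirroring the <delay> XML option; see the GenericDelayResolver documentation for the schedule-specific setup):

// Reuses the crawlerConfig object from the earlier snippet.
GenericDelayResolver delayResolver = new GenericDelayResolver();
delayResolver.setDefaultDelay(3000); // 3 seconds between hits, in milliseconds
crawlerConfig.setDelayResolver(delayResolver);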

The HTTP Collector is not always on and is not "listening" for new URLs. It has start and end states. You have to restart it for it to crawl your sites again, and it will know whether something was added or modified based on the store database it keeps from the previous run.
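
If you want something closer to an always-on service, you can do the scheduling yourself from Java. An untested sketch (the one-hour interval is arbitrary, and collectorConfig comes from the earlier snippet):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Re-run the collector on a fixed interval; each run detects additions and
// changes using the store database kept in the workdir from the previous run.
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleWithFixedDelay(new Runnable() {
    @Override
    public void run() {
        HttpCollector collector = new HttpCollector(collectorConfig);
        collector.start(false);
    }
}, 0, 1, TimeUnit.HOURS);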

Does it clarify things enough for you?

yvesnyc commented 9 years ago

Yes. I got it.

Thanks again

essiembre commented 9 years ago

No problem!