raviks007 closed this issue 9 years ago.
Here is a minimalist configuration sample for Solr that should address your requirements (except for Oracle, explained further down).
<!-- You can define variables with the "set" directive to simplify your
config maintenance (using Apache Velocity syntax) -->
#set($urlFilter = "com.norconex.collector.http.filter.impl.RegexURLFilter")
<httpcollector id="Minimal Config HTTP Collector">
<crawlers>
<crawler id="Minimal Solr Config Example">
<!-- Requires at least one start URL. -->
<startURLs>
<url>http://lucene.apache.org/solr/</url>
</startURLs>
<!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
<maxDepth>2</maxDepth>
<!-- Be as nice as you can to sites you crawl. -->
<delay default="1500" />
<!-- At a minimum make sure you stay on your domain. -->
<httpURLFilters>
<filter class="$urlFilter"
onMatch="include">http://lucene.apache.org/solr/.*</filter>
<filter class="$urlFilter"
onMatch="exclude">.+\.(png|jpg|jpeg|gif|ico|css|js)$</filter>
<filter class="$urlFilter"
onMatch="exclude">.+\?.*</filter>
</httpURLFilters>
<importer>
<postParseHandlers>
<!-- Unless you configured Solr to accept ANY fields, it will fail
when you try to add documents. This "KeepOnlyTagger" drops
every field crawled except those you want to keep. -->
<tagger class="com.norconex.importer.tagger.impl.KeepOnlyTagger"
fields="document.reference,title" />
<!-- Strip content you do not want -->
<transformer class="com.norconex.importer.transformer.impl.StripBetweenTransformer"
inclusive="true" caseSensitive="false" >
<stripBetween>
<start><!-- whatever start text, like a comment --></start>
<end><!-- whatever end text, like a comment --></end>
</stripBetween>
<!-- multiple stripBetween tags allowed -->
</transformer>
<!-- The importer has a lot of config options where you can define
constants, rename fields, manipulate your content, etc. -->
</postParseHandlers>
</importer>
<!-- A "committer" dictates where the crawled content goes. -->
<committer class="com.norconex.committer.solr.SolrCommitter">
<solrURL>http://localhost:8983/solr/collection1</solrURL>
</committer>
<!-- When developing or troubleshooting, you can use the filesystem
committer so you can have a precise look at the content
that would be sent to Solr. -->
<!--
<committer class="com.norconex.committer.impl.FileSystemCommitter">
<directory>./examples-output/minimum/crawledFiles</directory>
</committer>
-->
</crawler>
</crawlers>
</httpcollector>
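To try the sample out, save it to a file and launch it with the collector's startup script. This is a sketch: the `-a start -c <config>` flags follow the 2.x distribution, and the config filename is an assumption.

```
# Save the sample above as e.g. minimum-config.xml, then start a crawl:
./collector-http.sh -a start -c minimum-config.xml
```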
A lot more tuning options are available to you and a fast way to discover them all is to look at these summarized configuration documentation pages:
HTTP-specific options: http://www.norconex.com/product/collector-http/configuration.html
Importing options: http://www.norconex.com/product/importer/configuration.html
Solr-specific options: http://www.norconex.com/product/committer-solr/
For Oracle, we have not yet released a database collector (the day will come). In the meantime you have a few options: put a web app on top of your database and crawl that, or extract the data from the database to files and use the Filesystem Collector to index them. The Filesystem Collector supports the same importing options and the same Solr "committer". Its location: http://www.norconex.com/product/collector-filesystem/
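For the export-to-files route, a Filesystem Collector configuration mirrors the HTTP one above. This is only a sketch: the root element name, the startPaths section, and the /data/oracle-export directory are assumptions to verify against the Filesystem Collector documentation for your version.

```xml
<fscollector id="Oracle Export Example">
  <crawlers>
    <crawler id="Oracle Export Files">
      <!-- Directory where the database rows were exported as files
           (hypothetical path) -->
      <startPaths>
        <path>/data/oracle-export</path>
      </startPaths>
      <!-- Same Solr committer as in the HTTP example -->
      <committer class="com.norconex.committer.solr.SolrCommitter">
        <solrURL>http://localhost:8983/solr/collection1</solrURL>
      </committer>
    </crawler>
  </crawlers>
</fscollector>
```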
Let us know if you need clarification on anything.
Did you make good progress with the provided sample? Do you have trouble with something?
Hi Pascal, I have not yet checked. Will check next week and will let you know.
Thanks, Ravi

On 17 Oct 2014 23:05, "Pascal Essiembre" notifications@github.com wrote:
Did you make good progress with the provided sample? Do you have trouble with something?
Hello Ravi. We have not heard in a while so I am assuming you got the answer you were looking for. Please re-open this ticket or create another one if you have more questions/issues.
I'm re-opening it since the provided example does not work:
com.norconex.collector.core.CollectorException: Cannot load crawler configurations.
at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfigs(CrawlerConfigLoader.java:93)
at com.norconex.collector.core.AbstractCollectorConfig.loadFromXML(AbstractCollectorConfig.java:183)
at com.norconex.collector.core.CollectorConfigLoader.loadCollectorConfig(CollectorConfigLoader.java:78)
at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:76)
at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)
Caused by: com.norconex.commons.lang.config.ConfigurationException: This class could not be instantiated: "com.norconex.collector.http.filter.impl.RegexReferenceFilter".
at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:190)
at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:333)
at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:265)
at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:115)
at com.norconex.collector.core.crawler.AbstractCrawlerConfig.loadReferenceFilters(AbstractCrawlerConfig.java:387)
at com.norconex.collector.core.crawler.AbstractCrawlerConfig.loadFromXML(AbstractCrawlerConfig.java:314)
at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfig(CrawlerConfigLoader.java:123)
at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfigs(CrawlerConfigLoader.java:83)
... 4 more
Caused by: java.lang.ClassNotFoundException: com.norconex.collector.http.filter.impl.RegexURLFilter
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:188)
... 11 more
According to the minimal example, it has been moved to: com.norconex.collector.core.filter.impl.RegexReferenceFilter
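With the filter classes moved to the core project, the URL filter section of the sample would look like this instead. This is a sketch assuming a 2.x collector, where the referenceFilters element replaces httpURLFilters:

```xml
<referenceFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
      onMatch="include">http://lucene.apache.org/solr/.*</filter>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
      onMatch="exclude">.+\.(png|jpg|jpeg|gif|ico|css|js)$</filter>
</referenceFilters>
```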
:)
Hi,
What changes have to be made to the collector-http.sh file to run this XML file?
I'm trying to run it, but it produces this error:
collector-http.sh: 4: collector-http.sh: realpath: not found
log4j:ERROR Could not read configuration file from URL [file:/log4j.properties].
java.io.FileNotFoundException: /log4j.properties (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.
Please help.
realpath is a command often present on your OS, but you can install it if it is not. Another option is simply to replace
export ROOT_DIR=$(realpath $(dirname $0))
with the hardcoded path of the directory where collector-http.sh is found. Example:
export ROOT_DIR=/home/iitp/Downloads/norconex-collector-http-2.7.1
The reason for using realpath is to handle situations where you invoke the script from a different directory (relative paths may otherwise be affected). Hard-coding the path has the same effect.
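If you prefer not to hardcode the path, a portable substitute for realpath is to cd into the script's directory and print the working directory. A minimal sketch (POSIX sh; how your shell expands $0 is an assumption):

```shell
# Resolve the directory containing this script, even when it is
# invoked from elsewhere; works without the realpath command.
ROOT_DIR=$(cd "$(dirname "$0")" && pwd)
echo "$ROOT_DIR"
```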
Please provide a sample setup to crawl a website and store the content in a Solr repository. We also have other requirements: indexing metadata, skipping certain URLs, parsing only part of a content page, and parsing data from an Oracle database.
Could you give a good example to help me implement the above requirements? We are deciding between Apache Nutch and Norconex. I have no experience with Norconex, as I only read about it yesterday. It would be helpful if you could provide input so that I can build a showcase and decide on a crawler.
Thanks, Ravi