Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Please provide a sample setup to crawl a website and store the content in Solr repo. #41

Closed raviks007 closed 9 years ago

raviks007 commented 10 years ago

Please provide a sample setup to crawl a website and store the content in a Solr repo. We also have other requirements, such as indexing metadata, skipping certain URLs, parsing only part of a content page, and pulling data from an Oracle database.

Could you provide a good example to help me implement the above requirements? We are deciding between Apache Nutch and Norconex. I have no experience with Norconex, as I only read about it yesterday. Any input you can provide would help me showcase it and decide on a crawler.

Thanks, Ravi

essiembre commented 10 years ago

Here is a minimalist configuration sample for Solr that should address your requirements (except for Oracle, explained further down).

<!-- You can define variables with the "set" directive to simplify your 
     config maintenance (using Apache Velocity syntax) -->
#set($urlFilter = "com.norconex.collector.http.filter.impl.RegexURLFilter")

<httpcollector id="Minimal Config HTTP Collector">
    <crawlers>
        <crawler id="Minimal Solr Config Example">
            <!-- Requires at least one start URL. -->
            <startURLs>
                <url>http://lucene.apache.org/solr/</url>
            </startURLs>

            <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
            <maxDepth>2</maxDepth>

            <!-- Be as nice as you can to sites you crawl. -->
            <delay default="1500" />

            <!-- At a minimum make sure you stay on your domain. -->
            <httpURLFilters>
                <filter class="$urlFilter"
                    onMatch="include">http://lucene.apache.org/solr/.*</filter>
                <filter class="$urlFilter"
                    onMatch="exclude">.+\.(png|jpg|jpeg|gif|ico|css|js)$</filter>
                <filter class="$urlFilter"
                    onMatch="exclude">.+\?.*</filter>
            </httpURLFilters>

            <importer>
                <postParseHandlers>
                    <!-- Unless you configured Solr to accept ANY fields, it will fail
                         when you try to add documents.  This "KeepOnlyTagger" drops
                         every crawled field except those you want to keep. -->
                    <tagger class="com.norconex.importer.tagger.impl.KeepOnlyTagger"
                        fields="document.reference,title" />

                    <!-- Strip content you do not want -->
                    <transformer class="com.norconex.importer.transformer.impl.StripBetweenTransformer"
                          inclusive="true" caseSensitive="false" >
                        <stripBetween>
                            <start>&lt;!-- whatever start text, like a comment --&gt;</start>
                            <end>&lt;!-- whatever end text, like a comment --&gt;</end>
                        </stripBetween>
                        <!-- multiple stripBetween tags allowed -->
                    </transformer>                        

                    <!-- The importer has a lot of config options where you can define
                         constants, rename fields, manipulate your content, etc. -->
                </postParseHandlers>
            </importer>

            <!-- A "committer" dictates where the crawled content goes. -->
            <committer class="com.norconex.committer.solr.SolrCommitter">
                <solrURL>http://localhost:8983/solr/collection1</solrURL>
            </committer>

            <!-- When developing or troubleshooting, you can use the filesystem
                 committer so you can have a precise look at the content 
                 that would be sent to Solr. -->
            <!--
            <committer class="com.norconex.committer.impl.FileSystemCommitter">
                <directory>./examples-output/minimum/crawledFiles</directory>
            </committer>
            -->

        </crawler>
    </crawlers>
</httpcollector>

A lot more tuning options are available to you and a fast way to discover them all is to look at these summarized configuration documentation pages:

HTTP-specific options: http://www.norconex.com/product/collector-http/configuration.html
Importing options: http://www.norconex.com/product/importer/configuration.html
Solr-specific options: http://www.norconex.com/product/committer-solr/

For Oracle, we have not yet released a database collector (the day will come). So you have a few options, such as putting a web app on top of your database and crawling that, or extracting the data from the database to files and then using the Filesystem Collector to index that data. The Filesystem Collector supports the same importing options and the same Solr "committer". Its location: http://www.norconex.com/product/collector-filesystem/
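To illustrate the file-export option, here is a rough sketch of a Filesystem Collector configuration pointing at a directory of exported database files. The export path is hypothetical, and element names should be verified against the Filesystem Collector documentation for your version:

```xml
<fscollector id="Oracle Export Collector">
    <crawlers>
        <crawler id="Exported Files Example">
            <!-- Directory where the database rows were exported as files
                 (hypothetical path). -->
            <startPaths>
                <path>/data/oracle-export</path>
            </startPaths>

            <!-- Same Solr committer as in the HTTP Collector sample above. -->
            <committer class="com.norconex.committer.solr.SolrCommitter">
                <solrURL>http://localhost:8983/solr/collection1</solrURL>
            </committer>
        </crawler>
    </crawlers>
</fscollector>
```

Because the importer and committer modules are shared across collectors, the same KeepOnlyTagger and other handlers from the HTTP sample can be reused here.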

Let us know if you need clarification on anything.

essiembre commented 10 years ago

Did you make good progress with the provided sample? Do you have trouble with something?

raviks007 commented 10 years ago

Hi Pascal, I have not yet checked. Will check next week and will let you know.

Thanks, Ravi

essiembre commented 9 years ago

Hello Ravi. We have not heard in a while so I am assuming you got the answer you were looking for. Please re-open this ticket or create another one if you have more questions/issues.

liar666 commented 8 years ago

I'm re-opening it since the provided example does not work:

  1. because of this bug: https://github.com/Norconex/collector-http/issues/255
  2. because it raises:
com.norconex.collector.core.CollectorException: Cannot load crawler configurations.
        at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfigs(CrawlerConfigLoader.java:93)
        at com.norconex.collector.core.AbstractCollectorConfig.loadFromXML(AbstractCollectorConfig.java:183)
        at com.norconex.collector.core.CollectorConfigLoader.loadCollectorConfig(CollectorConfigLoader.java:78)
        at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:76)
        at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)
Caused by: com.norconex.commons.lang.config.ConfigurationException: This class could not be instantiated: "com.norconex.collector.http.filter.impl.RegexReferenceFilter".
        at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:190)
        at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:333)
        at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:265)
        at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:115)
        at com.norconex.collector.core.crawler.AbstractCrawlerConfig.loadReferenceFilters(AbstractCrawlerConfig.java:387)
        at com.norconex.collector.core.crawler.AbstractCrawlerConfig.loadFromXML(AbstractCrawlerConfig.java:314)
        at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfig(CrawlerConfigLoader.java:123)
        at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfigs(CrawlerConfigLoader.java:83)
        ... 4 more
Caused by: java.lang.ClassNotFoundException: com.norconex.collector.http.filter.impl.RegexURLFilter
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:264)
        at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:188)
        ... 11 more

liar666 commented 8 years ago

According to the minimal example, it has been moved to: com.norconex.collector.core.filter.impl.RegexReferenceFilter

:)
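For reference, with the renamed class the filter section of the sample above would look something like the sketch below. The `<referenceFilters>` element name and class name are taken from this thread and the 2.x documentation; verify them against your installed version:

```xml
<referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
        onMatch="include">http://lucene.apache.org/solr/.*</filter>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
        onMatch="exclude">.+\.(png|jpg|jpeg|gif|ico|css|js)$</filter>
</referenceFilters>
```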

shreya-singh-tech commented 7 years ago

hi, what changes have to be made in the collector-http.sh file to get this xml file to run? When I try to run it, it produces this error:

collector-http.sh: 4: collector-http.sh: realpath: not found
log4j:ERROR Could not read configuration file from URL [file:/log4j.properties].
java.io.FileNotFoundException: /log4j.properties (No such file or directory)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:146)
        at java.io.FileInputStream.<init>(FileInputStream.java:101)
        at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
        at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
        at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
        at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
        at com.norconex.collector.core.AbstractCollector.<clinit>(AbstractCollector.java:58)
log4j:ERROR Ignoring configuration file [file:/log4j.properties].
Invalid configuration file path: /home/iitp/Downloads/norconex-collector-http-2.7.1/document-collector.xml

please help

essiembre commented 7 years ago

realpath is a command often present on your OS, but you can install it if it is missing. Another option is simply to replace export ROOT_DIR=$(realpath $(dirname $0)) with the hardcoded path of the directory where collector-http.sh is found. Example:

export ROOT_DIR=/home/iitp/Downloads/norconex-collector-http-2.7.1

The reason for using realpath is for situations where you invoke the script from a different directory (in which case relative paths may be affected). Hard-coding the path has the same effect.
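If you prefer to keep the path dynamic without realpath, a POSIX-only sketch (assuming the script is invoked by path rather than through a symlink) is:

```shell
#!/bin/sh
# Resolve the directory containing this script using only POSIX
# cd/dirname/pwd, as a substitute for realpath.
ROOT_DIR=$(cd "$(dirname "$0")" && pwd)
export ROOT_DIR
echo "$ROOT_DIR"
```

This resolves the script's own directory to an absolute path no matter where it is invoked from, which is exactly what the realpath line in collector-http.sh is for.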