Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
Apache License 2.0
running collector-http examples with Solr #52

Closed csaezl closed 9 years ago

csaezl commented 9 years ago

I'd like to test the "minimum" and "complex" examples with Solr, but I'm not sure what changes to make to minimum-config.xml and complex-config.xml. I'm trying out Solr at the same time, so my repository is collection1 (C:\solr\example\solr\collection1). I've downloaded "norconex-committer-solr-2.0.0" and copied its bin directory onto collector-http's. I'd appreciate some advice. Thanks, Carlos

essiembre commented 9 years ago

Am I right to assume that you meant the "lib" folder? The content of the lib folder in the norconex-committer-solr-2.0.0 zip file should go in the lib folder of your HTTP Collector installation (i.e. Jars with Jars). If you find duplicate Jars (different versions), you can delete the older ones.

Once you have done this, you can look here for configuration options. For instance, you want to replace this in minimum-config.xml ...

      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">

... with something like this (change localhost:8080 to match your Solr instance)...

  <committer class="com.norconex.committer.solr.SolrCommitter">
      <sourceReferenceField keep="false">document.reference</sourceReferenceField>

The target reference and content fields need to match what you have defined in your Solr config/schema for the Solr unique key and default fulltext field, respectively.

The source reference field shown above is the default. "document.reference" is always a field of every document crawled, unless you explicitly remove it.

Let me know if that works for you.

csaezl commented 9 years ago

I've run collector-http following your advice, but it seems I've got an error. The following text is an excerpt of the execution. If there is a way to send you the full text, let me know. Thank you, Carlos

....................
INFO [AbstractCrawler] Norconex Minimum Test Page: Crawler finishing: committing documents.
INFO [AbstractFileQueueCommitter] Committing 2 files
INFO [SolrCommitter] Sending 2 documents to Solr for update/deletion.
ERROR [AbstractBatchCommitter] Could not commit batched operations.
com.norconex.committer.core.CommitterException: Cannot index document batch to Solr.
    at com.norconex.committer.solr.SolrCommitter.commitBatch(
    at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(
    at com.norconex.committer.core.AbstractBatchCommitter.commitComplete(
    at com.norconex.committer.core.AbstractFileQueueCommitter.commit(
    at com.norconex.collector.core.crawler.AbstractCrawler.execute(
    at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(
    at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(
    at com.norconex.jef4.job.AbstractResumableJob.execute(
    at com.norconex.jef4.suite.JobSuite.runJob(
    at com.norconex.jef4.suite.JobSuite.doExecute(
    at com.norconex.jef4.suite.JobSuite.execute(
    at com.norconex.collector.core.AbstractCollector.start(
    at com.norconex.collector.core.AbstractCollectorLauncher.launch(
    at com.norconex.collector.http.HttpCollector.main(
Caused by: java.lang.IllegalArgumentException: Illegal character in opaque part at index 5: http:\localhost:8939\solr\example\solr\collection1/update?wt=javabin&version=2
    at Source)
    at org.apache.http.client.methods.HttpPost.(
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(
    at com.norconex.committer.solr.SolrCommitter.commitBatch(
    ... 13 more
Caused by: Illegal character in opaque part at index 5: http:\localhost:8939\solr\example\solr\collection1/update?wt=javabin&version=2
    at$ Source)
    at$Parser.checkChars(Unknown Source)
    at$Parser.parse(Unknown Source)
    at Source)
    ... 19 more
....................

essiembre commented 9 years ago

Can you paste the configuration portion you have for Solr? From the stack trace, it seems that your Solr URL has an invalid character in it. Could it be that you have not specified the protocol properly?

I see http:\ in the stacktrace, while it should be http://

Can you double-check that?
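For anyone hitting the same exception, a quick syntax check of the URL before it goes into the committer configuration might look like the Python sketch below. The function name `is_wellformed_http_url` is made up for this illustration; it only checks the shape of the URL and says nothing about whether Solr is actually reachable there.

```python
from urllib.parse import urlsplit


def is_wellformed_http_url(url: str) -> bool:
    """Return True if url has an http(s) scheme, a host, and no backslashes.

    Hypothetical helper for illustration only; not part of Norconex or Solr.
    """
    if "\\" in url:
        # Backslashes (as in "http:\localhost") are not valid URL separators.
        return False
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.hostname)


if __name__ == "__main__":
    for candidate in (
        r"http:\localhost:8939\solr\example\solr\collection1",
        "http://localhost:8983/solr/collection1",
    ):
        print(candidate, "->", is_wellformed_http_url(candidate))
```

The first candidate is the malformed value from the stack trace and is rejected; the second uses `http://` and passes.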

csaezl commented 9 years ago

I'm very, very sorry! You were right. But I'm still getting errors. Thank you, Carlos

....................
INFO [CrawlerEventManager] DOCUMENT_COMMITTED_ADD: roduct/collector-http-test/minimum.php (Subject: SolrCommitter [solrURL=http://localhost:8939/solr/example/solr/collection1, updateUrlParams={}, solrServerFactory=DefaultSolrServerFactory [server=null], com.norconex.committer.solr.SolrCommitter@1851003[queueSize=100,docCount=6,queue=com.norconex.committer.core.impl.FileSystemCommitter@715c6f[directory=/optional/queue/path/],commitBatchSize=10,maxRetries=2,maxRetryWait=5000,operations=[],docCount=0,targetReferenceField=id,sourceReferenceField=document.reference,keepSourceReferenceField=false,targetContentField=text,sourceContentField=,keepSourceContentField=false]])
INFO [AbstractCrawler] Norconex Minimum Test Page: Crawler finishing: committing documents.
INFO [AbstractFileQueueCommitter] Committing 6 files
INFO [SolrCommitter] Sending 6 documents to Solr for update/deletion.
ERROR [AbstractBatchCommitter] Could not commit batched operations.
com.norconex.committer.core.CommitterException: Cannot index document batch to Solr.
    at com.norconex.committer.solr.SolrCommitter.commitBatch(
    at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(
    at com.norconex.committer.core.AbstractBatchCommitter.commitComplete(
    at com.norconex.committer.core.AbstractFileQueueCommitter.commit(
    at com.norconex.collector.core.crawler.AbstractCrawler.execute(
    at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(
    at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(
    at com.norconex.jef4.job.AbstractResumableJob.execute(
    at com.norconex.jef4.suite.JobSuite.runJob(
    at com.norconex.jef4.suite.JobSuite.doExecute(
    at com.norconex.jef4.suite.JobSuite.execute(
    at com.norconex.collector.core.AbstractCollector.start(
    at com.norconex.collector.core.AbstractCollectorLauncher.launch(
    at com.norconex.collector.http.HttpCollector.main(
Caused by: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8939/solr/example/solr/collection1
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(
    at com.norconex.committer.solr.SolrCommitter.commitBatch(
    ... 13 more
Caused by: Connection refused: connect
    at Method)
    at Source)
    at Source)
    at Source)
    at Source)
    at Source)
    at Source)
    at Source)
    at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(
    at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(
    at
    at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(
    at org.apache.http.impl.client.DefaultRequestDirector.execute(
    at org.apache.http.impl.client.AbstractHttpClient.doExecute(
    at org.apache.http.impl.client.CloseableHttpClient.execute(
    at org.apache.http.impl.client.CloseableHttpClient.execute(
    at org.apache.http.impl.client.CloseableHttpClient.execute(
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(
    ... 16 more
....................

essiembre commented 9 years ago

It is trying to connect to this URL: http://localhost:8939/solr/example/solr/collection1

Have you tried contacting this URL in your browser, from the same computer that's running the HTTP Collector? What do you get?

It looks to me like that URL is wrong. Should it be http://localhost:8939/solr/collection1 instead (dropping solr/example/)?

Having a copy of the relevant portion of your configuration would help.

csaezl commented 9 years ago

This is the access to Solr Admin from the browser.

[Screenshot: Solr Admin in the browser, 09-02-2015 18-18-49]

In the picture, to the right, you can see that the instance is at c:\solr\example\solr\collection1, although the URL reads http://localhost:8983/solr/#/collection1.

With http://localhost:8983/solr/collection1 in the browser, I get a 404 error. Anyway, I've put http://localhost:8983/solr/collection1 in minimum-config-solr.xml, and it seems to work (I'm not sure), but I still get errors.
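A note for readers on the two URLs: the part of a URL after `#` is a fragment handled by the browser and is never sent to the server, so the Admin page address cannot be used as the committer URL as-is. The Python sketch below derives a core URL from an Admin page URL, assuming the Admin URL has the form `<base>/#/<core>` as in this thread. The function name `admin_url_to_core_url` is invented for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def admin_url_to_core_url(admin_url: str) -> str:
    """Turn '<base>/#/<core>' into '<base>/<core>'.

    Hypothetical helper; assumes the first fragment segment is the core name.
    """
    parts = urlsplit(admin_url)
    # The fragment is everything after '#', e.g. '/collection1'.
    core = parts.fragment.strip("/").split("/")[0]
    if not core:
        raise ValueError("no core name found after '#/' in " + admin_url)
    path = parts.path.rstrip("/") + "/" + core
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


if __name__ == "__main__":
    print(admin_url_to_core_url("http://localhost:8983/solr/#/collection1"))
```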

Please let me know what other information you need me to send.

Thank you, Carlos

Excerpt from minimum-config-solr.xml:

  <!-- Decide what to do with your files by specifying a Committer. -->
  <committer class="com.norconex.committer.solr.SolrCommitter">
    <sourceReferenceField keep="false">document.reference</sourceReferenceField>

INFO [CrawlerEventManager] DOCUMENT_COMMITTED_ADD: roduct/collector-http-test/minimum.php (Subject: SolrCommitter [solrURL=http://localhost:8983/solr/collection1, updateUrlParams={}, solrServerFactory=DefaultSolrServerFactory [server=null], com.norconex.committer.solr.SolrCommitter@de3cea[queueSize=100,docCount=9,queue=com.norconex.committer.core.impl.FileSystemCommitter@6ba7bf[directory=/optional/queue/path/],commitBatchSize=10,maxRetries=2,maxRetryWait=5000,operations=[],docCount=0,targetReferenceField=id,sourceReferenceField=document.reference,keepSourceReferenceField=false,targetContentField=text,sourceContentField=,keepSourceContentField=false]])
INFO [AbstractCrawler] Norconex Minimum Test Page: Crawler finishing: committing documents.
INFO [AbstractFileQueueCommitter] Committing 9 files
INFO [SolrCommitter] Sending 9 documents to Solr for update/deletion.
ERROR [AbstractBatchCommitter] Could not commit batched operations.
com.norconex.committer.core.CommitterException: Cannot index document batch to Solr.
    at com.norconex.committer.solr.SolrCommitter.commitBatch(
    at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(
    at com.norconex.committer.core.AbstractBatchCommitter.commitComplete(
    at com.norconex.committer.core.AbstractFileQueueCommitter.commit(
    at com.norconex.collector.core.crawler.AbstractCrawler.execute(
    at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(
    at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(
    at com.norconex.jef4.job.AbstractResumableJob.execute(
    at com.norconex.jef4.suite.JobSuite.runJob(
    at com.norconex.jef4.suite.JobSuite.doExecute(
    at com.norconex.jef4.suite.JobSuite.execute(
    at com.norconex.collector.core.AbstractCollector.start(
    at com.norconex.collector.core.AbstractCollectorLauncher.launch(
    at com.norconex.collector.http.HttpCollector.main(
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: ERROR: [doc=] unknown field 'Content-Length'
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(
    at com.norconex.committer.solr.SolrCommitter.commitBatch(
    ... 13 more

essiembre commented 9 years ago

You are making progress. I can see content is sent to Solr now so your committer configuration is now OK.

This last error is about a field being sent to Solr, but not defined in your Solr Schema. This is a typical error with Solr, and it is fairly easy to fix. Here are two options:

Option 1) Add a wildcard field in your Solr schema.xml and Solr will automatically create a new Solr field for every crawled field sent its way.

Option 2) Tell HTTP Collector to only keep the fields you have configured in your Solr schema. You can do this easily by setting a KeepOnlyTagger class in the importer section of your configuration file. Like this:



        <!-- This is what you need to add: -->
        <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger"
                fields="document.reference,id,title,myField,myOtherField,etc" >



The comma-separated list of fields you specify must exist in your Solr config.
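To illustrate the effect of option 2, here is a small Python sketch. It is not Norconex code: the function `keep_only` and the sample metadata are invented for the example. It drops every field that is not on an allow-list, which is how fields such as Content-Length stop reaching Solr.

```python
def keep_only(metadata: dict, allowed: list) -> dict:
    """Return a copy of metadata containing only the allowed field names.

    Illustrative stand-in for the idea behind a keep-only filter.
    """
    allowed_set = set(allowed)
    return {name: value for name, value in metadata.items() if name in allowed_set}


if __name__ == "__main__":
    # Hypothetical crawled metadata, mixing useful fields and HTTP headers.
    crawled = {
        "document.reference": "http://example.com/page.html",
        "title": "Example page",
        "Content-Length": "1234",
        "X-Powered-By": "PHP",
    }
    kept = keep_only(crawled, ["document.reference", "id", "title"])
    print(sorted(kept))  # ['document.reference', 'title']
```

Listing a field that a document does not have (here `id`) is harmless in this sketch; it is simply absent from the result.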

csaezl commented 9 years ago

Since I want to collect and archive in Solr the contents of web pages and the files referenced in them, I'm not sure I can decide the field names in advance. I suppose they depend on the site's web pages. What do you advise? I'm very new to crawlers and Solr.

To continue with my test, I'll try option 1.

Thank you Carlos

essiembre commented 9 years ago

Good idea while you are developing. You will have a clear picture of all fields captured by the crawl activities.

About page references. The HTTP Collector will store all URLs found in a document in a metadata field. That allows you to build search features, such as "find all pages that link to this URL".
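As a rough illustration of that kind of search feature, the Python sketch below builds a reverse lookup from a list of documents held in memory. The field name `collector.referenced-urls` is taken from the Solr output later in this thread; the function name, the sample documents, and the in-memory approach are assumptions made for the example. A real application would more likely query Solr on that field.

```python
from collections import defaultdict


def build_inbound_links(docs: list) -> dict:
    """Map each referenced URL to the set of document ids that link to it."""
    inbound = defaultdict(set)
    for doc in docs:
        for target in doc.get("collector.referenced-urls", []):
            inbound[target].add(doc["id"])
    return dict(inbound)


if __name__ == "__main__":
    # Hypothetical documents; only the field name comes from the thread.
    docs = [
        {"id": "http://example.com/a", "collector.referenced-urls": ["http://example.com/c"]},
        {"id": "http://example.com/b", "collector.referenced-urls": ["http://example.com/c"]},
    ]
    print(sorted(build_inbound_links(docs)["http://example.com/c"]))
```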

There is another HTTP Collector feature you may want to turn on (it is off by default). That is, for every document, store which page linked to it (if many pages point to the same file, only one will be kept). You can enable this by having this config:

  <extractor class="com.norconex.collector.http.url.impl.HtmlLinkExtractor" keepReferrerData="true">

Documentation on HtmlLinkExtractor can be found here.

csaezl commented 9 years ago

I have the "minimum" web pages recorded in Solr and have an idea of the great variety of fields. Something doesn't work as I expected, though: the text on web pages. For example, on, every text should be a candidate for the index. Texts such as:

"Congratulations! If you read this text from your target repository (e.g. file system, search engine, ...) it means that you successfully ran the Norconex HTTP Collector minimum example."

"We are excited that you are trying the Norconex HTTP Collector. This standalone web page was created to help you test your installation is running properly. Once you're done working with this document, make sure to familiarize yourself with the many configuration options available to you on the Norconex HTTP Collector web site"

How can this text be indexed?

And, finally, I'd like PDF, Word, and other files referenced in web pages to be indexed. Could you give me any advice on getting those files indexed in Solr?

Thank you very much. Carlos

essiembre commented 9 years ago

The text you mention should be in Solr. Please provide the following:

As for PDFs and other non-HTML files, they are picked up by default. So unless you explicitly exclude them somehow, you'll get them.

essiembre commented 9 years ago

Another thing... the field that you map the content to in Solr... did you define it with the "stored" flag set to true?

csaezl commented 9 years ago

The field:

   <dynamicField name="*"    type="string"  indexed="true"  stored="true" 

What configuration file?

The Select Solr URL:


The xml result:

<?xml version="1.0" encoding="UTF-8"?>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">1</int>
  <lst name="params">
    <str name="indent">true</str>
    <str name="q">norconex</str>
    <str name="wt">xml</str>
<result name="response" numFound="3" start="0">
    <arr name="Content-Length">
    <arr name="Connection">
    <arr name="X-Powered-By">
    <arr name="Server">
    <str name="id"></str>
    <arr name="SITE">
      <str>Norconex Test Site</str>
    <arr name="collector.referenced-urls">
    <str name="author">Norconex Inc.</str>
    <str name="author_s">Norconex Inc.</str>
    <arr name="title">
      <str>Norconex HTTP Collector Test Page</str>
    <arr name="MS-Author-Via">
    <arr name="Date">
      <str>Mon, 09 Feb 2015 16:13:39 GMT</str>
    <arr name="Content-Location">
    <arr name="Content-Encoding">
    <arr name="collector.content-type">
    <arr name="document.contentFamily">
    <arr name="collector.content-encoding">
    <arr name="Content-Type">
      <str>text/html; charset=UTF-8</str>
    <arr name="document.contentType">
    <arr name="dc:title">
      <str>Norconex HTTP Collector Test Page</str>
    <arr name="collector.depth">
    <long name="_version_">1492660268137709568</long></doc>
    <arr name="Content-Length">
    <arr name="Connection">
    <arr name="X-Powered-By">
    <arr name="Server">
    <str name="id"></str>
    <arr name="SITE">
      <str>Norconex Test Site</str>
    <arr name="collector.referenced-urls">
    <str name="author">Norconex Inc.</str>
    <str name="author_s">Norconex Inc.</str>
    <arr name="title">
      <str>Norconex HTTP Collector Test Page</str>
    <arr name="MS-Author-Via">
    <arr name="Date">
      <str>Mon, 09 Feb 2015 16:13:36 GMT</str>
    <arr name="Content-Location">
    <arr name="Content-Encoding">
    <arr name="collector.content-type">
    <arr name="document.contentFamily">
    <arr name="collector.content-encoding">
    <arr name="Content-Type">
      <str>text/html; charset=UTF-8</str>
    <arr name="document.contentType">
    <arr name="dc:title">
      <str>Norconex HTTP Collector Test Page</str>
    <arr name="collector.depth">
    <long name="_version_">1492660268141903872</long></doc>
    <str name="id"></str>
    <arr name="title">
      <str>Norconex HTTP Collector Test Page</str>
    <long name="_version_">1492660268145049601</long></doc>

essiembre commented 9 years ago

I mean the HTTP Collector configuration file. I see from the above that you kept text as the name of the field where the content is stored in Solr (defined in <targetContentField>). Is this field explicitly defined in your Solr schema? I suspect it is, but it is not flagged to be stored.

csaezl commented 9 years ago

You are right. In HTTP Collector it is "text" by default, and schema.xml states that "content" is for returning and highlighting document content and "text" is for searching the content. Should I change stored="false" for "text", or set <targetContentField> to "content"?


   <!-- Main body of document extracted by SolrCell.
        NOTE: This field is not indexed by default, since it is also copied to "text"
        using copyField below. This is to save space. Use this field for returning and
        highlighting document content. Use the "text" field to search the content. -->

   <field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
   <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

HTTP Collector configuration file:

<?xml version="1.0" encoding="UTF-8"?>
   Copyright 2010-2014 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   See the License for the specific language governing permissions and
   limitations under the License.
<!-- This configuration shows the minimum required and minimum recommended to 
     run a crawler.  
<httpcollector id="Minimum Config HTTP Collector">

  <!-- Decide where to store generated files. -->

    <crawler id="Norconex Minimum Test Page">

      <!-- === Minimum required: =========================================== -->

      <!-- Requires at least one start URL. -->

      <!-- === Minimum recommended: ======================================== -->

      <!-- Where the crawler default directory to generate files is. -->

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />

      <!-- At a minimum make sure you stay on your domain. -->
            onMatch="include" >

          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger"

      <!-- Decide what to do with your files by specifying a Committer. -->
      <committer class="com.norconex.committer.solr.SolrCommitter">
        <sourceReferenceField keep="false">document.reference</sourceReferenceField>


essiembre commented 9 years ago

It all depends on what you want to do with the document content you crawl. Typically, you want to search on it, and it is fine that Solr has the "text" field as indexed="true". If that's all you want to do, leave it as is.

If you do not want to search on it, but you would like to display the content to your application users, then make the target field "content" in the HTTP Collector config (or change the text field in Solr to be stored="true", indexed="false").

If you want to both search on it and display it, you can leave it as is, but also mark the "text" field in your schema as stored="true".

After you change your Solr schema, if you experience issues, the safest approach is to wipe out the existing content in Solr, restart it, and index again.
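The difference between the indexed and stored flags can be pictured with the toy model below. It is a deliberately simplified Python sketch, not how Solr works internally: the `search` function, the schema dictionaries, and the whole-word matching are all assumptions made for illustration. Only indexed fields can be searched, and only stored fields come back in results.

```python
def search(schema: dict, docs: list, field: str, term: str) -> list:
    """Toy search: match on an indexed field, return only stored fields."""
    if not schema.get(field, {}).get("indexed", False):
        # A field that is not indexed cannot be searched at all.
        return []
    hits = []
    for doc in docs:
        words = str(doc.get(field, "")).lower().split()
        if term.lower() in words:
            # Only fields flagged as stored are returned to the caller.
            hits.append({
                name: value
                for name, value in doc.items()
                if schema.get(name, {}).get("stored", False)
            })
    return hits


if __name__ == "__main__":
    docs = [{"id": "1", "text": "Congratulations from Norconex"}]
    searchable_only = {
        "id": {"indexed": True, "stored": True},
        "text": {"indexed": True, "stored": False},
    }
    # The document is found, but its text is not returned.
    print(search(searchable_only, docs, "text", "norconex"))  # [{'id': '1'}]
```

With "text" flagged as both indexed and stored, the same call would return the text as well, which corresponds to the third case described above.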

csaezl commented 9 years ago

Thank you Carlos

essiembre commented 9 years ago

No problem. As it seems you got everything working now, I am closing this issue. Feel free to re-open if you encounter a related issue, or create a new issue.

Thanks for using the Norconex HTTP Collector and good luck with your project!

csaezl commented 9 years ago

One final question. I've realized that when processing the "minimum" test (and the same goes for "complex"), only page "" is processed. I assumed that page "" should also have been processed, because it is referenced in "minimum.php". Isn't that the way the crawler is supposed to work? Carlos

essiembre commented 9 years ago

Your expectations are good, but your configuration does not match your expectations. :-)

The sample configuration is limited to crawling only one page on purpose (since it is just a test). There are two configuration settings at play here:

I recommend you do not remove these, but change them instead to match the site you want to crawl. Put a reasonable max depth (e.g. 20), and change the reference filter to match the domain name you are crawling (unless you want to crawl the entire internet!).
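How a maximum depth and a reference filter work together can be sketched with a small breadth-first crawl over an in-memory link graph. This is a simplified Python illustration, not the HTTP Collector's implementation: the function `crawl`, the sample URLs, and the convention that the start page has depth 0 are all assumptions of the sketch.

```python
import re
from collections import deque


def crawl(start_url: str, links: dict, max_depth: int, include_pattern: str) -> list:
    """Visit pages breadth-first, bounded by depth and a URL include filter."""
    include = re.compile(include_pattern)
    visited = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if not include.fullmatch(url):
            continue  # rejected by the reference filter
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the maximum depth
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


if __name__ == "__main__":
    # Hypothetical site: a links to b and to another domain, b links to c.
    links = {
        "http://example.com/a": ["http://example.com/b", "http://other.org/x"],
        "http://example.com/b": ["http://example.com/c"],
    }
    pattern = r"http://example\.com/.*"
    print(crawl("http://example.com/a", links, 0, pattern))
    print(crawl("http://example.com/a", links, 20, pattern))
```

In this sketch a depth of 0 yields only the start page, while a larger depth follows links but still stays on the filtered domain.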

I welcome your questions anytime and I am glad to see you are making good progress, but I would appreciate it if you created new tickets/issues for new questions. That keeps each question and its answers on-topic with the issue title, which helps others find answers more easily when looking at the closed issue list.