Closed — csaezl closed this issue 9 years ago
Do I assume right that you meant the "lib" folder? The content of the lib folder in the norconex-committer-solr-2.0.0 zip file should go in the lib folder of your HTTP Collector installation (i.e. Jars with Jars). If you find duplicate Jars (different versions), you can delete the older ones.
Once you have done this, you can look here for configuration options. For instance, you will want to replace this, from minimum-config.xml ...
<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
<directory>./examples-output/minimum/crawledFiles</directory>
</committer>
... to something like this (change localhost:8080 to match your Solr instance)...
<committer class="com.norconex.committer.solr.SolrCommitter">
<solrURL>http://localhost:8080/solr/collection1</solrURL>
<sourceReferenceField keep="false">document.reference</sourceReferenceField>
<targetReferenceField>id</targetReferenceField>
<targetContentField>text</targetContentField>
<commitBatchSize>10</commitBatchSize>
<queueDir>/optional/queue/path/</queueDir>
<queueSize>100</queueSize>
<maxRetries>2</maxRetries>
<maxRetryWait>5000</maxRetryWait>
</committer>
The target reference and content fields need to match what you have defined in your Solr config/schema for the Solr unique key and default fulltext field, respectively.
The source reference field shown is the default. "document.reference" is always a field of every crawled document, unless you explicitly take it off.
Let me know if that works for you.
I've run collector-http following your advice but it seems that I've got an error. The following text is an excerpt of the execution. If there is a way to send you the full text, let me know. Thank you Carlos

....................

INFO [AbstractCrawler] Norconex Minimum Test Page: Crawler finishing: committing documents.
INFO [AbstractFileQueueCommitter] Committing 2 files
INFO [SolrCommitter] Sending 2 documents to Solr for update/deletion.
ERROR [AbstractBatchCommitter] Could not commit batched operations.
com.norconex.committer.core.CommitterException: Cannot index document batch to Solr.
at com.norconex.committer.solr.SolrCommitter.commitBatch(SolrCommitter.java:198)
at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:178)
at com.norconex.committer.core.AbstractBatchCommitter.commitComplete(AbstractBatchCommitter.java:158)
at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:249)
at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:246)
at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:207)
at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:169)
at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:351)
at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:301)
at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:171)
at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:116)
at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:69)
at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)
Caused by: java.lang.IllegalArgumentException: Illegal character in opaque part at index 5: http:\localhost:8939\solr\example\solr\collection1/update?wt=javabin&version=2
at java.net.URI.create(Unknown Source)
at org.apache.http.client.methods.HttpPost.
Can you paste the configuration portion you have for Solr? From the stack trace, it seems your Solr URL has an invalid character in it. Could it be that you have not specified the protocol properly?
I see http:\ in the stacktrace, while it should be http://
Can you double-check that?
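As a side note, a quick way to spot this kind of malformed URL is to parse it and check that the scheme and host come out as expected. A small illustrative sketch (the URLs here just mirror the ones from the log):

```python
from urllib.parse import urlparse

good = urlparse("http://localhost:8939/solr/example/solr/collection1")
bad = urlparse(r"http:\localhost:8939\solr\example\solr\collection1")

# A well-formed http URL has its host in netloc...
print(good.scheme, good.netloc)  # http localhost:8939

# ...while the backslash form leaves netloc empty: everything after the
# scheme is treated as an opaque path, which is what trips up java.net.URI.
print(bad.scheme, bad.netloc, bad.path)
```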
I'm very, very, very, very sorry!!!!!!! You were right. But I'm still getting errors. Thank you Carlos
....................
INFO [CrawlerEventManager] DOCUMENT_COMMITTED_ADD: http://www.norconex.com/product/collector-http-test/minimum.php (Subject: SolrCommitter [solrURL=http://localhost:8939/solr/example/solr/collection1, updateUrlParams={}, solrServerFactory=DefaultSolrServerFactory [server=null], com.norconex.committer.solr.SolrCommitter@1851003[queueSize=100,docCount=6,queue=com.norconex.committer.core.impl.FileSystemCommitter@715c6f[directory=/optional/queue/path/],commitBatchSize=10,maxRetries=2,maxRetryWait=5000,operations=[],docCount=0,targetReferenceField=id,sourceReferenceField=document.reference,keepSourceReferenceField=false,targetContentField=text,sourceContentField=
Caused by: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8939/solr/example/solr/collection1
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:500)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at com.norconex.committer.solr.SolrCommitter.commitBatch(SolrCommitter.java:195)
... 13 more
Caused by: java.net.ConnectException: Connection refused: connect
at java.net.DualStackPlainSocketImpl.connect0(Native Method)
at java.net.DualStackPlainSocketImpl.socketConnect(Unknown Source)
at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
at java.net.PlainSocketImpl.connect(Unknown Source)
at java.net.SocksSocketImpl.connect(Unknown Source)
at java.net.Socket.connect(Unknown Source)
at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:117)
at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:178)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:610)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:445)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:395)
... 16 more
....................
It is trying to connect to this URL: http://localhost:8939/solr/example/solr/collection1
Have you tried contacting this URL in your browser, from the same computer that's running the HTTP Collector? What do you get?
It looks to me like that URL is wrong. Should it be http://localhost:8939/solr/collection1 instead? (dropping solr/example/)
Having a copy of the relevant portion of your configuration would help.
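Independent of Solr itself, the "Connection refused" part of the stack trace can be verified by checking whether anything is listening on that host and port at all. A minimal sketch (the host and port below are the ones from the log above; adjust as needed):

```python
import socket

def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# "Connection refused" in the stack trace suggests nothing is listening here:
print(is_listening("localhost", 8939))
```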
This is the access to Solr Admin from the browser.
In the picture, to the right, you can see that the instance is at c:\solr\example\solr\collection1, although the url reads http://localhost:8983/solr/#/collection1
With http://localhost:8983/solr/collection1 in the browser, I get error 404. Anyway, I've put http://localhost:8983/solr/collection1 in minimum-config-solr.xml, and it seems to work (not sure), but I still get errors.
Please, let me know what more information you need me to send you.
Thank you Carlos
excerpt from minimum-config-solr.xml
<!-- Decide what to do with your files by specifying a Committer. -->
<committer class="com.norconex.committer.solr.SolrCommitter">
<solrURL>http://localhost:8983/solr/collection1</solrURL>
<sourceReferenceField keep="false">document.reference</sourceReferenceField>
<targetReferenceField>id</targetReferenceField>
<targetContentField>text</targetContentField>
<commitBatchSize>10</commitBatchSize>
<queueDir>/optional/queue/path/</queueDir>
<queueSize>100</queueSize>
<maxRetries>2</maxRetries>
<maxRetryWait>5000</maxRetryWait>
</committer>
INFO [CrawlerEventManager] DOCUMENT_COMMITTED_ADD: http://www.norconex.com/product/collector-http-test/minimum.php (Subject: SolrCommitter [solrURL=http://localhost:8983/solr/collection1, updateUrlParams={}, solrServerFactory=DefaultSolrServerFactory [server=null], com.norconex.committer.solr.SolrCommitter@de3cea[queueSize=100,docCount=9,queue=com.norconex.committer.core.impl.FileSystemCommitter@6ba7bf[directory=/optional/queue/path/],commitBatchSize=10,maxRetries=2,maxRetryWait=5000,operations=[],docCount=0,targetReferenceField=id,sourceReferenceField=document.reference,keepSourceReferenceField=false,targetContentField=text,sourceContentField=
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: ERROR: [doc=http://www.norconex.com/product/collector-http-test/complex1.php] unknown field 'Content-Length'
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:495)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at com.norconex.committer.solr.SolrCommitter.commitBatch(SolrCommitter.java:195)
You are making progress. I can see content is now sent to Solr, so your committer configuration is OK.
This last error is about a field being sent to Solr, but not defined in your Solr Schema. This is a typical error with Solr, and it is fairly easy to fix. Here are two options:
Option 1) Add a wildcard field in your Solr schema.xml and Solr will automatically create a new Solr field for every crawled field sent its way.
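For option 1, the wildcard entry in your Solr schema.xml could look like this (the field type name may differ depending on your schema):

```xml
<dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true"/>
```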
Option 2) Tell HTTP Collector to only keep the fields you have configured in your Solr schema. You can do this easily by adding a KeepOnlyTagger in the <importer> section of your configuration file. Like this:
<importer>
<postParseHandlers>
<!-- This is what you need to add: -->
<tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger"
fields="document.reference,id,title,myField,myOtherField,etc" >
</tagger>
</postParseHandlers>
</importer>
The comma-separated list of fields you specify must exist in your Solr schema.
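Conceptually, KeepOnlyTagger just filters the document metadata down to a whitelist. A simplified sketch of that behavior (the field names below are only examples):

```python
def keep_only(metadata: dict, fields: str) -> dict:
    """Keep only the comma-separated fields, dropping everything else."""
    keep = {f.strip() for f in fields.split(",")}
    return {k: v for k, v in metadata.items() if k in keep}

doc = {
    "document.reference": "http://example.com/page.html",
    "title": "Example",
    "Content-Length": "3623",   # the HTTP header field Solr complained about
}
print(keep_only(doc, "document.reference,title"))
# {'document.reference': 'http://example.com/page.html', 'title': 'Example'}
```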
Since I want to collect and archive in Solr web page contents and the files referenced in the web pages, I'm not sure I can decide the field names. I suppose it depends on the site's web pages. What do you advise me? I'm very new to crawlers and Solr.
Just to follow on my test I'll try option 1.
Thank you Carlos
Good idea while you are developing. You will have a clear picture of all fields captured by the crawl activities.
About page references. The HTTP Collector will store all URLs found in a document in a metadata field. That allows you to build search features, such as "find all pages that link to this URL".
There is another HTTP Collector feature you may want to turn on (it is off by default). That is, for every document, store which page linked to it (if many pages point to the same file, only one will be kept). You can enable this by having this config:
<extractor class="com.norconex.collector.http.url.impl.HtmlLinkExtractor" keepReferrerData="true">
Documentation on HtmlLinkExtractor can be found here.
I have "minimum" web pages recorded in Solr and have an idea of the great variety of fields. There is something that doesn't work as I expected: the texts on the web pages. For example, on http://www.norconex.com/product/collector-http-test/minimum.php, every text should be a candidate for the index. Texts such as:
"Congratulations! If you read this text from your target repository (e.g. file system, search engine, ...) it means that you successfully ran the Norconex HTTP Collector minimum example."
"We are excited that you are trying the Norconex HTTP Collector. This standalone web page was created to help you test your installation is running properly. Once you're done working with this document, make sure to familiarize yourself with the many configuration options available to you on the Norconex HTTP Collector web site"
How can this text be indexed?
And, finally, I'd like PDF, Word, etc. files, referenced in web pages, to be indexed. Could you give me any advice on getting those files indexed in Solr?
Thank you very much. Carlos
The text you mention should be in Solr. Please provide the following:
As for PDFs and other non-HTML files, they are picked up by default. So unless you explicitly exclude them somehow, you'll get them.
Another thing... the field that you map the content to in Solr... did you define it with the "store" flag being true?
The field:
<dynamicField name="*" type="string" indexed="true" stored="true"
multiValued="true"/>
What configuration file?
The Select Solr URL:
http://localhost:8983/solr/collection1/select?q=norconex&wt=xml&indent=true
The xml result:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="indent">true</str>
<str name="q">norconex</str>
<str name="wt">xml</str>
</lst>
</lst>
<result name="response" numFound="3" start="0">
<doc>
<arr name="Content-Length">
<str>3623</str>
</arr>
<arr name="Connection">
<str>close</str>
</arr>
<arr name="X-Powered-By">
<str>PleskLin</str>
</arr>
<arr name="Server">
<str>Apache</str>
</arr>
<str name="id">http://www.norconex.com/product/collector-http-test/complex1.php</str>
<arr name="SITE">
<str>Norconex Test Site</str>
</arr>
<arr name="collector.referenced-urls">
<str>http://www.norconex.com/collectors/img/collector-http.png</str>
<str>http://www.norconex.com/collectors/img/norconex-logo-blue-241x51.png</str>
</arr>
<str name="author">Norconex Inc.</str>
<str name="author_s">Norconex Inc.</str>
<arr name="title">
<str>Norconex HTTP Collector Test Page</str>
</arr>
<arr name="MS-Author-Via">
<str>DAV</str>
</arr>
<arr name="Date">
<str>Mon, 09 Feb 2015 16:13:39 GMT</str>
</arr>
<arr name="Content-Location">
<str>http://www.norconex.com/product/collector-http-test/complex1.php</str>
</arr>
<arr name="Content-Encoding">
<str>UTF-8</str>
</arr>
<arr name="collector.content-type">
<str>text/html</str>
</arr>
<arr name="document.contentFamily">
<str>html</str>
</arr>
<arr name="collector.content-encoding">
<str>text/html</str>
</arr>
<arr name="Content-Type">
<str>text/html</str>
<str>text/html; charset=UTF-8</str>
</arr>
<arr name="document.contentType">
<str>text/html</str>
</arr>
<arr name="dc:title">
<str>Norconex HTTP Collector Test Page</str>
</arr>
<arr name="collector.depth">
<str>0</str>
</arr>
<long name="_version_">1492660268137709568</long></doc>
<doc>
<arr name="Content-Length">
<str>3623</str>
</arr>
<arr name="Connection">
<str>close</str>
</arr>
<arr name="X-Powered-By">
<str>PleskLin</str>
</arr>
<arr name="Server">
<str>Apache</str>
</arr>
<str name="id">http://www.norconex.com/product/collector-http-test/complex2.php</str>
<arr name="SITE">
<str>Norconex Test Site</str>
</arr>
<arr name="collector.referenced-urls">
<str>http://www.norconex.com/collectors/img/collector-http.png</str>
<str>http://www.norconex.com/collectors/img/norconex-logo-blue-241x51.png</str>
</arr>
<str name="author">Norconex Inc.</str>
<str name="author_s">Norconex Inc.</str>
<arr name="title">
<str>Norconex HTTP Collector Test Page</str>
</arr>
<arr name="MS-Author-Via">
<str>DAV</str>
</arr>
<arr name="Date">
<str>Mon, 09 Feb 2015 16:13:36 GMT</str>
</arr>
<arr name="Content-Location">
<str>http://www.norconex.com/product/collector-http-test/complex2.php</str>
</arr>
<arr name="Content-Encoding">
<str>UTF-8</str>
</arr>
<arr name="collector.content-type">
<str>text/html</str>
</arr>
<arr name="document.contentFamily">
<str>html</str>
</arr>
<arr name="collector.content-encoding">
<str>text/html</str>
</arr>
<arr name="Content-Type">
<str>text/html</str>
<str>text/html; charset=UTF-8</str>
</arr>
<arr name="document.contentType">
<str>text/html</str>
</arr>
<arr name="dc:title">
<str>Norconex HTTP Collector Test Page</str>
</arr>
<arr name="collector.depth">
<str>0</str>
</arr>
<long name="_version_">1492660268141903872</long></doc>
<doc>
<str name="id">http://www.norconex.com/product/collector-http-test/minimum.php</str>
<arr name="title">
<str>Norconex HTTP Collector Test Page</str>
</arr>
<long name="_version_">1492660268145049601</long></doc>
</result>
</response>
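As an aside, the numFound count (and any field) in a response like the one above can be read programmatically with a standard XML parser. A small sketch using a trimmed-down version of the response:

```python
import xml.etree.ElementTree as ET

xml_response = """<?xml version="1.0" encoding="UTF-8"?>
<response>
  <result name="response" numFound="3" start="0">
    <doc>
      <str name="id">http://www.norconex.com/product/collector-http-test/minimum.php</str>
    </doc>
  </result>
</response>"""

root = ET.fromstring(xml_response)
result = root.find("result")
print(result.get("numFound"))  # 3
for doc in result.findall("doc"):
    print(doc.find("str[@name='id']").text)
```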
I mean the HTTP Collector configuration file. I see from the above that you kept "text" as the field name where to store the content in Solr (defined in <targetContentField>). Is this field explicitly defined in your Solr schema? I suspect it is, but that it is not flagged to be stored.
You are right. In the HTTP Collector:
<targetContentField>text</targetContentField>
<!-- Main body of document extracted by SolrCell.
NOTE: This field is not indexed by default, since it is also copied to "text"
using copyField below. This is to save space. Use this field for returning and
highlighting document content. Use the "text" field to search the content. -->
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
HTTP Collector configuration file:
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright 2010-2014 Norconex Inc.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!-- This configuration shows the minimum required and minimum recommended to
run a crawler.
-->
<httpcollector id="Minimum Config HTTP Collector">
<!-- Decide where to store generated files. -->
<progressDir>./examples-output/minimum/progress</progressDir>
<logsDir>./examples-output/minimum/logs</logsDir>
<crawlers>
<crawler id="Norconex Minimum Test Page">
<!-- === Minimum required: =========================================== -->
<!-- Requires at least one start URL. -->
<startURLs>
<url>http://www.norconex.com/product/collector-http-test/minimum.php</url>
</startURLs>
<!-- === Minimum recommended: ======================================== -->
<!-- Where the crawler default directory to generate files is. -->
<workDir>./examples-output/minimum</workDir>
<!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
<maxDepth>0</maxDepth>
<!-- Be as nice as you can to sites you crawl. -->
<delay default="5000" />
<!-- At a minimum make sure you stay on your domain. -->
<referenceFilters>
<filter
class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
onMatch="include" >
http://www\.norconex\.com/product/collector-http-test/.*
</filter>
</referenceFilters>
<importer>
<postParseHandlers>
<!-- If your target repository does not support arbitrary fields,
make sure you only keep the fields you need. -->
<tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger"
fields="title,keywords,description,document.reference"/>
</postParseHandlers>
</importer>
<!-- Decide what to do with your files by specifying a Committer. -->
<committer class="com.norconex.committer.solr.SolrCommitter">
<solrURL>http://localhost:8983/solr/collection1</solrURL>
<sourceReferenceField keep="false">document.reference</sourceReferenceField>
<targetReferenceField>id</targetReferenceField>
<targetContentField>text</targetContentField>
<commitBatchSize>10</commitBatchSize>
<queueDir>/optional/queue/path/</queueDir>
<queueSize>100</queueSize>
<maxRetries>2</maxRetries>
<maxRetryWait>5000</maxRetryWait>
</committer>
</crawler>
</crawlers>
</httpcollector>
It all depends what you want to do with the document content you crawl. Typically, you want to search on it, and then it is fine that Solr has the "text" field as indexed="true". If that's all you want to do, leave it as is.
If you do not want to search on it, but you would like to display the content to your application users, then make the target field "content" in the HTTP Collector config (or change the "text" field in Solr to be stored="true", indexed="false").
If you want to do both, search and display, you can leave it as is, but also mark the "text" field in your schema as stored="true".
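For the "search and display" case, the schema line would become (only the stored attribute changes from the earlier excerpt):

```xml
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>
```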
After you change your Solr schema, if you experience issues, the safest is to wipe out the existing content in Solr, restart it, and index again.
Thank you Carlos
No problem. As it seems you got everything working now, I am closing this issue. Feel free to re-open if you encounter a related issue, or create a new issue.
Thanks for using the Norconex HTTP Collector and good luck with your project!
One final question. I've realized that when processing the "minimum" test (the same for the "complex" one), only the page "http://www.norconex.com/product/collector-http-test/minimum.php" is processed. I supposed that the page "http://www.norconex.com/collectors/collector-http/configuration" should have also been processed, because it is referenced in "minimum.php". Isn't that the way the crawler is supposed to work? Carlos
Your expectations are good, but your configuration does not match your expectations. :-)
The sample configuration is limited to crawling only one page on purpose (since that's just a test). There are two configuration settings at play here:
1) maxDepth is set to zero, which means it won't crawl any deeper than the URL(s) you provide (so 1 page only in this case).
2) A reference filter is set to only accept URLs that match the test page URL (http://www\.norconex\.com/product/collector-http-test/.*).
I recommend you do not remove these, but change them instead to match the site you want to crawl. Put a reasonable max depth (e.g. 20), and change the reference filter to match the domain name you are crawling (unless you want to crawl the entire internet!).
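For example, to crawl a whole site, those two settings could be changed along these lines (www.example.com is a placeholder for your own domain):

```xml
<!-- Crawl up to 20 links deep instead of only the start URL(s). -->
<maxDepth>20</maxDepth>

<!-- Stay on your own domain. -->
<referenceFilters>
  <filter
      class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
      onMatch="include" >
    http://www\.example\.com/.*
  </filter>
</referenceFilters>
```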
I welcome your questions anytime and I am glad to see you are making good progress, but I would appreciate it if you create new tickets/issues for new questions. It keeps your questions separate, with answers staying "on-topic" with your title, helping others find answers more easily when looking through the closed issue list.
I'd like to test the "minimum" and "complex" examples with Solr but I'm not sure what changes to make to minimum-config.xml and complex-config.xml. I'm trying Solr at the same time, so my repository is collection1 (C:\solr\example\solr\collection1). I've downloaded "norconex-committer-solr-2.0.0" and copied the bin directory onto collector-http's. I'd appreciate some advice. Thanks Carlos