Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

HTTP Collector interrupted #57

Closed csaezl closed 9 years ago

csaezl commented 9 years ago

While running HTTP Collector on a non-Norconex site, after some thousands of documents had been committed to Solr, I had to interrupt it. I interrupted the run by closing the DOS box. After that, every new run, this time on the Norconex site (minimum example), seems to try to commit to Solr an object with an address that belongs to the non-Norconex site, and it produces an error. I supposed that HTTP Collector was trying to resume the interrupted job, so I deleted the example-output directory. I ran it again but got the same result. I also created the Solr repository from scratch and deleted its log (I don't really know if that helps). Anyway, I ran it and got the same result: the non-Norconex address trying to commit to Solr.

The questions: is there any other place where HTTP Collector keeps information about interrupted jobs that I have to delete? And do you think this is a matter of HTTP Collector or perhaps of Solr? I can provide the HTTP Collector log (with the error) and the Solr log, where the error also appears.

Thank you,
Carlos

essiembre commented 9 years ago

When you want to start fresh, it is best to delete all generated folders/files. Specifically, deleting the directories you have configured as <progressDir>, <logsDir>, and <workDir> should clear the previous progress for you.

Give this a try and let me know how that worked for you.

You can find another view here of where these config tags should go.
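
If it helps, here is a throwaway sketch (assuming the example layout, where <progressDir>, <logsDir>, and <workDir> all point under ./examples-output) that wipes everything generated in one go:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;

public class CleanSlate {
    public static void main(String[] args) throws IOException {
        // Assumes progressDir, logsDir, and workDir all live under this root.
        Path root = Paths.get("./examples-output");
        if (Files.exists(root)) {
            // Delete children before parents (reverse depth-first order).
            Files.walk(root)
                 .sorted(Comparator.reverseOrder())
                 .forEach(p -> p.toFile().delete());
        }
    }
}
```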

csaezl commented 9 years ago

Unfortunately, I have tried that and still get the error. <logsDir>, <workDir>, and <progressDir> all refer to ./examples-output/minimum2, and I delete ./examples-output to begin a new run.

essiembre commented 9 years ago

Not sure what is going on. Please attach your latest config so I can try to replicate.

csaezl commented 9 years ago

Here is the config file:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- 
   Copyright 2010-2014 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<!-- This configuration shows the minimum required and minimum recommended to 
     run a crawler.  
     -->
<httpcollector id="Minimum-3 Config HTTP Collector">

  <!-- Decide where to store generated files. -->
  <progressDir>./examples-output/minimum3/progress</progressDir>
  <logsDir>./examples-output/minimum3/logs</logsDir>

  <crawlers>
    <crawler id="Norconex Minimum-3 Test Page">

      <!-- === Minimum required: =========================================== -->

      <!-- Requires at least one start URL. -->
      <startURLs>
        <url>http://www.norconex.com/product/collector-http-test/minimum.php</url>
      </startURLs>

      <!-- === Minimum recommended: ======================================== -->

      <!-- Default directory where the crawler generates files. -->
      <workDir>./examples-output/minimum3</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>10</maxDepth>

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />

      <!-- At a minimum make sure you stay on your domain. -->
      <referenceFilters>
        <filter 
            class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="include" >
          http://www\.norconex\.com/.*
        </filter>
      </referenceFilters>

      <!-- Decide what to do with your files by specifying a Committer. -->
      <committer class="com.norconex.committer.solr.SolrCommitter">
        <solrURL>http://localhost:8983/solr/collection1</solrURL>
        <sourceReferenceField keep="false">document.reference</sourceReferenceField>
        <targetReferenceField>id</targetReferenceField>
        <targetContentField>content</targetContentField>
        <commitBatchSize>10</commitBatchSize>
        <queueDir>/optional/queue/path/</queueDir>
        <queueSize>100</queueSize>
        <maxRetries>2</maxRetries>
        <maxRetryWait>5000</maxRetryWait>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>
```

I've deleted ./examples-output and ran it again. Now I don't get documents with another site's address. The documents are committed to Solr, but with an error:

At the beginning of the run:

```
log4j:ERROR Could not find value for key log4j.appender.INFO
log4j:ERROR Could not instantiate appender named "INFO".
```

After that, all seems to work fine. I get DOCUMENT_COMMITTED_ADD for the three documents, but then an error occurs:

```
INFO  [AbstractCrawler] Norconex Minimum-3 Test Page: 100% completed (3 processed/3 total)
INFO  [AbstractCrawler] Norconex Minimum-3 Test Page: Crawler finishing: committing documents.
INFO  [AbstractFileQueueCommitter] Committing 100 files
INFO  [SolrCommitter] Sending 10 documents to Solr for update/deletion.
ERROR [AbstractBatchCommitter] Could not commit batched operations.
com.norconex.committer.core.CommitterException: Cannot index document batch to Solr.
        at com.norconex.committer.solr.SolrCommitter.commitBatch(SolrCommitter.java:198)
        at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:178)
        at com.norconex.committer.core.AbstractBatchCommitter.cacheOperationAndCommitIfReady(AbstractBatchCommitter.java:207)
        at com.norconex.committer.core.AbstractBatchCommitter.commitAddition(AbstractBatchCommitter.java:142)
        at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:238)
        at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:246)
        at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:207)
        at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:169)
        at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
        at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:351)
        at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:301)
        at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:171)
        at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:116)
        at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:69)
        at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Illegal to have multiple roots (start tag in epilog?).
 at [row,col {unknown-source}]: [283,1096]
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:495)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
        at com.norconex.committer.solr.SolrCommitter.commitBatch(SolrCommitter.java:195)
        ... 14 more
```

csaezl commented 9 years ago

I've stopped Solr and started it again, and found out that despite the error (three errors, one for each document), the three documents have been registered in Solr. I'm confused. What can I expect? I'm not very confident.

essiembre commented 9 years ago

Which version of Solr do you use? On which app server? That error has been reported before on Solr JIRA: https://issues.apache.org/jira/browse/SOLR-5402

Do you have huge documents? Maybe try increasing your app server's maximum POST size, or try lowering the commitBatchSize to 1 just to see if that helps.

If all else fails, it may be a communication problem between the SolrJ client version and your Solr version.

The Solr libraries used by the Solr Committer at this time are for version 4.7.0. If you have a more recent version, it may be that you need to update the Solr JARs found with your Collector installation.

In the lib folder, replace the Solr JARs (such as solr-solrj-4.7.0.jar) with versions matching your Solr version.

If you are using ZooKeeper, consider upgrading zookeeper-3.4.5 too.

If the library upgrade fixes it for you, I'll upgrade the libraries in a new release of the Solr Committer.

csaezl commented 9 years ago

I'm not sure which file replaces which. From your list of files, only solr-solrj-4.7.0.jar and zookeeper-3.4.5 appear in Norconex-collector-http-2.0.2\lib.

I was using version 4.10.1, but I have 5.0.0 installed now.

So, do you mean that I only have to replace solr-solrj-4.7.0.jar with solr-solrj-5.0.0.jar? I'm not sure if I'm using ZooKeeper, but I can upgrade it anyway.

csaezl commented 9 years ago

I'm using Jetty, which comes with the Solr distribution.

essiembre commented 9 years ago

I suggest you upgrade anything Solr-related in the lib folder to the more recent versions. If you do not know whether you are using ZooKeeper or not, that means you are not. :-)

Not sure if the 5.0 API is backward compatible though. You'll soon find out.

csaezl commented 9 years ago

In the C:\Norconex-collector-http-2.0.2\lib folder there are 85 files. Matching each one by hand against the ones in the various folders under solr-5.0.0 is quite error-prone. Is there a safer way?

essiembre commented 9 years ago

I have not tested with Solr 5 myself yet, but you don't have to try to reconcile everything. Start with just replacing solr-solrj-4.7.0.jar with the 5.0 equivalent and see what happens. Despite any error, make sure to check whether docs appear in Solr.

If that does not work, you can either troubleshoot further, upgrade the Committer code yourself (if necessary), or we can make this issue a feature request for a next release (to support Solr 5).

csaezl commented 9 years ago

I've upgraded to solr-solrj-5.0.0 and got this:

```
INFO  [AbstractCrawlerConfig] Reference filter loaded: com.norconex.collector.core.filter.impl.RegexReferenceFilter@79497d11[onMatch=INCLUDE,caseSensitive=false,pattern=http://www\.norconex\.com/.*,regex=http://www\.norconex\.com/.*]
Exception in thread "main" java.lang.VerifyError: Bad type on operand stack in method com.norconex.committer.solr.SolrCommitter$DefaultSolrServerFactory.createSolrServer(Lcom/norconex/committer/solr/SolrCommitter;)Lorg/apache/solr/client/solrj/SolrServer; at offset 39
        at com.norconex.committer.solr.SolrCommitter.<init>(SolrCommitter.java:122)
        at com.norconex.committer.solr.SolrCommitter.<init>(SolrCommitter.java:114)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at java.lang.Class.newInstance(Class.java:374)
        at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:175)
        at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:292)
        at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:258)
        at com.norconex.collector.core.crawler.AbstractCrawlerConfig.loadFromXML(AbstractCrawlerConfig.java:318)
        at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfig(CrawlerConfigLoader.java:123)
        at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfigs(CrawlerConfigLoader.java:83)
        at com.norconex.collector.core.AbstractCollectorConfig.loadFromXML(AbstractCollectorConfig.java:176)
        at com.norconex.collector.core.CollectorConfigLoader.loadCollectorConfig(CollectorConfigLoader.java:78)
        at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:65)
        at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)
```

essiembre commented 9 years ago

Not sure what could cause that, other than JAR incompatibilities maybe. I am marking this as a feature request to support Solr 5.

csaezl commented 9 years ago

Thank you. Anyway, I first got "Illegal to have multiple roots (start tag in epilog?)" after interrupting an HTTP Collector run. Before that, I had run your two examples, and the same examples against other websites, and they ran fine. That was with version 4.10.3.

csaezl commented 9 years ago

Doing more testing, I've found a folder structure on my disk, c:\optional\queue\path\ with "add" and "remove" subfolders. I suppose it is fed by HTTP Collector. Is that right?

If I delete that structure (with the content from a previous run) before running HTTP Collector, HTTP Collector seems to finish without errors. After all the tests I've done for days (even weeks), there were "add" and "remove" entries for the same reference in some cases. Perhaps that could be the reason for the error "Illegal to have multiple roots (start tag in epilog?)". The description of this error, as seen on some sites on the internet, is: "SOLR-1752: SolrJ fails with exception when passing document ADDs and DELETEs in the same request using XML request writer (but not binary request writer)". I don't know if the two things are related. Do you have information on how to manage that structure?

Thank you,
Carlos

essiembre commented 9 years ago

As you can see here, adds and deletes are both added to the same request. This is to ensure the order of operations is preserved, which may be important in some cases (e.g. whether the same document is deleted before or after it was added makes an important difference).
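
For illustration only, a minimal SolrJ 4.x sketch of such a mixed request (the URLs and field values are made up; this is not the Committer's actual code):

```java
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class MixedBatchSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "http://www.example.com/page.html");
        doc.addField("content", "sample content");

        // An add and a delete queued on the same request, in order:
        UpdateRequest request = new UpdateRequest();
        request.add(doc);
        request.deleteById("http://www.example.com/old-page.html");

        // Per SOLR-1752, serializing a mixed request like this with the
        // XML request writer fails; a binary (javabin) writer handles it.
        request.process(server);
        server.commit();
    }
}
```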

It was never an issue because the binary mode is the default behavior. I suspect the SolrJ API sees that the client and server versions do not match and then falls back to XML for whatever reason (if there is not another cause for this).
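
If that fallback theory is right, one way to test it would be to force the binary (javabin) writer on the client. A quick sketch against the SolrJ 4.x API, for experimenting outside the Committer:

```java
import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class BinaryWriterSketch {
    public static void main(String[] args) {
        HttpSolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/collection1");
        // Serialize update requests as javabin instead of XML:
        server.setRequestWriter(new BinaryRequestWriter());
    }
}
```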

That's why replacing the SolrJ JARs to match your Solr server version would have been the easiest. Otherwise, when we upgrade the library to support Solr 5, we should also compare the client/server versions and change how operations are added to the request when using the XML stream (having deletes and adds handled separately).
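
Something like this sketch, where order is preserved by flushing one operation type before switching to the other (hypothetical code, not the actual planned change):

```java
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class SplitBatchSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "http://www.example.com/page.html");

        // Flush all pending adds first...
        UpdateRequest adds = new UpdateRequest();
        adds.add(doc);
        adds.process(server);

        // ...then send the deletes in their own request, keeping the overall
        // add/delete order intact while avoiding a mixed XML request.
        UpdateRequest deletes = new UpdateRequest();
        deletes.deleteById("http://www.example.com/old-page.html");
        deletes.process(server);

        server.commit();
    }
}
```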

You can patch the Solr Committer code to fix it yourself until an official release comes out.

csaezl commented 9 years ago

Thank you for the information and your support. Concerning version 5: even before installing it, I suppose the client and server never matched, since I was using version 4.10.3.

Thanks again,
Carlos

essiembre commented 9 years ago

The 2.0.1-SNAPSHOT version of the Solr Committer was successfully tested with Solr 3.x, 4.x, and 5.0. You can get it here.

essiembre commented 9 years ago

Please follow progress here: https://github.com/Norconex/committer-solr/issues/1