bejean / crawl-anywhere

Crawl-Anywhere - Web Crawler and document processing pipeline with Solr integration.
www.crawl-anywhere.com
Apache License 2.0

Solr is not updated via indexer #85

Closed: pixel-paul closed this issue 8 years ago

pixel-paul commented 9 years ago

I have tested both the download archive and the VM appliance, and in neither case does the indexer update the Solr index.

Reference: https://groups.google.com/forum/#!topic/crawl-anywhere/95ZC3CLYv8U

FireLizard commented 9 years ago

Same problem here -.-

bejean commented 9 years ago

Are there any error details in log/indexer.output or log/indexer.log?

FireLizard commented 9 years ago

Logs

log/indexer.log

Wed Jun 10 12:28:55 CEST 2015 - Loop :
    time (sec)                  = 0
    doc                         = 1
    time per doc (ms)           = 1
    docs per minute             = 60000
    memory (free / max / total) = 469325376 / 515375104 / 515375104

log/indexer.output

processing : /opt/crawler/cores/tmgs_sachsen_angebote/indexer_queue/j.1433925918938-24a29f25-6f54-4f8b-b671-8788907e23d4.xml
Config file = /opt/crawler/cores/tmgs/config/indexer/indexer.xml
log4j:WARN No appenders could be found for logger (org.apache.solr.client.solrj.impl.HttpClientUtil).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Steps to reproduce

...
POSTing file j.1433925914026-99b587c4-b647-4a2e-bea4-3ff9d0ef64fa.xml (application/xml) to [base]
SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url: http://localhost:8983/solr/techproducts/update
SimplePostTool: WARNING: Response: <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">400</int><int name="QTime">49</int></lst><lst name="error"><str name="msg">Unexpected &lt;doc&gt; tag without an &lt;add&gt; tag surrounding it.</str><int name="code">400</int></lst>
</response>
SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 400 for URL: http://localhost:8983/solr/techproducts/update
...

Different XML schema

XML files from Crawl-Anywhere (CA) look like

<doc action="add" target_type="solr" target_url="...">
    <id>...</id>
    ...
</doc>

XML files from the Solr 5.2.0 example (example/exampledocs/hd.xml) look like

<add>
    <doc>
        <field name="id">...</field>
        <field name="...">...</field>
        ...
    </doc>
</add>

Are you planning an upgrade to Solr 5+?
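
To make the structural difference between the two formats concrete, here is a minimal Python sketch that rewraps a CA-style <doc> into the <add><doc><field name="..."> layout that the Solr update handler expects. It assumes each child element of the CA <doc> is a flat field whose tag is the field name; the sample URL and the id/title values are made up for illustration. This is not how the Crawl-Anywhere indexer itself works (its output above shows it going through SolrJ), so treat it only as a way to compare the two layouts.

```python
import xml.etree.ElementTree as ET


def ca_doc_to_solr_add(ca_xml: str) -> str:
    """Rewrap a CA-style <doc> as <add><doc><field name="...">...</field></doc></add>.

    Assumption: every child of the CA <doc> is a flat field whose tag is
    the field name and whose text is the value.
    """
    src = ET.fromstring(ca_xml)
    if src.tag != "doc":
        raise ValueError(f"expected <doc> root, got <{src.tag}>")

    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for child in src:
        # The CA element name becomes the Solr field's name attribute.
        field = ET.SubElement(doc, "field", {"name": child.tag})
        field.text = child.text or ""
    return ET.tostring(add, encoding="unicode")


# Hypothetical sample; the URL and field values are placeholders.
sample = (
    '<doc action="add" target_type="solr" '
    'target_url="http://localhost:8983/solr/example">'
    "<id>doc-1</id><title>Hello</title></doc>"
)
print(ca_doc_to_solr_add(sample))
```

The CA-specific attributes (action, target_type, target_url) are dropped here because the Solr <doc> element in the example file does not carry them.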

pixel-paul commented 9 years ago

I could not determine much beyond the XML format being different. The VM appliance encounters the same issue.

pixel-paul commented 9 years ago

Is this project currently being shelved?

bejean commented 9 years ago

Hi,

We had a lot of other projects during the last few months.

Concerning your issue, did you use the solrconfig.xml and schema.xml files provided for Solr 4.10.4? I think you will need to make some changes in order to make them work with Solr 5.x.

Regards

Dominique

pixel-paul commented 9 years ago

Hi Dominique,

Totally understand, me too!

The problem that I have occurs with Solr 4.10.4 - the issue also occurs in the virtual appliance.

Thanks,

Paul

pixel-paul commented 8 years ago

From what I have found, the issue was due to Tomcat being installed and running - disabling and removing Tomcat allowed Solr to be started correctly.
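
For anyone hitting something similar: the thread does not say why a running Tomcat interfered with Solr. One possibility (an assumption, not confirmed above) is that both were competing for the same port. A rough Python sketch for checking whether anything is already listening on a given local port, using 8080 (Tomcat's default HTTP port) and 8983 (the Solr port seen in the log above) as examples:

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


# The ports below are defaults; your installation may use different ones.
for name, port in (("Tomcat default", 8080), ("Solr default", 8983)):
    state = "in use" if port_in_use(port) else "free"
    print(f"{name} port {port}: {state}")
```

This only tells you that a port is taken, not which process holds it, so it is a first check rather than a diagnosis.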