Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Solr committer not working. #404

Closed dhock194 closed 6 years ago

dhock194 commented 6 years ago

I am trying to set up the Norconex collector to push to Solr. I have it connected to and populating Elasticsearch correctly, but I really need Solr working. The crawl completes, but then errors out on one of several issues when trying to commit to Solr. I have tried several example configs from your GitHub issues pages, but it is still not working.

The last config I tried was the following:

        <!-- Be as nice as you can to sites you crawl. -->
        <delay default="1500" />

        <!-- At a minimum make sure you stay on your domain. -->
        <httpURLFilters>
            <filter class="$urlFilter"
                onMatch="include">http://www.aerospike.com/docs/.*</filter>
            <filter class="$urlFilter"
                onMatch="exclude">.+\.(png|jpg|jpeg|gif|ico|css|js)$</filter>
            <filter class="$urlFilter"
                onMatch="exclude">.+\?.*</filter>
        </httpURLFilters>
        <importer>
            <postParseHandlers>
                <!-- Unless you configured Solr to accept ANY fields, it will fail
                     when you try to add documents.  This "KeepOnlyTagger" ensures
                     to drop every field crawled except those you want. -->
                <tagger class="com.norconex.importer.tagger.impl.KeepOnlyTagger"
                    fields="document.reference,title" />
                <!-- The importer has a lot of config options where you can define
                     constants, rename fields, manipulate your content, etc. -->
            </postParseHandlers>
        </importer>
        <!-- A "committer" dictates where the crawled content goes. -->
        <committer class="com.norconex.committer.solr.SolrCommitter">
            <solrURL>http://localhost:8983/solr/stage_posts</solrURL>
        </committer>

        <!-- When developing or troubleshooting, you can use the filesystem
             committer so you can have a precise look at the content
             that would be sent to Solr. -->
        <!--
        <committer class="com.norconex.committer.impl.FileSystemCommitter">
            <directory>./examples-output/minimum/crawledFiles</directory>
        </committer>
        -->

    </crawler>
</crawlers>

The error this one generates is: Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/stage_posts: [doc=http://www.aerospike.com/docs/client/perl/install/] missing required field: PI

Is there a known working committer config for Solr that I can use? So far, all of the example committer configs I have found generate errors and do not complete the commit.

dgomesbr commented 6 years ago

You've probably declared PI as unique in your Solr schema (so it's mandatory). Make sure it is always filled, otherwise the commit will fail. This looks like a Solr usage issue rather than a Norconex one.
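
One crawler-side way to make sure it is always filled is to add a ConstantTagger after your KeepOnlyTagger (order matters, so the constant is not stripped out). This is only a sketch: the PI field name comes from your error message, the value is a placeholder, and the exact tagger package may differ between Importer versions.

        <postParseHandlers>
            <!-- ... your existing KeepOnlyTagger ... -->
            <!-- Placeholder: give every document a PI value so Solr's
                 required-field check passes.  Adjust to what your schema expects. -->
            <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
                <constant name="PI">placeholder-value</constant>
            </tagger>
        </postParseHandlers>

Alternatively, if PI is not actually needed, removing required="true" from its <field> declaration in your Solr schema (or not using it as the uniqueKey) would also stop this error.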

Can you log the output with <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger" logContent="true"/> so you can see if something is missing?
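
For example, it can sit in the same postParseHandlers block, after the taggers whose output you want to inspect (a sketch, reusing the class paths from above):

        <postParseHandlers>
            <!-- ... existing KeepOnlyTagger ... -->
            <!-- Logs each document's fields (and content) as they would be sent
                 to the committer, so you can check whether PI is present. -->
            <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
                logContent="true" />
        </postParseHandlers>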

You could also output to a folder and analyze the data there to see why it's failing.
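
That is what the FileSystemCommitter already commented out in your config does; temporarily swapping it in for the SolrCommitter lets you inspect exactly what would have been sent:

        <!-- Dumps what would be sent to Solr as local files you can inspect. -->
        <committer class="com.norconex.committer.impl.FileSystemCommitter">
            <directory>./examples-output/minimum/crawledFiles</directory>
        </committer>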

essiembre commented 6 years ago

Due to lack of feedback and because, as @dgomesbr pointed out, this is rather a Solr configuration issue, I am closing. Please open a new ticket under the Solr Committer project if you suspect a problem with it.