Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Importer Handlers ignored #338

Closed: BorisGuenther closed this issue 7 years ago

BorisGuenther commented 7 years ago

I'd like to set up a crawler to feed my Solr instances.

This is my setup:

Configuration

Solr

https://github.com/Norconex/committer-solr/tree/master/norconex-committer-solr/src/test/java/com/norconex/committer/solr and Additional dynamic field:

  <dynamicField name="*" type="string"  indexed="true"  stored="true" multiValued="true"/>

Norconex

<httpcollector id="master">

    <logsDir>${basedir}/logs</logsDir>

    <crawlers>
        <crawler id="${crawler_id}">
            <startURLs>
                <url>${url_site}</url>
            </startURLs>

            <maxDepth>2</maxDepth>
            <delay default="500" />
            <robotsTxt ignore="true"/>
            <sitemapResolverFactory ignore="true" />
            <numThreads>5</numThreads>
            <maxDocuments>2</maxDocuments>

            <referenceFilters>
                <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" caseSensitive="false">${url_site}.*</filter>
                <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude" caseSensitive="false">.+\.(png|jpg|jpeg|gif|ico|css|js)$</filter>
                <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" caseSensitive="false">${url_fileadmin}.*</filter>
                <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude" caseSensitive="false">.+\?.*</filter>
            </referenceFilters>

            <importer>
                <preParseHandlers>

                    <transformer class="com.norconex.importer.handler.transformer.impl.StripBeforeTransformer" inclusive="true">
                        <stripBeforeRegex><![CDATA[TYPO3SEARCH_begin]]></stripBeforeRegex>
                    </transformer>

                    <transformer class="com.norconex.importer.handler.transformer.impl.StripAfterTransformer" inclusive="true">
                        <stripAfterRegex><![CDATA[TYPO3SEARCH_end]]></stripAfterRegex>
                    </transformer>

                    <tagger class="com.norconex.importer.handler.tagger.impl.DeleteTagger">
                        <fieldsRegex>^[Xx]-.*</fieldsRegex>
                    </tagger>

                </preParseHandlers>

                <postParseHandlers>
                    <transformer class="com.norconex.importer.handler.transformer.impl.StripBeforeTransformer" inclusive="true">
                        <stripBeforeRegex><![CDATA[TYPO3SEARCH_begin]]></stripBeforeRegex>
                    </transformer>

                    <transformer class="com.norconex.importer.handler.transformer.impl.StripAfterTransformer" inclusive="true">
                        <stripAfterRegex><![CDATA[TYPO3SEARCH_end]]></stripAfterRegex>
                    </transformer>

                    <tagger class="com.norconex.importer.handler.tagger.impl.DeleteTagger">
                        <fieldsRegex>^[Xx]-.*</fieldsRegex>
                    </tagger>

                </postParseHandlers>
            </importer>                                                                  

            <committer class="com.norconex.committer.solr.SolrCommitter">
                <solrURL>http://localhost:8983/solr/${solr_core}</solrURL>
            </committer>

        </crawler>
    </crawlers>
</httpcollector>

Issues:

First:

The importer settings are completely ignored. If I set up an unknown tagger class it throws an error, so I guess it reads the config but does not really process it.

Second:

The import into Solr fails with the message below.

com.norconex.committer.core.CommitterException: Cannot index document batch to Solr.
    at com.norconex.committer.solr.SolrCommitter.commitBatch(SolrCommitter.java:253)
    at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
    at com.norconex.committer.core.AbstractBatchCommitter.cacheOperationAndCommitIfReady(AbstractBatchCommitter.java:208)
    at com.norconex.committer.core.AbstractBatchCommitter.commitAddition(AbstractBatchCommitter.java:143)
    at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:222)
    at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:266)
    at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:227)
    at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:188)
    at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
    at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:349)
    at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:300)
    at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:172)
    at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:123)
    at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:80)
    at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/norconex: Exception writing document id http://domain-to-be-indexed.demo/ to the index; possible analysis error: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[10, 10, 10, 9, 10, 9, 9, 10, 9, 9, 10, 9, 10, 9, 9, 70, 73, 83, 67, 72, 69, 82, 32, 119, 101, 108, 116, 119, 101, 105]...', original message: bytes can be at most 32766 in length; got 39118. Perhaps the document has an indexed string field (solr.StrField) which is too large
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:590)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:259)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:173)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:138)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:152)
    at com.norconex.committer.solr.SolrCommitter.commitBatch(SolrCommitter.java:233)
    ... 14 more

Thank you in advance for your help.

BR Boris

essiembre commented 7 years ago

First: I tested your config and the importer section was invoked as it should. If you expect results that are not happening, perhaps your regular expressions are not matching as intended? Do you have a URL you can share with a specific use case to reproduce?

Second: Your error comes from Solr. You are submitting a value that is too long for a Solr field of type "string". Making it a text field instead may resolve the issue.
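[Editor's note: to illustrate the suggestion above, here is a hedged sketch of the schema change. A `string` field is indexed as a single term, which hits Lucene's 32766-byte term limit for large documents; an analyzed text type tokenizes the value instead. The type name `text_general` is an assumption taken from Solr's default schema; adjust it to whatever analyzed type your schema defines.]

```xml
<!-- Sketch: swap the catch-all string dynamicField for an analyzed text
     type so large values are tokenized rather than indexed as one
     immense term. "text_general" is assumed from Solr's default schema. -->
<dynamicField name="*" type="text_general" indexed="true" stored="true" multiValued="true"/>
```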

BorisGuenther commented 7 years ago

Good morning Pascal,

thank you for your fast reply. One night shift later, I have solved about 95% of the problems.

Second:

I could reduce the error by stripping my content. But I'll also try your solution, as it seems to solve the problem more cleanly.

First:

It was a misunderstanding on my part. The problem is that I wanted to strip content based on HTML comments. If I do it pre-parsing, the meta tags are also gone. If I do it post-parsing, the comments I wanted to rely on are gone, since the content has been converted to plain text.

I fixed it with a two-step solution:

  1. Convert the comments to plain text so they survive parsing
  2. Strip around the plain-text markers

Maybe you can give me some feedback on my solution?

(Configuration snippet lost in formatting; only the CDATA remnants and the `TYPO3SEARCH_begin` / `TYPO3SEARCH_end` marker text survived.)
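[Editor's note: the posted snippet is garbled, but the two steps described above can be sketched roughly as follows. This is a reconstruction, not the original: `ReplaceTransformer` is part of the Norconex Importer, but the exact replacement values are assumptions based on the description.]

```xml
<importer>
  <preParseHandlers>
    <!-- Step 1 (assumed): turn the HTML comment markers into plain text
         so they survive the HTML-to-text parsing step. -->
    <transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
      <replace>
        <fromValue><![CDATA[<!--TYPO3SEARCH_begin-->]]></fromValue>
        <toValue>TYPO3SEARCH_begin</toValue>
      </replace>
      <replace>
        <fromValue><![CDATA[<!--TYPO3SEARCH_end-->]]></fromValue>
        <toValue>TYPO3SEARCH_end</toValue>
      </replace>
    </transformer>
  </preParseHandlers>
  <postParseHandlers>
    <!-- Step 2: strip everything outside the now-plain-text markers. -->
    <transformer class="com.norconex.importer.handler.transformer.impl.StripBeforeTransformer" inclusive="true">
      <stripBeforeRegex><![CDATA[TYPO3SEARCH_begin]]></stripBeforeRegex>
    </transformer>
    <transformer class="com.norconex.importer.handler.transformer.impl.StripAfterTransformer" inclusive="true">
      <stripAfterRegex><![CDATA[TYPO3SEARCH_end]]></stripAfterRegex>
    </transformer>
  </postParseHandlers>
</importer>
```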
essiembre commented 7 years ago

If only sleeping could resolve all problems! :-)

What if you stripped what is before your opening marker but kept the header? Have you tried something like this:

<preParseHandlers>
    <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true">
      <stripBetween>
          <start><![CDATA[<body ]]></start>
          <end><![CDATA[<!--TYPO3SEARCH_begin-->]]></end>
      </stripBetween>
    </transformer>
    <transformer class="com.norconex.importer.handler.transformer.impl.StripAfterTransformer" inclusive="true">
        <stripAfterRegex><![CDATA[TYPO3SEARCH_end]]></stripAfterRegex>
    </transformer>
</preParseHandlers>
essiembre commented 7 years ago

Closing for lack of feedback.