dkh7m closed this issue 7 years ago
Looks like you may have discovered a bug when the importer is defined in the <crawlerDefaults> section. Until it is fixed, the workaround is easy: move all your configuration settings found within <crawlerDefaults> to your <crawler id="HIT Crawler">. Since you only have 1 crawler defined, there is no benefit to using defaults anyway.
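A rough sketch of that workaround, with element contents abbreviated as comments (this shows the structure only and is not a complete configuration):

```xml
<httpcollector id="HIT Collector">
  <!-- Before: the importer sat under crawlerDefaults, where the bug
       appears to prevent it from being picked up:

       <crawlerDefaults>
         <importer> (handlers) </importer>
       </crawlerDefaults>
  -->
  <crawlers>
    <crawler id="HIT Crawler">
      <!-- After: the same importer section placed directly in the crawler -->
      <importer>
        <preParseHandlers>
          <!-- filters and taggers go here, unchanged -->
        </preParseHandlers>
      </importer>
    </crawler>
  </crawlers>
</httpcollector>
```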
I've determined that the date field that is being pulled in is not coming from the response header. I set up my committer to log all fields that are being indexed. There is a "Date" field that matches what I'm seeing in the logs for each page that's indexed. So I put the following code in my XML, trying to delete the Date field and replace it with a custom meta tag field I created called solr_date.
<importer>
  <preParseHandlers>
    <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter" onMatch="include" field="document.contentType">
      <regex>.*text/html.*</regex>
    </filter>
    <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter" onMatch="include" field="document.contentType">
      <regex>.*application/pdf.*</regex>
    </filter>
    <tagger class="com.norconex.importer.handler.tagger.impl.DeleteTagger">
      <fields>Date</fields>
    </tagger>
    <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
      <rename fromField="solr_date" toField="Date" overwrite="true" />
    </tagger>
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
      <fields>title,portal_type,review_state,dockeywords,Description,document.reference,path_string,path_parents,Date</fields>
    </tagger>
  </preParseHandlers>
  <postParseHandlers/>
</importer>
But when I run this code, I'm still getting the invalid date error from the Solr committer. It doesn't appear that the Date field is being deleted or overwritten by my new tag.
I also confirmed that this is not happening on all of the crawled pages. Some of them are being committed to Solr. But it's only about 1/3 of them.
Have you moved your config section out of the <crawlerDefaults> like suggested?
Yes. I moved the code into the
<httpcollector id="HIT Collector">
  <progressDir>./crawler-output/uvahealthPlone/progress</progressDir>
  <logsDir>./crawler-output/uvahealthPlone/logs</logsDir>
  <crawlers>
    <crawler id="HIT Crawler">
      <startURLs
          stayOnDomain="true"
          stayOnPort="true"
          stayOnProtocol="true">
        <url>http://localhost:8448/</url>
      </startURLs>
      <delay
          class="com.norconex.collector.http.delay.impl.GenericDelayResolver"
          default="2 seconds"
          ignoreRobotsCrawlDelay="true"
          scope="site">
        <schedule dayOfWeek="from Saturday to Sunday">1 second</schedule>
      </delay>
      <numThreads>2</numThreads>
      <maxDepth>10</maxDepth>
      <maxDocuments>-1</maxDocuments>
      <workDir>./crawler-output/uvahealthPlone</workDir>
      <keepDownloads>false</keepDownloads>
      <orphansStrategy>DELETE</orphansStrategy>
      <crawlerListeners>
        <listener
            class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
          <statusCodes>404</statusCodes>
          <outputDir>./crawler-output/uvahealthPlone</outputDir>
          <fileNamePrefix>brokenLinks</fileNamePrefix>
        </listener>
      </crawlerListeners>
      <referenceFilters>
        <filter
            onMatch="exclude"
            class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter">
          js,jpg,gif,png,svg,ico,css
        </filter>
      </referenceFilters>
      <robotsTxt
          ignore="true" />
      <!-- Sitemap since 2.3.0: -->
      <sitemapResolverFactory
          ignore="true" />
      <metadataFetcher
          class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher" />
      <documentFetcher
          class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher"
          detectContentType="true"
          detectCharset="true">
        <validStatusCodes>200</validStatusCodes>
        <notFoundStatusCodes>404</notFoundStatusCodes>
      </documentFetcher>
      <importer>
        <preParseHandlers>
          <filter
              class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
              onMatch="include"
              field="document.contentType">
            <regex>.*text/html.*</regex>
          </filter>
          <filter
              class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
              onMatch="include"
              field="document.contentType">
            <regex>.*application/pdf.*</regex>
          </filter>
          <tagger
              class="com.norconex.importer.handler.tagger.impl.DeleteTagger">
            <fields>Date</fields>
          </tagger>
          <tagger
              class="com.norconex.importer.handler.tagger.impl.RenameTagger">
            <rename fromField="solr_date" toField="Date" overwrite="true" />
          </tagger>
          <tagger
              class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,portal_type,review_state,dockeywords,Description,document.reference,path_string,path_parents,Date</fields>
          </tagger>
        </preParseHandlers>
      </importer>
      <committer class="com.norconex.committer.core.impl.MultiCommitter">
        <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
          <directory>./crawler-output/uvahealthPlone/crawledFiles</directory>
        </committer>
        <committer class="com.norconex.committer.solr.SolrCommitter">
          <solrURL>http://localhost:8983/solr/uvahealthPlone/</solrURL>
          <solrUpdateURLParams>
            <param name="update"></param>
            <param name="commit">true</param>
          </solrUpdateURLParams>
          <queueSize>1</queueSize>
          <commitBatchSize>1</commitBatchSize>
        </committer>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>
Quickly looking at it, I suspect your issue may be tied to using <preParseHandlers> instead of <postParseHandlers>. See, parsing the document often extracts several fields, so a field deleted before parsing can be added again by the parser. So I would move your last 3 taggers to a <postParseHandlers> section and see what it gives.
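For illustration, a sketch of that change using the same handlers as in the configuration above. It is untested and assumes the content-type filters can stay in the pre-parse stage, since they only rely on metadata known before parsing. If solr_date comes from an HTML meta tag, it would presumably also only exist after parsing, which is another reason to rename it post-parse.

```xml
<importer>
  <preParseHandlers>
    <!-- Content-type filters stay here: they do not depend on parsed fields. -->
    <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
        onMatch="include" field="document.contentType">
      <regex>.*text/html.*</regex>
    </filter>
    <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
        onMatch="include" field="document.contentType">
      <regex>.*application/pdf.*</regex>
    </filter>
  </preParseHandlers>
  <postParseHandlers>
    <!-- These now run after parsing, so any Date field added by the
         parser already exists when it is deleted and replaced. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.DeleteTagger">
      <fields>Date</fields>
    </tagger>
    <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
      <rename fromField="solr_date" toField="Date" overwrite="true" />
    </tagger>
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
      <fields>title,portal_type,review_state,dockeywords,Description,document.reference,path_string,path_parents,Date</fields>
    </tagger>
  </postParseHandlers>
</importer>
```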
FYI, there is a new version (2.7.1) of the HTTP Collector that fixes the <importer> found in <crawlerDefaults> section not being inherited by <crawler>s. I still recommend you only use defaults when you need them since it is easy to get mixed up about what will be overwritten and what not.
@dkh7m, is this issue resolved for you?
Sorry for the delay. I was not able to determine the cause of the date errors. Even after reconfiguring the code they still showed up. I ended up using a data connector in Solr to index this particular site directly from the MSSQL database. But I still have other sites to index, so if the problem crops up on one of those I'll take another look.
I'm getting the following error when I run my committer code:
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/uvahealthPlone: Invalid Date String:'Wed, 24 May 2017 17:09:30 GMT'
I tried using the DateFormatTagger to convert the Response Header date to UTC (based on the example in the API), but it's not working. Granted, I'm only assuming the date is coming from the Response Header, b/c that's where I'm seeing that particular date format in my browser debugger.
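For reference, a DateFormatTagger entry targeting the date format in the error above might look roughly like the sketch below. It is an assumption-laden illustration, not a verified fix: the field name Date is assumed, the attribute names follow the Importer 2.x documentation as best recalled and should be checked against the installed version, and the literal 'Z' in the target format is only correct if the tagger writes dates in UTC.

```xml
<!-- Hypothetical sketch: converts an HTTP-style date such as
     "Wed, 24 May 2017 17:09:30 GMT" into the ISO 8601 form Solr
     date fields expect (e.g. 2017-05-24T17:09:30Z).
     Field name "Date" and attribute names are assumptions to verify. -->
<tagger class="com.norconex.importer.handler.tagger.impl.DateFormatTagger"
    fromField="Date"
    toField="Date"
    fromFormat="EEE, dd MMM yyyy HH:mm:ss zzz"
    toFormat="yyyy-MM-dd'T'HH:mm:ss'Z'"
    overwrite="true" />
```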
Here is my XML code: