Norconex / committer-solr

Solr implementation of Norconex Committer. It should also work with any Solr-based product, such as LucidWorks.
https://opensource.norconex.com/committers/solr/
Apache License 2.0

Invalid Date String #11

Closed dkh7m closed 7 years ago

dkh7m commented 7 years ago

I'm getting the following error when I run my committer code:

Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/uvahealthPlone: Invalid Date String:'Wed, 24 May 2017 17:09:30 GMT'

I tried using the DateFormatTagger to convert the response header date to UTC (based on the example in the API docs), but it's not working. Granted, I'm only assuming the date is coming from the response header, because that's where I'm seeing that particular date format in my browser debugger.
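For reference, Solr rejects that value because it expects ISO-8601 UTC dates, while `Wed, 24 May 2017 17:09:30 GMT` is the RFC 1123 format used by HTTP headers. A minimal sketch of the conversion the DateFormatTagger config above is being asked to perform, using only standard `java.time` (the class name here is illustrative, not a Norconex API):

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class DateCheck {
    public static void main(String[] args) {
        // Same fromFormat/toFormat patterns as in the DateFormatTagger config above.
        DateTimeFormatter from = DateTimeFormatter.ofPattern(
                "EEE, dd MMM yyyy HH:mm:ss zzz", Locale.ENGLISH);
        DateTimeFormatter to = DateTimeFormatter.ofPattern(
                "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");

        // Parse the RFC 1123 header value, normalize to UTC, re-format for Solr.
        ZonedDateTime parsed = ZonedDateTime.parse(
                "Wed, 24 May 2017 17:09:30 GMT", from);
        String solrDate = to.format(parsed.withZoneSameInstant(ZoneOffset.UTC));

        System.out.println(solrDate); // prints 2017-05-24T17:09:30.000Z
    }
}
```

If this round-trip succeeds, the patterns themselves are fine and the problem is more likely which field the tagger is actually reading.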

Here is my XML code:

<httpcollector id="HIT Collector">

    <progressDir>./crawler-output/uvahealthPlone/progress</progressDir>
    <logsDir>./crawler-output/uvahealthPlone/logs</logsDir>

    <crawlerDefaults>
        <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
            <url>http://localhost:8448/</url>
        </startURLs>
        <delay class="com.norconex.collector.http.delay.impl.GenericDelayResolver" default="5 seconds" ignoreRobotsCrawlDelay="true" scope="site" >
            <schedule dayOfWeek="from Saturday to Sunday">1 second</schedule>
        </delay>
        <numThreads>2</numThreads>
        <maxDepth>10</maxDepth>
        <maxDocuments>-1</maxDocuments>
        <workDir>./crawler-output/uvahealthPlone</workDir>
        <keepDownloads>false</keepDownloads>
        <orphansStrategy>DELETE</orphansStrategy>
        <crawlerListeners>
            <listener  
              class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
                <statusCodes>404</statusCodes>
                <outputDir>./crawler-output/uvahealthPlone</outputDir>
                <fileNamePrefix>brokenLinks</fileNamePrefix>
            </listener>
        </crawlerListeners>
        <referenceFilters>
            <filter onMatch="exclude" class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter">
            js,jpg,gif,png,svg,ico,css
            </filter>
        </referenceFilters>
        <robotsTxt ignore="true" />
        <!-- Sitemap since 2.3.0: -->
        <sitemapResolverFactory ignore="true" />
        <metadataFetcher class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher" />
        <documentFetcher  
          class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher"
          detectContentType="true" detectCharset="true">
            <validStatusCodes>200</validStatusCodes>
            <notFoundStatusCodes>404</notFoundStatusCodes>
        </documentFetcher>

        <importer>
            <preParseHandlers>
                <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter" onMatch="include" field="document.contentType">
                    <regex>.*text/html.*</regex>
                </filter>
                <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter" onMatch="include" field="document.contentType">
                    <regex>.*application/pdf.*</regex>
                </filter>
                <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
                    <fields>date,title,portal_type,review_state,dockeywords,Description,document.reference,path_string,path_parents</fields>
                </tagger> 
                <tagger class="com.norconex.importer.handler.tagger.impl.DateFormatTagger"
                      fromField="Date"
                      toField="solr_date" 
                      toFormat="yyyy-MM-dd'T'HH:mm:ss.SSS'Z'" >
                    <fromFormat>EEE, dd MMM yyyy HH:mm:ss zzz</fromFormat>
                    <fromFormat>EPOCH</fromFormat>
                </tagger>
            </preParseHandlers>   
        </importer>

        <committer class="com.norconex.committer.core.impl.MultiCommitter">
            <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
                <directory>./crawler-output/uvahealthPlone/crawledFiles</directory>
            </committer>
            <committer class="com.norconex.committer.solr.SolrCommitter">
                <solrURL>http://localhost:8983/solr/uvahealthPlone/</solrURL>
                <solrUpdateURLParams>
                    <param name="update"></param>
                    <param name="commit">true</param>
                </solrUpdateURLParams>
                <queueSize>1</queueSize>
                <commitBatchSize>1</commitBatchSize>
            </committer>
        </committer>
    </crawlerDefaults>

    <crawlers>
        <crawler id="HIT Crawler">
        </crawler>
    </crawlers>

</httpcollector>
essiembre commented 7 years ago

Looks like you may have discovered a bug that occurs when the importer is defined in the <crawlerDefaults> section. Until it is fixed, the workaround is easy: move all the configuration settings found within <crawlerDefaults> into your <crawler id="HIT Crawler"> section. Since you only have one crawler defined, there is no benefit to using defaults anyway.

dkh7m commented 7 years ago

I've determined that the date field being pulled in is not coming from the response header. I set up my committer to log all fields being indexed, and there is a "Date" field that matches what I'm seeing in the logs for each indexed page. So I put the following code in my XML to delete the Date field and replace it with a custom meta tag field I created called solr_date:

<importer>
    <preParseHandlers>
        <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter" onMatch="include" field="document.contentType">
            <regex>.*text/html.*</regex>
        </filter>
        <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter" onMatch="include" field="document.contentType">
            <regex>.*application/pdf.*</regex>
        </filter>
        <tagger class="com.norconex.importer.handler.tagger.impl.DeleteTagger">
            <fields>Date</fields>
        </tagger>
        <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
            <rename fromField="solr_date" toField="Date" overwrite="true" />
        </tagger>
        <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,portal_type,review_state,dockeywords,Description,document.reference,path_string,path_parents,Date</fields>
        </tagger>
    </preParseHandlers>
    <postParseHandlers/>
</importer>

But when I run this code, I'm still getting the invalid date error from the Solr committer. It doesn't appear that the Date field is being deleted or overwritten by my new tag.

I also confirmed that this is not happening on all of the crawled pages. Some of them are being committed to Solr. But it's only about 1/3 of them.

essiembre commented 7 years ago

Have you moved your config section out of <crawlerDefaults> as suggested?

dkh7m commented 7 years ago

Yes, I moved the settings into the <crawler> section as suggested. Reposting my full XML code for reference:

<httpcollector id="HIT Collector">

    <progressDir>./crawler-output/uvahealthPlone/progress</progressDir>
    <logsDir>./crawler-output/uvahealthPlone/logs</logsDir>

    <crawlers>
        <crawler id="HIT Crawler">
            <startURLs 
                stayOnDomain="true" 
                stayOnPort="true" 
                stayOnProtocol="true">
                <url>http://localhost:8448/</url>
            </startURLs>
            <delay 
                class="com.norconex.collector.http.delay.impl.GenericDelayResolver" 
                default="2 seconds" 
                ignoreRobotsCrawlDelay="true" 
                scope="site">
                <schedule dayOfWeek="from Saturday to Sunday">1 second</schedule>
            </delay>
            <numThreads>2</numThreads>
            <maxDepth>10</maxDepth>
            <maxDocuments>-1</maxDocuments>
            <workDir>./crawler-output/uvahealthPlone</workDir>
            <keepDownloads>false</keepDownloads>
            <orphansStrategy>DELETE</orphansStrategy>
            <crawlerListeners>
                <listener 
                    class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
                    <statusCodes>404</statusCodes>
                    <outputDir>./crawler-output/uvahealthPlone</outputDir>
                    <fileNamePrefix>brokenLinks</fileNamePrefix>
                </listener>
            </crawlerListeners>
            <referenceFilters>
                <filter 
                    onMatch="exclude" 
                    class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter">
                    js,jpg,gif,png,svg,ico,css
                </filter>
            </referenceFilters>
            <robotsTxt 
                ignore="true" />
            <!-- Sitemap since 2.3.0: -->
            <sitemapResolverFactory 
                ignore="true" />
            <metadataFetcher 
                class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher" />
            <documentFetcher  
                class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher"
                detectContentType="true" 
                detectCharset="true">
                <validStatusCodes>200</validStatusCodes>
                <notFoundStatusCodes>404</notFoundStatusCodes>
            </documentFetcher>

            <importer>
                <preParseHandlers>
                    <filter 
                        class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter" 
                        onMatch="include" 
                        field="document.contentType">
                        <regex>.*text/html.*</regex>
                    </filter>
                    <filter 
                        class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter" 
                        onMatch="include" 
                        field="document.contentType">
                        <regex>.*application/pdf.*</regex>
                    </filter>
                    <tagger 
                        class="com.norconex.importer.handler.tagger.impl.DeleteTagger">
                        <fields>Date</fields>
                    </tagger>
                    <tagger 
                        class="com.norconex.importer.handler.tagger.impl.RenameTagger">
                        <rename fromField="solr_date" toField="Date" overwrite="true" />
                    </tagger>
                    <tagger 
                        class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
                        <fields>title,portal_type,review_state,dockeywords,Description,document.reference,path_string,path_parents,Date</fields>
                    </tagger> 
                </preParseHandlers>   
            </importer>

            <committer class="com.norconex.committer.core.impl.MultiCommitter">
                <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
                    <directory>./crawler-output/uvahealthPlone/crawledFiles</directory>
                </committer>
                <committer class="com.norconex.committer.solr.SolrCommitter">
                    <solrURL>http://localhost:8983/solr/uvahealthPlone/</solrURL>
                    <solrUpdateURLParams>
                        <param name="update"></param>
                        <param name="commit">true</param>
                    </solrUpdateURLParams>
                    <queueSize>1</queueSize>
                    <commitBatchSize>1</commitBatchSize>
                </committer>
            </committer>
        </crawler>
    </crawlers>

</httpcollector>
essiembre commented 7 years ago

Quickly looking at it, I suspect your issue may be tied to using <preParseHandlers> instead of <postParseHandlers>. Parsing the document often extracts several fields, so a field like "Date" can reappear after your pre-parse taggers have run. I would move your last three taggers to a <postParseHandlers> section and see what that gives.
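A sketch of that suggestion, using the same tagger classes and field lists already in the config above (the content-type filters stay pre-parse; only the three taggers move):

```xml
<importer>
    <preParseHandlers>
        <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
                onMatch="include" field="document.contentType">
            <regex>.*text/html.*</regex>
        </filter>
    </preParseHandlers>
    <postParseHandlers>
        <!-- Running post-parse removes the "Date" field even if parsing re-added it. -->
        <tagger class="com.norconex.importer.handler.tagger.impl.DeleteTagger">
            <fields>Date</fields>
        </tagger>
        <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
            <rename fromField="solr_date" toField="Date" overwrite="true" />
        </tagger>
        <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,portal_type,review_state,dockeywords,Description,document.reference,path_string,path_parents,Date</fields>
        </tagger>
    </postParseHandlers>
</importer>
```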

essiembre commented 7 years ago

FYI, there is a new version (2.7.1) of the HTTP Collector that fixes an <importer> defined in the <crawlerDefaults> section not being inherited by <crawler>s. I still recommend only using defaults when you need them, since it is easy to get mixed up about what will be overwritten and what will not.

essiembre commented 7 years ago

@dkh7m, Is this issue resolved for you?

dkh7m commented 7 years ago

Sorry for the delay. I was not able to determine the cause of the date errors; even after reconfiguring, they still showed up. I ended up using a data connector in Solr to index this particular site directly from the MSSQL database. But I still have other sites to index, so if the problem crops up on one of those I'll take another look.