Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Records not getting added to Azure index #649

Closed: davidfahy closed this issue 3 years ago

davidfahy commented 4 years ago

Hello,

We have a crawler running that pulls results into an Azure Search index. A number of items do not appear in the index even though the crawler log file indicates they were committed. I have added an example below.

I have also added the crawler config.

Note that other, similar entries (e.g. https://www.priv.gc.ca/en/opc-actions-and-decisions/investigations/investigations-into-businesses/2017/pipeda-2017-001/) have the same log file entries and do appear in the index.

Is there any reason why an entry would not appear in the index even though the log file says DOCUMENT_COMMITTED_ADD?

Log file entries

crawler2: 2019-11-21 09:24:49 INFO -    DOCUMENT_FETCHED: https://www.priv.gc.ca/en/opc-actions-and-decisions/investigations/investigations-into-businesses/2019/pipeda-2019-002/
crawler2: 2019-11-21 09:24:49 INFO -    CREATED_ROBOTS_META: https://www.priv.gc.ca/en/opc-actions-and-decisions/investigations/investigations-into-businesses/2019/pipeda-2019-002/
crawler2: 2019-11-21 09:24:50 INFO -    URLS_EXTRACTED: https://www.priv.gc.ca/en/opc-actions-and-decisions/investigations/investigations-into-businesses/2019/pipeda-2019-002/
crawler2: 2019-11-21 09:24:50 INFO -    DOCUMENT_IMPORTED: https://www.priv.gc.ca/en/opc-actions-and-decisions/investigations/investigations-into-businesses/2019/pipeda-2019-002/
crawler2: 2019-11-21 09:24:50 INFO -    DOCUMENT_COMMITTED_ADD: https://www.priv.gc.ca/en/opc-actions-and-decisions/investigations/investigations-into-businesses/2019/pipeda-2019-002/

Crawler Configuration

<crawler id="crawler2">
    <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://www.priv.gc.ca/en/opc-actions-and-decisions/investigations/investigations-into-businesses/</url>
    </startURLs>
    <userAgent>norconex-24jj0r0e0ss67kkv</userAgent>
    <workDir>./Osler-AccessPrivacy-output</workDir>
    <numThreads>50</numThreads>
    <maxDepth>50</maxDepth>
    <sitemapResolverFactory ignore="true"/>
    <delay default="5000"/>
    <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" caseSensitive="false">https://www\.priv\.gc\.ca/en/opc-actions-and-decisions/investigations/investigations-into-businesses/.*</filter>
    </referenceFilters>
    <documentFilters>
        <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="exclude">svg,png,css,js,jpg,mp3</filter>
    </documentFilters>
    <importer>
        <postParseHandlers>
            <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
                <fields>document.reference, description, title</fields>
            </tagger>
            <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
                <rename fromField="document.reference" toField="reference"/>
            </tagger>
            <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger" onConflict="replace">
                <constant name="ord_fed">1</constant>
                <constant name="ord_bc">0</constant>
                <constant name="ord_on">0</constant>
                <constant name="ord_alta">0</constant>
                <constant name="ord_other">0</constant>
                <constant name="cl_fed">0</constant>
                <constant name="cl_bc">0</constant>
                <constant name="cl_on">0</constant>
                <constant name="cl_alta">0</constant>
                <constant name="cl_other">0</constant>
                <constant name="gl_fed">0</constant>
                <constant name="gl_bc">0</constant>
                <constant name="gl_on">0</constant>
                <constant name="gl_alta">0</constant>
                <constant name="gl_other">0</constant>
            </tagger>
        </postParseHandlers>
    </importer>
    <committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
        <endpoint>[AZURE_SEARCH_URL]</endpoint>
        <apiKey>[API_KEY]</apiKey>
        <indexName>[INDEX_NAME]</indexName>
        <disableReferenceEncoding>false</disableReferenceEncoding>
        <ignoreValidationErrors>true</ignoreValidationErrors>
        <ignoreResponseErrors>true</ignoreResponseErrors>
        <sourceReferenceField keep="true">reference</sourceReferenceField>
        <targetContentField>content</targetContentField>
        <queueDir>./Osler-AccessPrivacy-output/committer-queue</queueDir>
        <maxRetries>10</maxRetries>
        <maxRetryWait>5000</maxRetryWait>
    </committer>
</crawler>
essiembre commented 4 years ago

You got this right: whenever you see DOCUMENT_COMMITTED_ADD, it means a document is set to be sent to Azure. It does not mean Azure will accept them all. Can you check your Azure logs for any hints? I also see in your configuration that you are ignoring response and validation errors. Maybe set those to false and see what errors you get.
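Following that suggestion, a minimal sketch of the change to the committer block from the config above (placeholders kept as in the original; other committer settings unchanged):

```xml
<committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
    <endpoint>[AZURE_SEARCH_URL]</endpoint>
    <apiKey>[API_KEY]</apiKey>
    <indexName>[INDEX_NAME]</indexName>
    <!-- Surface Azure Search rejections instead of silently dropping documents -->
    <ignoreValidationErrors>false</ignoreValidationErrors>
    <ignoreResponseErrors>false</ignoreResponseErrors>
</committer>
```

With these set to false, a document Azure refuses should cause the committer to log (and fail on) the Azure error response, which should reveal why some DOCUMENT_COMMITTED_ADD entries never make it into the index.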

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.