Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Cannot crawl all urls from a sitemap #758

Closed · peter-chan-hkmci closed this issue 3 years ago

peter-chan-hkmci commented 3 years ago

My client is using version 2.8.2-SNAPSHOT and found that some URLs were not updated in the search engine.

For example: https://store.acer.com/de-de/nitro-5-gaming-notebook-an517-51-schwarz-13

I checked that the crawler did not fetch this URL, even though the URL is included in the sitemap.
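One way to double-check that a URL really appears in a sitemap is to parse it and compare the `<loc>` entries. This is a hedged, illustrative sketch (not Norconex code): the sitemap content is inlined here for the sake of a self-contained example; in practice you would fetch the real sitemap file instead.

```python
# Illustrative sketch: collect every <loc> from a sitemap and test membership.
# The inlined XML below is a stand-in for the real fetched sitemap.
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://store.acer.com/de-de/nitro-5-gaming-notebook-an517-51-schwarz-13</loc>
  </url>
</urlset>
"""

def sitemap_locs(xml_text):
    """Return the set of page URLs listed in the sitemap's <loc> elements."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.iter(f"{{{SITEMAP_NS}}}loc")}

locs = sitemap_locs(sitemap_xml)
print("https://store.acer.com/de-de/nitro-5-gaming-notebook-an517-51-schwarz-13" in locs)
```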

My client does not want to change the crawler setup much.

Is there any workaround or hotfix for this version?

The config is below:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>

<httpcollector id="application-webcrawler-ec">
    <progressDir>./output/progress</progressDir>
    <logsDir>./output/logs</logsDir>

    <crawlers>

        <crawler id="webcrawler-ec_M2">

            <startURLs stayOnDomain="true" includeSubdomains="true" stayOnPort="true" stayOnProtocol="true">
                <sitemap>https://store.acer.com/sitemaps/DE/sitemap.xml</sitemap>
            </startURLs>

            <workDir>./output</workDir>
            <maxDepth>1</maxDepth>
            <userAgent>gsa-crawler</userAgent>
            <sitemapResolverFactory ignore="false" lenient="true" class="com.norconex.collector.http.sitemap.impl.StandardSitemapResolverFactory">
                <path>https://store.acer.com/</path>
            </sitemapResolverFactory>

            <numThreads>16</numThreads>
            <delay default="0" scope="thread" />

            <documentFilters>
                <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
          ^https://store\.acer\.com/[^/-]+-[^/-]+/.*
                </filter>

                <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
          ^https://store\.acer\.com/[^/-]+-[^/-]+/$
                </filter>
            </documentFilters>

            <referenceFilters>
                <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">^https://store\.acer\.com/[^/-]+-[^/-]+/.*</filter>
            </referenceFilters>

            <importer>
                <preParseHandlers>
                    <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true">
                        <restrictTo field="document.contentType">text/html</restrictTo>
                        <stripBetween>
                            <start><![CDATA[<!--googleoff: index-->]]></start>
                            <end><![CDATA[<!--googleon: index-->]]></end>
                        </stripBetween>
                    </transformer>
                </preParseHandlers>
                <postParseHandlers>
                    <filter class="com.norconex.importer.handler.filter.impl.EmptyMetadataFilter" onMatch="exclude" fields="productPN" />

                    <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger" onConflict="noop">
                        <constant name="collection">ec</constant>
                        <constant name="language"></constant>
                        <constant name="country"></constant>
                    </tagger>
                    <tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
                        <replace fromField="document.reference" toField="language" regex="true">
                            <fromValue><![CDATA[^https:\/\/store\.acer\.com\/([^/-]+)-([^/-]+)\/.*]]></fromValue>
                            <toValue>$1</toValue>
                        </replace>
                        <replace fromField="document.reference" toField="country" regex="true">
                            <fromValue><![CDATA[^https:\/\/store\.acer\.com\/([^/-]+)-([^/-]+)\/.*]]></fromValue>
                            <toValue>$2</toValue>
                        </replace>
                    </tagger>
                    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
                        <fields>document.reference,document.contentType,collection,language,country,title,description,keywords,robots,viewport,sectionName,productPN,price,sq,productGroup,quickSpecs,productImage</fields>
                    </tagger>
                </postParseHandlers>
            </importer>

            <committer class="com.norconex.committer.core.impl.JSONFileCommitter">
                <directory>./crawled</directory>
                <pretty>false</pretty>
                <docsPerFile>1000</docsPerFile>
                <compress>false</compress>
                <splitAddDelete>true</splitAddDelete>
            </committer>
        </crawler>
    </crawlers>

</httpcollector>
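The include/exclude reference filters in the config above can be sanity-checked offline against sample URLs. This is an illustrative sketch using Python's `re` module; Norconex's `RegexReferenceFilter` is Java-based, and here `re.fullmatch` is used to approximate Java's whole-string matching. The sample URLs other than the product page are made up for demonstration.

```python
# Sanity-check the reference-filter regexes from the crawler config.
import re

# onMatch="include": locale-prefixed URLs (e.g. de-de) with a path after the locale.
include = re.compile(r"^https://store\.acer\.com/[^/-]+-[^/-]+/.*")
# onMatch="exclude": the bare locale home page itself.
exclude = re.compile(r"^https://store\.acer\.com/[^/-]+-[^/-]+/$")

def accepted(url):
    """True if the URL passes the include filter and is not excluded."""
    return bool(include.fullmatch(url)) and not exclude.fullmatch(url)

# The product URL from the report passes the filters:
print(accepted("https://store.acer.com/de-de/nitro-5-gaming-notebook-an517-51-schwarz-13"))
# A bare locale home page is excluded; a foreign host never matches:
print(accepted("https://store.acer.com/de-de/"))
print(accepted("https://example.com/de-de/page"))
```

So the filters themselves accept the missing product URL, which suggests the problem is upstream of filtering.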
essiembre commented 3 years ago

There are a few reasons this can happen. For instance, maybe a URL did not get updated because the sitemap indicated it did not change since the previous crawl. What do the logs say about those URLs?

peter-chan-hkmci commented 3 years ago

To rule out the last-modified date as the cause, I duplicated the application, cleaned all of the caches, and then ran the test again. However, no luck :(

essiembre commented 3 years ago

What about the logs? Maybe increase the verbosity if you have to, and look for what happened to the missing URLs. With the proper log level, every URL encountered should have an entry in the logs.

peter-chan-hkmci commented 3 years ago

I tried to increase the verbosity by setting the loggers below to DEBUG:

# log4j.properties
log4j.logger.com.norconex.collector.http=DEBUG
log4j.logger.com.norconex.collector.core=DEBUG

But I still cannot find the missing URLs, such as https://store.acer.com/de-de/nitro-5-gaming-notebook-an517-51-schwarz-13 (I searched the log using the keyword an517-51).

I confirm that the URL above is in the sitemap.

Attached is the full log: webcrawler-ec_95_M2.log

essiembre commented 3 years ago

I was able to reproduce it with what you shared. It turns out that <image> tags in your sitemap were making the parser fail on <url> entries containing them. I fixed the sitemap parser and made a new snapshot release (v2.x). Please give it a try and confirm.
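For context, the failure mode involved `<url>` entries carrying Google image-extension children. A hedged, illustrative sketch (not the actual Norconex parser code): a namespace-aware parse still recovers every page `<loc>` because the image elements live in a different namespace, which is roughly the behavior a tolerant sitemap resolver needs. The product paths below are made-up placeholders.

```python
# Illustrative sketch: a sitemap whose <url> entries include
# Google image-extension children. Iterating namespace-aware
# elements recovers both page URLs and ignores the image entries.
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://store.acer.com/de-de/some-product</loc>
    <image:image>
      <image:loc>https://store.acer.com/media/product.jpg</image:loc>
    </image:image>
  </url>
  <url>
    <loc>https://store.acer.com/de-de/another-product</loc>
  </url>
</urlset>
"""

root = ET.fromstring(sitemap_xml)
# <image:loc> is in the image namespace, so looking up the sitemap-namespace
# <loc> under each <url> returns only the page URL.
page_urls = [
    url.find(f"{{{SITEMAP_NS}}}loc").text
    for url in root.iter(f"{{{SITEMAP_NS}}}url")
]
print(page_urls)
```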