Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

crawl Sitemap data using norconex #453

Closed Navaminavu closed 4 years ago

Navaminavu commented 6 years ago

Hi, we have URLs in an .xml sitemap, and some date data needs to be crawled along with those URLs. The attached image is a snapshot of the sitemap; we need to crawl the Last Mod. data as well. Attaching the XML source code as well as our config: xmlsource.txt

<?xml version="1.0" encoding="UTF-8"?>

<httpcollector id="HTTPCollectorthecompany">

    #set($http = "com.norconex.collector.http")
    #set($core = "com.norconex.collector.core")
    #set($urlNormalizer   = "${http}.url.impl.GenericURLNormalizer")
    #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
    #set($filterRegexRef  = "${core}.filter.impl.RegexReferenceFilter")
    #set($http = "com.norconex.collector.http")
    #set($committer = "com.norconex.committer")
    #set($urlNormalizer   = "${http}.url.impl.GenericURLNormalizer")
    #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
    #set($sitemapFactory    = "${http}.sitemap.impl.StandardSitemapResolverFactory")

    <crawlerDefaults>
        <numThreads>2</numThreads>
        <orphansStrategy>DELETE</orphansStrategy>
        <delay default="500" />
        <robotsTxt ignore="false"
                class="com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider" />     

    </crawlerDefaults>
    <crawlers>

<crawler id="WWW_thecompany">
<sitemapResolverFactory ignore="true" lenient="true" class="com.norconex.collector.http.sitemap.impl.StandardSitemapResolverFactory"/>

<canonicalLinkDetector ignore="true" />
           <startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
                 <url>https://www.ericsson.com/thecompany/sustainability_corporateresponsibility/technology-for-good-blog</url>         
            </startURLs>

            <workDir>./output/thecompany</workDir>
            <maxDepth>6</maxDepth>
            <maxDocuments>-1</maxDocuments>
            <progressDir>./output/thecompany/progress</progressDir>
            <logsDir>./output/thecompany/logs</logsDir>

<referenceFilters>
   <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="exclude">jpeg,jpg,gif,png,ico,css,js</filter>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" caseSensitive="false" >https://www\.ericsson\.com/thecompany.*</filter>
 <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" caseSensitive="false" >.*/assets/.*</filter>
   <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" caseSensitive="false" >.*archive\.ericsson\.net.*</filter>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude" caseSensitive="false" >.*/wp-content/.*</filter>

</referenceFilters>

          <robotsTxt ignore="true" 
     class="com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider"/>

 <!--  <linkExtractors>
        <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
        <tags>
        <tag name="a" attribute="href" />
        <tag name="img" attribute="src" />
        </tags>
</extractor>
    </linkExtractors> -->

    <importer>
                <preParseHandlers>

                    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
                        <dom selector="div#content" toField="post_content" extract="text" overwrite="false"/>
                    </tagger>

                    <tagger class="com.norconex.importer.handler.tagger.impl.LanguageTagger" keepProbabilities="true" fallbackLanguage="en"></tagger>       
                </preParseHandlers>
                <postParseHandlers>
                    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
                        <fields>document.reference,document.language,title,dcterms.type,description,og:description,content,dcterms.issued,dcterms.modified,geo.country,dcterms.identifier,HashTags,CategoryTags,dcterms.identifier,dcterms.created,DCS.dcsuri,news_content,article_content,ourportfolio_content,main_content,document.language,ebody_content,og:type,keywords,post_content
                        </fields>
                    </tagger>
                    <transformer
                            class="com.norconex.importer.handler.transformer.impl.ReduceConsecutivesTransformer">
                        <reduce>\s</reduce>
                        <reduce>\n</reduce>
                        <reduce>\s\n</reduce>
                    </transformer>
                </postParseHandlers>
            </importer> 
            <!-- <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
            <directory>./output/www_thecompany1/crawledFiles</directory>
             </committer> -->
            <documentChecksummer
                                 class="com.norconex.collector.core.checksum.impl.MD5DocumentChecksummer"
                                 disabled="false"
                                 combineFieldsAndContent="true" keep="false">
                                         <sourceFields>                                       document.reference,document.language,title,dcterms.type,description,og:description,content,dcterms.issued,dcterms.modified,geo.country,dcterms.identifier,HashTags,CategoryTags,dcterms.identifier,dcterms.created,DCS.dcsuri,news_content,article_content,ourportfolio_content,main_content,document.language,ebody_content,og:type,keywords,post_content
                                         </sourceFields>
            </documentChecksummer>
            <spoiledReferenceStrategizer  class="com.norconex.collector.core.state.impl.GenericSpoiledReferenceStrategizer">
                        <mapping state="NOT_FOUND" strategy="DELETE" />
                        <mapping state="BAD_STATUS" strategy="DELETE" />
                        <mapping state="ERROR" strategy="IGNORE" />
            </spoiledReferenceStrategizer> 

                  <committer class="com.norconex.committer.i3.I3Committer">
                <i3Params>
                    <param name="indexURL">http://esesslx0790.ss.sw.ericsson.se:8080/rest/www_thecompany/documents.json  
                    </param>
                    <param name="Content-type">application/json;charset=utf-8</param>
                    <param name="custom_id">false</param>
                    <param name="id_prefix">product_catalog_</param>
                    <param name="id_field_name">ProductNumber</param>
                </i3Params>
                <commitBatchSize>10</commitBatchSize>
                <queueDir>./queue/thecompany/</queueDir>
                <queueSize>50</queueSize>
                <maxRetries>2</maxRetries>
                <maxRetryWait>500</maxRetryWait>
            </committer>  
        </crawler>
    </crawlers>
</httpcollector>
essiembre commented 6 years ago

You should get what you want by default. You are currently disabling sitemap support by ignoring it. You have to enable it by changing ignore to false:

<sitemapResolverFactory ignore="false" lenient="true" />

Then the sitemap date will be available as a metadata field called collector.sitemap-lastmod. You will need to add it to the list of fields you are keeping in your KeepOnlyTagger.
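
As a rough sketch of that second step, assuming the rest of the importer section stays as posted, the KeepOnlyTagger field list would gain one entry. The list below is a shortened subset of the original fields, kept short for readability only:

```xml
<tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
    <!-- collector.sitemap-lastmod is the added entry; the remaining fields
         are an abbreviated subset of the original list, for illustration. -->
    <fields>collector.sitemap-lastmod,document.reference,title,description,content,post_content</fields>
</tagger>
```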

Navaminavu commented 6 years ago

Hi, I tried this and we are not getting any sitemap metadata. The output is attached at this link: https://drive.google.com/drive/folders/1-N_OllZRJUcqp0kgGoYDJInciQUN2w2_?usp=sharing

This is the current config we have:

<?xml version="1.0" encoding="UTF-8"?>

<httpcollector id="HTTPCollectorthecompany">

    #set($http = "com.norconex.collector.http")
    #set($core = "com.norconex.collector.core")
    #set($urlNormalizer   = "${http}.url.impl.GenericURLNormalizer")
    #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
    #set($filterRegexRef  = "${core}.filter.impl.RegexReferenceFilter")
    #set($http = "com.norconex.collector.http")
    #set($committer =   "com.norconex.committer")
    #set($urlNormalizer   = "${http}.url.impl.GenericURLNormalizer")
    #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
    #set($sitemapFactory    = "${http}.sitemap.impl.StandardSitemapResolverFactory")

    <crawlerDefaults>
        <numThreads>2</numThreads>
        <orphansStrategy>DELETE</orphansStrategy>
        <delay default="500" />
        <robotsTxt ignore="false" class="com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider"/>     

    </crawlerDefaults>
    <crawlers>

<crawler id="WWW_thecompany">
<sitemapResolverFactory ignore="false" lenient="true" class="com.norconex.collector.http.sitemap.impl.StandardSitemapResolverFactory"/>

<canonicalLinkDetector ignore="true" />
           <startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
                 <sitemap>https://www.ericsson.com/thecompany/sustainability_corporateresponsibility/technology-for-good-blog/sitemap_index.xml</sitemap> 

                 <!--<sitemap>https://www.ericsson.com/thecompany/sustainability_corporateresponsibility/technology-for-good-blog/post-sitemap.xml</sitemap> 
                 <sitemap>https://www.ericsson.com/thecompany/sustainability_corporateresponsibility/technology-for-good-blog/category-sitemap.xml</sitemap>
                 <sitemap> https://www.ericsson.com/thecompany/sustainability_corporateresponsibility/technology-for-good-blog/post_tag-sitemap.xml</sitemap>-->
            </startURLs>

            <workDir>./output/thecompany</workDir>
            <maxDepth>2</maxDepth>
            <maxDocuments>-1</maxDocuments>
            <progressDir>./output/thecompany/progress</progressDir>
            <logsDir>./output/thecompany/logs</logsDir>

<referenceFilters>
   <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="exclude">jpeg,jpg,gif,png,ico,css,js</filter>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" caseSensitive="false" >https://www\.ericsson\.com/thecompany/.*</filter>
 <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" caseSensitive="false" >.*/assets/.*</filter>
   <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" caseSensitive="false" >.*archive\.ericsson\.net.*</filter>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude" caseSensitive="false" >.*/wp-content/.*</filter>
 </referenceFilters>

 <!--  <linkExtractors>
        <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
        <tags>
        <tag name="a" attribute="href" />
        <tag name="img" attribute="src" />
        </tags>
</extractor>
    </linkExtractors> -->

    <importer>
                <preParseHandlers>

                    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
                        <dom selector="div#content" toField="post_content" extract="text" overwrite="false"/>
                    </tagger>

                    <tagger class="com.norconex.importer.handler.tagger.impl.LanguageTagger" keepProbabilities="true" fallbackLanguage="en"></tagger>       
                </preParseHandlers>
                <postParseHandlers>
                    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
                        <fields>collector.sitemap-lastmod,document.reference,document.language,title,dcterms.type,description,og:description,content,dcterms.issued,dcterms.modified,geo.country,dcterms.identifier,HashTags,CategoryTags,dcterms.identifier,dcterms.created,DCS.dcsuri,news_content,article_content,ourportfolio_content,main_content,document.language,ebody_content,og:type,keywords,post_content,
                        </fields>
                    </tagger>
                    <transformer
                            class="com.norconex.importer.handler.transformer.impl.ReduceConsecutivesTransformer">
                        <reduce>\s</reduce>
                        <reduce>\n</reduce>
                        <reduce>\s\n</reduce>
                    </transformer>
                </postParseHandlers>
            </importer> 
            <!-- <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
            <directory>./output/www_thecompany1/crawledFiles</directory>
             </committer> -->
            <documentChecksummer
                                 class="com.norconex.collector.core.checksum.impl.MD5DocumentChecksummer"
                                 disabled="false"
                                 combineFieldsAndContent="true" keep="false">
                                         <sourceFields>                                       document.reference,document.language,title,dcterms.type,description,og:description,content,dcterms.issued,dcterms.modified,geo.country,dcterms.identifier,HashTags,CategoryTags,dcterms.identifier,dcterms.created,DCS.dcsuri,news_content,article_content,ourportfolio_content,main_content,document.language,ebody_content,og:type,keywords,post_content,collector.sitemap-lastmod
                                         </sourceFields>
            </documentChecksummer>
            <spoiledReferenceStrategizer  class="com.norconex.collector.core.state.impl.GenericSpoiledReferenceStrategizer">
                        <mapping state="NOT_FOUND" strategy="DELETE" />
                        <mapping state="BAD_STATUS" strategy="DELETE" />
                        <mapping state="ERROR" strategy="IGNORE" />
            </spoiledReferenceStrategizer> 

                 <!-- <committer class="com.norconex.committer.i3.I3Committer">
                <i3Params>
                    <param name="indexURL">http://esesslx0790.ss.sw.ericsson.se:8080/rest/www_thecompany/documents.json  
                    </param>
                    <param name="Content-type">application/json;charset=utf-8</param>
                    <param name="custom_id">false</param>
                    <param name="id_prefix">product_catalog_</param>
                    <param name="id_field_name">ProductNumber</param>
                </i3Params>
                <commitBatchSize>10</commitBatchSize>
                <queueDir>./queue/thecompany/</queueDir>
                <queueSize>50</queueSize>
                <maxRetries>2</maxRetries>
                <maxRetryWait>500</maxRetryWait>
                </committer>-->
                <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
            <directory>./output/thecompany/crawledFiles</directory>
            </committer>  
        </crawler>
    </crawlers>
</httpcollector>
Navaminavu commented 6 years ago

Hi Pascal, can you help me with this issue?

essiembre commented 6 years ago

I just tried your config and it works as expected. Many records have collector.sitemap-lastmod in them. See sample.zip for a few examples.

From the 4 examples I attached, only one does not have the sitemap field. The reason is a duplicate in your sitemap. Have a look at the first entry here: https://www.ericsson.com/thecompany/sustainability_corporateresponsibility/technology-for-good-blog/post-sitemap.xml

You can see https://www.ericsson.com/thecompany/sustainability_corporateresponsibility/technology-for-good-blog appears twice, the first time without a date. The crawler will only register the first one it finds.
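
To illustrate the duplicate case, here is a hypothetical excerpt of what such a sitemap might look like. It is not copied from the actual post-sitemap.xml, and the lastmod value is a made-up placeholder:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- First occurrence: no lastmod element. This is the entry the crawler registers. -->
  <url>
    <loc>https://www.ericsson.com/thecompany/sustainability_corporateresponsibility/technology-for-good-blog</loc>
  </url>
  <!-- Second occurrence of the same URL: its date (placeholder value) is never picked up. -->
  <url>
    <loc>https://www.ericsson.com/thecompany/sustainability_corporateresponsibility/technology-for-good-blog</loc>
    <lastmod>2018-01-01T00:00:00+00:00</lastmod>
  </url>
</urlset>
```

With entries like these, the document for that URL would end up without collector.sitemap-lastmod, because only the first, dateless entry is registered.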

There could be a few other reasons too that I have not identified. For instance, I see you have maxDepth set to 2. I recommend you set it to 0 if you want to crawl only the sitemap URLs and nothing more. Otherwise, URLs extracted from pages listed in your sitemap will be followed (up to two levels deep). Any document discovered outside your sitemap won't have the sitemap last modified date.
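
A minimal sketch of the maxDepth suggestion, reusing element names and the sitemap URL from the config posted above. Only the relevant elements are shown, so this is a fragment and not a complete crawler definition:

```xml
<crawler id="WWW_thecompany">
    <sitemapResolverFactory ignore="false" lenient="true" />
    <startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
        <sitemap>https://www.ericsson.com/thecompany/sustainability_corporateresponsibility/technology-for-good-blog/sitemap_index.xml</sitemap>
    </startURLs>
    <!-- 0 keeps the crawl to the URLs listed in the sitemap; links found
         inside those pages are not followed. -->
    <maxDepth>0</maxDepth>
</crawler>
```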

When you make config changes, it is also a good idea to clear your "workdir" to make sure the sitemap is recrawled fresh each time.

Does that help?

Navaminavu commented 6 years ago

Hi, it did work for some URLs in the sitemap, but it is not working for all of them. Some URLs that have a date in the sitemap don't have date information in the collector.sitemap-lastmod field.

essiembre commented 6 years ago

You are correct, this is also what I observed, and I gave you the causes. Do you have cases that do not fit the causes I mentioned? Have you tried going 0 levels deep and removing duplicate URLs from your sitemap.xml?