Closed Navaminavu closed 4 years ago
You should get what you want by default. You are currently disabling sitemap support by ignoring it. You have to enable it by changing ignore to false:
<sitemapResolverFactory ignore="false" lenient="true" />
Then the sitemap date will be available as a metadata field called collector.sitemap-lastmod. You will need to add it to the list of fields you are keeping in your KeepOnlyTagger.
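Putting both changes together, a minimal crawler fragment could look like the sketch below (the crawler id and the other field names are placeholders, keep your own):

<crawler id="my-crawler">
  <!-- ignore="false" enables sitemap resolution; lenient="true" accepts
       sitemap entries outside the sitemap file's own directory scope -->
  <sitemapResolverFactory ignore="false" lenient="true" />
  <importer>
    <postParseHandlers>
      <!-- collector.sitemap-lastmod must be listed here,
           or KeepOnlyTagger will drop it -->
      <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
        <fields>document.reference,title,collector.sitemap-lastmod</fields>
      </tagger>
    </postParseHandlers>
  </importer>
</crawler>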
Hi, I tried this and we are not getting any sitemap metadata. Attaching the output in the link here: https://drive.google.com/drive/folders/1-N_OllZRJUcqp0kgGoYDJInciQUN2w2_?usp=sharing
This is the current config we have:
<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="HTTPCollectorthecompany">
#set($http = "com.norconex.collector.http")
#set($core = "com.norconex.collector.core")
#set($committer = "com.norconex.committer")
#set($urlNormalizer = "${http}.url.impl.GenericURLNormalizer")
#set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
#set($filterRegexRef = "${core}.filter.impl.RegexReferenceFilter")
#set($sitemapFactory = "${http}.sitemap.impl.StandardSitemapResolverFactory")
<crawlerDefaults>
<numThreads>2</numThreads>
<orphansStrategy>DELETE</orphansStrategy>
<delay default="500" />
<robotsTxt ignore="false" class="com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider"/>
</crawlerDefaults>
<crawlers>
<crawler id="WWW_thecompany">
<sitemapResolverFactory ignore="false" lenient="true" class="com.norconex.collector.http.sitemap.impl.StandardSitemapResolverFactory"/>
<canonicalLinkDetector ignore="true" />
<startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
<sitemap>https://www.ericsson.com/thecompany/sustainability_corporateresponsibility/technology-for-good-blog/sitemap_index.xml</sitemap>
<!--<sitemap>https://www.ericsson.com/thecompany/sustainability_corporateresponsibility/technology-for-good-blog/post-sitemap.xml</sitemap>
<sitemap>https://www.ericsson.com/thecompany/sustainability_corporateresponsibility/technology-for-good-blog/category-sitemap.xml</sitemap>
<sitemap> https://www.ericsson.com/thecompany/sustainability_corporateresponsibility/technology-for-good-blog/post_tag-sitemap.xml</sitemap>-->
</startURLs>
<workDir>./output/thecompany</workDir>
<maxDepth>2</maxDepth>
<maxDocuments>-1</maxDocuments>
<progressDir>./output/thecompany/progress</progressDir>
<logsDir>./output/thecompany/logs</logsDir>
<referenceFilters>
<filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="exclude">jpeg,jpg,gif,png,ico,css,js</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" caseSensitive="false" >https://www\.ericsson\.com/thecompany/.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" caseSensitive="false" >.*/assets/.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" caseSensitive="false" >.*archive\.ericsson\.net.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude" caseSensitive="false" >.*/wp-content/.*</filter>
</referenceFilters>
<!-- <linkExtractors>
<extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
<tags>
<tag name="a" attribute="href" />
<tag name="img" attribute="src" />
</tags>
</extractor>
</linkExtractors> -->
<importer>
<preParseHandlers>
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
<dom selector="div#content" toField="post_content" extract="text" overwrite="false"/>
</tagger>
<tagger class="com.norconex.importer.handler.tagger.impl.LanguageTagger" keepProbabilities="true" fallbackLanguage="en"></tagger>
</preParseHandlers>
<postParseHandlers>
<tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
<fields>collector.sitemap-lastmod,document.reference,document.language,title,dcterms.type,description,og:description,content,dcterms.issued,dcterms.modified,geo.country,dcterms.identifier,HashTags,CategoryTags,dcterms.created,DCS.dcsuri,news_content,article_content,ourportfolio_content,main_content,ebody_content,og:type,keywords,post_content
</fields>
</tagger>
<transformer
class="com.norconex.importer.handler.transformer.impl.ReduceConsecutivesTransformer">
<reduce>\s</reduce>
<reduce>\n</reduce>
<reduce>\s\n</reduce>
</transformer>
</postParseHandlers>
</importer>
<!-- <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
<directory>./output/www_thecompany1/crawledFiles</directory>
</committer> -->
<documentChecksummer
class="com.norconex.collector.core.checksum.impl.MD5DocumentChecksummer"
disabled="false"
combineFieldsAndContent="true" keep="false">
<sourceFields>document.reference,document.language,title,dcterms.type,description,og:description,content,dcterms.issued,dcterms.modified,geo.country,dcterms.identifier,HashTags,CategoryTags,dcterms.created,DCS.dcsuri,news_content,article_content,ourportfolio_content,main_content,ebody_content,og:type,keywords,post_content,collector.sitemap-lastmod
</sourceFields>
</documentChecksummer>
<spoiledReferenceStrategizer class="com.norconex.collector.core.state.impl.GenericSpoiledReferenceStrategizer">
<mapping state="NOT_FOUND" strategy="DELETE" />
<mapping state="BAD_STATUS" strategy="DELETE" />
<mapping state="ERROR" strategy="IGNORE" />
</spoiledReferenceStrategizer>
<!-- <committer class="com.norconex.committer.i3.I3Committer">
<i3Params>
<param name="indexURL">http://esesslx0790.ss.sw.ericsson.se:8080/rest/www_thecompany/documents.json
</param>
<param name="Content-type">application/json;charset=utf-8</param>
<param name="custom_id">false</param>
<param name="id_prefix">product_catalog_</param>
<param name="id_field_name">ProductNumber</param>
</i3Params>
<commitBatchSize>10</commitBatchSize>
<queueDir>./queue/thecompany/</queueDir>
<queueSize>50</queueSize>
<maxRetries>2</maxRetries>
<maxRetryWait>500</maxRetryWait>
</committer>-->
<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
<directory>./output/thecompany/crawledFiles</directory>
</committer>
</crawler>
</crawlers>
</httpcollector>
Hi Pascal, can you help me with this issue?
I just tried your config and it works as expected. Many records have collector.sitemap-lastmod in them. See sample.zip for a few examples.
From the 4 examples I attached, only one does not have the sitemap field. The reason is a duplicate in your sitemap. Have a look at the first entry here: https://www.ericsson.com/thecompany/sustainability_corporateresponsibility/technology-for-good-blog/post-sitemap.xml
You can see https://www.ericsson.com/thecompany/sustainability_corporateresponsibility/technology-for-good-blog appears twice, the first time without a date. The crawler will only register the first one it finds.
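For illustration (the URLs below are placeholders, not the actual sitemap contents), a duplicate entry pattern like this would cause the problem, since only the first <url> for a given location is registered:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- First occurrence wins: no <lastmod>,
       so no collector.sitemap-lastmod is recorded -->
  <url>
    <loc>https://www.example.com/blog</loc>
  </url>
  <!-- This second entry with a date is ignored as a duplicate -->
  <url>
    <loc>https://www.example.com/blog</loc>
    <lastmod>2018-05-14</lastmod>
  </url>
</urlset>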
There could be a few other reasons too that I have not identified. For instance, I see you have maxDepth set to 2. I recommend you set it to 0 if you want to crawl only sitemap URLs and nothing more. Otherwise, URLs extracted from pages identified in your sitemap will be followed (2 levels deep), and any document discovered outside your sitemap won't have the sitemap last-modified date.
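Applied to the config above, restricting the crawl to sitemap URLs only is a one-line change (sketch showing just the relevant setting in context):

<crawler id="WWW_thecompany">
  <!-- 0 = process only the start URLs (here, the URLs listed in the
       sitemap); links found inside those pages are not followed -->
  <maxDepth>0</maxDepth>
</crawler>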
When you make config changes, it is also a good idea to clear your "workdir" to make sure the sitemap is recrawled fresh each time.
Does that help?
Hi, it did work for some URLs in the sitemap, but it is not working for all. Some URLs which have a date in the sitemap don't have date information in the collector.sitemap-lastmod field.
You are correct, this is also what I observed and I gave you the causes. Do you have cases that do not fit the causes I mention? Have you tried going 0-level deep and removing duplicate URLs from your sitemap.xml?
Hi, we have URLs in a .xml sitemap where some date data also needs to be crawled along with the URLs from the sitemap. This is a snapshot of the sitemap, and we need to crawl the Last Mod. data as well. Attaching the XML source code as well as our config: xmlsource.txt