Closed peter-chan-hkmci closed 3 years ago
There are a few reasons this can happen. For instance, maybe a URL did not get updated because the sitemap indicated it did not change since the previous crawl. What do the logs say about those URLs?
To rule out the last-modified date as the cause, I duplicated the application, cleared all of the caches, and ran the test again. However, no luck :(
What about the logs? Maybe increase the verbosity if you have to, and look for what happened to the missing URLs. With the proper log level, every URL encountered should have an entry in the logs.
I tried to increase the verbosity by setting the loggers below to DEBUG:
# log4j.properties
log4j.logger.com.norconex.collector.http=DEBUG
log4j.logger.com.norconex.collector.core=DEBUG
But I still cannot find the missing URLs in the log, for example https://store.acer.com/de-de/nitro-5-gaming-notebook-an517-51-schwarz-13 (I searched for it using the keyword an517-51).
I confirm that the URL above is in the sitemap.
I have attached the full log: webcrawler-ec_95_M2.log
I was able to reproduce with what you shared. It turns out that having <image> tags in your sitemap was causing the parser to fail on the <url> entries that contain them. I fixed the sitemap parser and made a new snapshot release (v2.x). Please give it a try and confirm.
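For anyone who wants to check a sitemap independently of the crawler, here is a minimal Python sketch that lists the page URLs in a sitemap while skipping image extension tags. It is not the Norconex parser and does not reproduce the fix; it assumes the image tags follow the Google image sitemap extension (`<image:image>` with its own namespace), and the second page URL and the image URL in the sample are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the sitemap protocol and of the Google image extension.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"


def page_urls(sitemap_xml: str) -> list[str]:
    """Return the <loc> of every <url> entry, ignoring extension children."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for url_el in root.findall(f"{{{SITEMAP_NS}}}url"):
        # Only the direct sitemap-namespace <loc> is the page URL;
        # <image:loc> lives in another namespace and is not matched here.
        loc = url_el.find(f"{{{SITEMAP_NS}}}loc")
        if loc is not None and loc.text:
            urls.append(loc.text.strip())
    return urls


# Sample sitemap: the first entry carries an image tag, the second does not.
# The example.com URLs are hypothetical.
SAMPLE = f"""<urlset xmlns="{SITEMAP_NS}" xmlns:image="{IMAGE_NS}">
  <url>
    <loc>https://store.acer.com/de-de/nitro-5-gaming-notebook-an517-51-schwarz-13</loc>
    <image:image>
      <image:loc>https://example.com/img/an517-51.jpg</image:loc>
    </image:image>
  </url>
  <url>
    <loc>https://example.com/de-de/other-product</loc>
  </url>
</urlset>"""

if __name__ == "__main__":
    for url in page_urls(SAMPLE):
        print(url)
```

If a URL shows up in this list but never appears in the crawler logs at DEBUG level, that points at sitemap parsing rather than at filters or the last-modified check.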
My client is using version 2.8.2-SNAPSHOT and found that some URLs were not updated in the search engine.
For example: https://store.acer.com/de-de/nitro-5-gaming-notebook-an517-51-schwarz-13
I checked and found that the crawler did not fetch this URL, even though it is included in the sitemap.
My client doesn't want to change the crawler program much.
Is there any workaround or hotfix for this version?
The config is below: