Closed. github-il closed this issue 4 years ago
No, it is not the expected behavior. I noticed this line in your logs though:
INFO [JobSuite] Previous execution detected.
If you previously ran it with a different domain configured, it is possible those URLs were cached and are being re-verified. If so, you can change the default "orphan strategy". A safer approach when you make a significant change to the config (or when unsure) is to clean the work directory (especially the crawl store) and try again. Let me know if that fixes it. If not, can you please share your config?
Pascal,
Thanks for your update.
I removed the work directory and re-ran the crawl, and I still see the same behavior.
I have attached the config file (canonical.xml), the Apache index.html page I am using, and the log file (debug.log) in the following zip file: files.zip
Thanks
I had time to reproduce the issue and provide a fix. You can find it in the latest snapshot. You may have to start fresh (as it will try to reprocess orphans by default).
Please confirm.
Hi, what is the expected behavior when the crawler encounters a canonical link in a document that points to another domain, and stayOnDomain is set to true?
I'm seeing that the canonical link is followed, which seems counter-intuitive to me.
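The expectation described above can be sketched in Python as follows. This is only an illustration of the check, not the collector's actual code: the names `CanonicalFinder` and `canonical_to_follow`, the `stay_on_domain` parameter, and the example.com URL are all hypothetical, and "same domain" is assumed here to mean an exact host name match.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonical = attr["href"]


def canonical_to_follow(page_url, html, stay_on_domain=True):
    """Return the canonical URL a crawler would queue, or None to skip it.

    Hypothetical helper: with stay_on_domain enabled, a canonical link
    whose host differs from the page's host is not followed.
    """
    finder = CanonicalFinder()
    finder.feed(html)
    if finder.canonical is None:
        return None
    # Resolve relative canonical hrefs against the page URL.
    target = urljoin(page_url, finder.canonical)
    same_host = urlsplit(target).hostname == urlsplit(page_url).hostname
    if stay_on_domain and not same_host:
        return None
    return target
```

With a page at `http://localhost/index.html` whose canonical link points to `http://www.example.com/page.html`, this sketch returns `None` when `stay_on_domain` is true, which is the behavior the question expects from the crawler.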
I set up an Ubuntu VM, installed Apache and the http-collector (2.8.1) software, and added a canonical link to the default index.html page as shown below:
I then modified minimum-config.xml to point to the local Apache server and set the depth to 1. The example config has stayOnDomain set to true.
However, when I run the crawl I see that the canonical link is followed:
Is this a bug, or expected behavior?
Thanks.