dtulyakov / googlesitemapgenerator

Automatically exported from code.google.com/p/googlesitemapgenerator

Web Server Filter Includes 404 Pages in Sitemap #148

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Set up GSMG for a domain with the Web Server Filter enabled.
2. Go to any non-existent page on the domain that returns a 404.
3. Check the sitemap file; it will list the URL of the non-existent page.

What is the expected output? What do you see instead?
Pages that give 404s should not appear in the sitemap.

What version of the product are you using? On what operating system?
sitemap_linux-i386-beta1-20091231.tar.gz
CentOS 5.5

Please provide any additional information below.
This serious bug makes each sitemap grow continually as 404 pages are added. 
This is a huge problem. Can someone from Google please give some attention to 
this project?

Original issue reported on code.google.com by pastordanwalker@gmail.com on 24 Nov 2010 at 5:14

GoogleCodeExporter commented 8 years ago
This is a serious issue, and is impacting the crawler results on our site. Are 
there any workarounds or fixes for this issue which we can use?

Thanks for your input.

Original comment by amit.aro...@gmail.com on 20 May 2011 at 10:29

GoogleCodeExporter commented 8 years ago
You can try to apply this patch as a workaround.
src/common/basefilter.cc

Original comment by okaren...@gmail.com on 30 Sep 2011 at 4:02

Attachments:

GoogleCodeExporter commented 8 years ago
Thanks for the patch. I applied it, recompiled the code on 32-bit CentOS 4.2, 
and installed the new version of the sitemap generator, but 404 pages are 
still being added to the sitemap. Any suggestions?

Original comment by rob.ba...@gmail.com on 15 Jul 2012 at 2:26

GoogleCodeExporter commented 8 years ago
Any info on how to apply this patch would be useful. Thanks.

Original comment by zu...@wsg.co on 14 Mar 2014 at 5:08

GoogleCodeExporter commented 8 years ago
The patch is not necessary if you're running the latest version of GSG. In my 
case, 404 pages were included in the sitemaps because the server returned a 
200 HTTP response code even though the page was a 404 page. Basically, I was 
setting the status code to 404 and then, without noticing, changing it back 
to 200 later in the code.

So the first thing to do is find out whether your server actually returns a 
404 status code for the page in question. If it doesn't, you need to fix that 
first.
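
If it helps, here is a rough way to check what status code the server really 
sends. This is only a sketch using Python's standard library; the function 
names (fetch_status, is_soft_404) are made up for this example and are not 
part of GSG. Note that urlopen follows redirects, so a missing page that 
redirects to a normal page will also show up as 200.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server sends for `url`."""
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the status code
        # is available on the exception object.
        return err.code


def is_soft_404(url: str) -> bool:
    """True if a URL that should not exist is still answered with 200."""
    return fetch_status(url) == 200
```

For example, calling fetch_status on a made-up path such as 
http://www.example.com/this-page-should-not-exist (substitute your own 
domain) should give 404. If it gives 200, the error page is being served 
with the wrong status code, and the filter has no way to tell it apart from 
a real page.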

Once that was fixed, I followed these steps to regenerate the sitemaps:

1. Stopped GSG daemon.
2. Removed cache folder from GSG installation path.
3. Removed sitemaps from website folder.
4. Started GSG daemon.

After that, GSG took some time to regenerate the sitemaps, since they had all 
been deleted, but the new sitemaps do not include 404 pages.
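
Steps 2 and 3 above could be scripted roughly like this. It is a sketch 
only: the cache folder location, the website folder, and the 
"sitemap*.xml*" file pattern are assumptions that depend on your 
installation, and clear_gsg_output is a made-up helper name. Stopping and 
starting the daemon (steps 1 and 4) is left to whatever method your install 
uses.

```python
import shutil
from pathlib import Path
from typing import List


def clear_gsg_output(cache_dir: Path, site_root: Path,
                     pattern: str = "sitemap*.xml*") -> List[Path]:
    """Delete the cache folder and matching sitemap files.

    Returns the list of paths that were removed. The default pattern is
    an assumption; adjust it to match your sitemap file names.
    """
    removed = []
    if cache_dir.is_dir():
        # Step 2: remove the cache folder and everything in it.
        shutil.rmtree(cache_dir)
        removed.append(cache_dir)
    # Step 3: remove sitemap files directly under the website folder.
    for sitemap in sorted(site_root.glob(pattern)):
        if sitemap.is_file():
            sitemap.unlink()
            removed.append(sitemap)
    return removed
```

Run it only while the daemon is stopped, and check the returned list to 
confirm that nothing other than the cache folder and sitemap files was 
removed.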

Hope this helps.

Original comment by zu...@wsg.co on 27 Mar 2014 at 1:57