Closed by GoogleCodeExporter 8 years ago
From everything I've read online, the Sitemap directive must be a fully
qualified URL. Are you seeing relative URLs that Googlebot/Bing/Yahoo actually
follow?
Original comment by kkrugler...@transpac.com
on 31 Mar 2014 at 12:49
Just checking in - do we have any more details on whether relative sitemap URLs
should be accepted?
Original comment by kkrugler...@transpac.com
on 12 Apr 2014 at 5:33
http://www.sitemaps.org/protocol.html#submit_robots
Fully qualified URLs, Ken. I don't have any problem with prepending the host
to the relative entry in order to make robots.txt sitemap
extraction/identification more complete; however, it is _not_ standard practice.
Original comment by lewis.mc...@gmail.com
on 13 Apr 2014 at 11:09
Hi Lewis - I was hoping to find out from Julien whether there were any crawlers
that actually handled relative URLs to sitemaps. If there are, then I agree we
should handle them, otherwise I think it's better to stay (somewhat) consistent
with current industry practices, and continue rejecting them.
Original comment by kkrugler...@transpac.com
on 14 Apr 2014 at 12:04
Hi Ken, sorry for the late reply. I haven't looked at the way this is handled by
the main crawlers yet. I agree that if none of them do it, let's stick to rejecting relative URLs.
Original comment by digitalpebble
on 14 Apr 2014 at 8:50
I am seeing loads of cases like these. I have no idea how to check whether
Google handles them or not, but given the frequency I think we should do it. It
probably isn't what the robots.txt specs recommend, but it seems to be a
relatively common practice.
Original comment by digitalpebble
on 16 Jun 2014 at 8:57
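For illustration, a hypothetical robots.txt showing the non-standard relative form alongside the fully qualified form that sitemaps.org requires (example.com and the file names are made up):

```text
User-agent: *
Disallow: /private/

# Non-standard: relative reference, seen in the wild
Sitemap: /sitemap.xml

# Standard per sitemaps.org: fully qualified URL
Sitemap: http://example.com/sitemap_index.xml
```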
OK, I'll do it - give me some test cases as a patch, and I'll add the support :)
Original comment by kkrugler...@transpac.com
on 24 Jun 2014 at 3:07
Hi Ken, here is a patch to create a test case. Thanks!
Original comment by digitalpebble
on 25 Jun 2014 at 8:36
Attached patch which implements the functionality + simplifies the code.
Comments welcome.
Original comment by digitalpebble
on 12 Jan 2015 at 12:47
It is worth mentioning that the latest patch includes the previous patch, so
there is no need to download both (just the latest).
Original comment by avrah...@gmail.com
on 12 Jan 2015 at 5:50
Any objections to committing this?
Original comment by digitalpebble
on 22 Jan 2015 at 10:36
I looked at it and it seems ok, but I am no robots expert
Original comment by avrah...@gmail.com
on 22 Jan 2015 at 10:39
Committed revision 158.
Thanks!
Original comment by digitalpebble
on 22 Jan 2015 at 10:54
Hi Julien - one quick question. Previously the code had this check:

    if ((hostname != null) && (hostname.length() > 0)) {

before calling state.addSitemap(sitemap). So is it the case now that you'll
never have a sitemap URL which doesn't have a real hostname, as that was only
happening with relative URLs?
Original comment by kkrugler...@transpac.com
on 22 Jan 2015 at 2:28
Hi Ken. Not sure what the previous code was supposed to do. It was first
checking the hostname with

    String hostname = new URL(sitemap).getHost();

then checking it again with:

    hostname = new URI(sitemap).getHost();

which seems completely unnecessary.

> So is it the case now that you'll never have a sitemap URL which doesn't have
a real hostname, as that was only happening with relative URLs?

We still check that it has a hostname, so we are completely safe.
see
[https://code.google.com/p/crawler-commons/source/diff?spec=svn158&r=158&format=side&path=/trunk/src/main/java/crawlercommons/robots/SimpleRobotRulesParser.java]
Original comment by digitalpebble
on 22 Jan 2015 at 2:40
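For readers following along, here is a minimal standalone sketch (not the actual crawler-commons patch; the class and method names are made up) of the two ideas discussed above: resolving a possibly-relative sitemap entry against the robots.txt URL, and keeping the hostname check as a safety net:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class SitemapResolver {

    /**
     * Resolve a sitemap reference found in robots.txt, which may be relative,
     * against the URL of the robots.txt file itself. Returns the absolute URL
     * as a string, or null if the result is invalid or has no hostname.
     */
    public static String resolveSitemap(String robotsTxtUrl, String sitemapRef) {
        try {
            URL base = new URL(robotsTxtUrl);
            // The two-argument URL constructor performs relative resolution;
            // an already-absolute sitemapRef is returned unchanged.
            URL sitemap = new URL(base, sitemapRef.trim());
            // Safety net, as in the discussion: only accept sitemap URLs
            // that actually have a hostname.
            String host = sitemap.getHost();
            if (host == null || host.length() == 0) {
                return null;
            }
            return sitemap.toExternalForm();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Relative entry resolved against the robots.txt host.
        System.out.println(resolveSitemap("http://example.com/robots.txt", "/sitemap.xml"));
        // Fully qualified entry passes through untouched.
        System.out.println(resolveSitemap("http://example.com/robots.txt", "http://other.com/map.xml"));
    }
}
```

The design choice here mirrors the thread: relative references are accepted and made absolute, while the existing "has a real hostname" guard is kept so nothing without a host ever reaches addSitemap.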
Original issue reported on code.google.com by
digitalpebble
on 27 Mar 2014 at 3:07