AgenteFarron / crawler-commons

Automatically exported from code.google.com/p/crawler-commons

[Robots] Resolve relative URL for sitemaps #32

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
2014-03-27 13:55:25,730 WARN crawlercommons.robots.SimpleRobotRulesParser: Problem processing robots.txt for http://www.iglobal.co/mexico/render_phone_view/victoria-cortes-maria-del-carmen-1

2014-03-27 13:55:25,730 WARN crawlercommons.robots.SimpleRobotRulesParser: Invalid URL with sitemap directive: /sitemap.xml

Original issue reported on code.google.com by digitalpebble on 27 Mar 2014 at 3:07
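
For readers following along, here is a minimal sketch of why a directive such as `Sitemap: /sitemap.xml` can trigger the warning above: a relative path cannot be parsed as a standalone `java.net.URL`. The class and method names (`RelativeSitemapDemo`, `isAbsoluteUrl`) are hypothetical and are not taken from crawler-commons.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class, not part of crawler-commons.
class RelativeSitemapDemo {

    /** Returns true if the string parses as a URL on its own, without a base. */
    static boolean isAbsoluteUrl(String candidate) {
        try {
            new URL(candidate);
            return true;
        } catch (MalformedURLException e) {
            // A value like "/sitemap.xml" has no protocol, so parsing fails.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAbsoluteUrl("/sitemap.xml"));                      // false
        System.out.println(isAbsoluteUrl("http://www.iglobal.co/sitemap.xml")); // true
    }
}
```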

GoogleCodeExporter commented 8 years ago
From everything I've read online, the sitemap directive must be a fully 
qualified URL. Are you seeing relative URLs that Googlebot/Bing/Yahoo actually 
follow?

Original comment by kkrugler...@transpac.com on 31 Mar 2014 at 12:49

GoogleCodeExporter commented 8 years ago
Just checking in - do we have any more details on whether relative sitemap URLs 
should be accepted?

Original comment by kkrugler...@transpac.com on 12 Apr 2014 at 5:33

GoogleCodeExporter commented 8 years ago
http://www.sitemaps.org/protocol.html#submit_robots

Fully qualified URLs, Ken. I do not have any problem with prepending the host 
to the relative entry in order to make robots.txt sitemap 
extraction/identification more complete; however, it is _not_ standard practice. 

Original comment by lewis.mc...@gmail.com on 13 Apr 2014 at 11:09
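
A rough sketch of what "prepending the host to the relative entry" could look like in Java, using the two-argument `java.net.URL` constructor to resolve the directive against the robots.txt URL. This only illustrates the idea and is not the patch later attached to this issue; `SitemapResolver`, `resolve`, and the example.com/example.org URLs are assumptions made for the example.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Optional;

// Hypothetical helper, not the crawler-commons implementation.
class SitemapResolver {

    /**
     * Resolves a sitemap directive against the URL of the robots.txt file.
     * Absolute values are returned unchanged; relative ones inherit the
     * scheme and host of the robots.txt URL. Returns empty if either
     * argument cannot be parsed.
     */
    static Optional<String> resolve(String robotsTxtUrl, String sitemap) {
        try {
            URL base = new URL(robotsTxtUrl);
            return Optional.of(new URL(base, sitemap).toExternalForm());
        } catch (MalformedURLException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String robots = "http://www.example.com/robots.txt"; // assumed example URL
        // Prints Optional[http://www.example.com/sitemap.xml]
        System.out.println(resolve(robots, "/sitemap.xml"));
        // Prints Optional[http://other.example.org/map.xml]
        System.out.println(resolve(robots, "http://other.example.org/map.xml"));
    }
}
```

Resolving against the robots.txt URL, rather than concatenating strings, also covers directives that are already fully qualified, so both cases go through the same code path.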

GoogleCodeExporter commented 8 years ago
Hi Lewis - I was hoping to find out from Julien whether there were any crawlers 
that actually handled relative URLs to sitemaps. If there are, then I agree we 
should handle them, otherwise I think it's better to stay (somewhat) consistent 
with current industry practices, and continue rejecting them.

Original comment by kkrugler...@transpac.com on 14 Apr 2014 at 12:04

GoogleCodeExporter commented 8 years ago
Hi Ken, sorry for the late reply. Haven't looked at the way this is handled by 
the main crawlers yet. I agree that if it is not done then let's stick to that. 

Original comment by digitalpebble on 14 Apr 2014 at 8:50

GoogleCodeExporter commented 8 years ago
Am seeing loads of cases like these. I have no idea how to check whether 
Google handles them or not, but given the frequency I think we should do it. It 
probably isn't what the robots.txt specs recommend, but it seems to be a 
relatively common practice.

Original comment by digitalpebble on 16 Jun 2014 at 8:57

GoogleCodeExporter commented 8 years ago
OK, I'll do it - give me some test cases as a patch, and I'll add the support :)

Original comment by kkrugler...@transpac.com on 24 Jun 2014 at 3:07

GoogleCodeExporter commented 8 years ago
Hi Ken, here is a patch to create a test case. Thanks!

Original comment by digitalpebble on 25 Jun 2014 at 8:36

GoogleCodeExporter commented 8 years ago
Attached patch which implements the functionality + simplifies the code. 
Comments welcome.

Original comment by digitalpebble on 12 Jan 2015 at 12:47

GoogleCodeExporter commented 8 years ago
It is worth mentioning that the latest patch includes the previous one, so 
there is no need to download both (just the latest).

Original comment by avrah...@gmail.com on 12 Jan 2015 at 5:50

GoogleCodeExporter commented 8 years ago
Any objections to committing this?

Original comment by digitalpebble on 22 Jan 2015 at 10:36

GoogleCodeExporter commented 8 years ago
I looked at it and it seems ok, but I am no robots expert 

Original comment by avrah...@gmail.com on 22 Jan 2015 at 10:39

GoogleCodeExporter commented 8 years ago
Committed revision 158.
Thanks!

Original comment by digitalpebble on 22 Jan 2015 at 10:54

GoogleCodeExporter commented 8 years ago
Hi Julien - one quick question. Previously the code had this check:

-                if ((hostname != null) && (hostname.length() > 0)) {

before calling state.addSitemap(sitemap). So is it the case now that you'll 
never have a sitemap URL which doesn't have a real hostname, as that was only 
happening with relative URLs?

Original comment by kkrugler...@transpac.com on 22 Jan 2015 at 2:28

GoogleCodeExporter commented 8 years ago
Hi Ken. Not sure what the previous code was supposed to do. It was first 
checking the hostname with 

String hostname = new URL(sitemap).getHost();

then checking it again with:

hostname = new URI(sitemap).getHost();

which seems completely unnecessary.

> So is it the case now that you'll never have a sitemap URL which doesn't have 
> a real hostname, as that was only happening with relative URLs?

We still check that it has a hostname, so we are completely safe.

See 
[https://code.google.com/p/crawler-commons/source/diff?spec=svn158&r=158&format=side&path=/trunk/src/main/java/crawlercommons/robots/SimpleRobotRulesParser.java]

Original comment by digitalpebble on 22 Jan 2015 at 2:40
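
To make the point about the hostname check concrete, the sketch below parses the directive once, resolves it against the robots.txt URL, and only then checks for a non-empty host. It is an assumption-laden illustration and not the code committed in revision 158; `SitemapCollector`, `addIfValid`, and `getSitemaps` are hypothetical names, and the example URLs are invented.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical collector, not the committed SimpleRobotRulesParser code.
class SitemapCollector {
    private final List<String> sitemaps = new ArrayList<>();

    /**
     * Resolves the directive against the robots.txt URL and stores it only
     * if the result has a non-empty hostname. Returns true if it was stored.
     */
    boolean addIfValid(String robotsTxtUrl, String sitemap) {
        try {
            URL resolved = new URL(new URL(robotsTxtUrl), sitemap);
            String hostname = resolved.getHost();
            // Same condition as the earlier check, applied after a single parse.
            if ((hostname != null) && (hostname.length() > 0)) {
                sitemaps.add(resolved.toExternalForm());
                return true;
            }
        } catch (MalformedURLException e) {
            // Unparseable input: skip the directive.
        }
        return false;
    }

    List<String> getSitemaps() {
        return sitemaps;
    }
}
```

With this shape, a relative directive under an `http://` robots.txt URL inherits that URL's host and is accepted, while a base with no host (for instance a `file:` URL) is still rejected by the hostname check.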