GateNLP / ultimate-sitemap-parser

Ultimate Website Sitemap Parser
https://mediacloud.org/

Detection of sitemap if it's not present in robots.txt #8

Closed: kienli closed this issue 5 years ago

kienli commented 5 years ago

First of all, thanks for this great package. I like it; it works better than any of my own sitemap parser implementations. However, I see features that could be added, and I would rather contribute to this project than invent something else. It would be cool to detect sitemap.xml in other locations, not only via robots.txt. That would increase the chances of finding and parsing a sitemap. Why? Because robots.txt is an optional file, and even when it exists, declaring the sitemap inside it is optional too. But we can guess: sitemap.xml is often placed at https://www.example.com/sitemap.xml
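To make the idea concrete, here's a minimal sketch of the fallback I have in mind, assuming the `requests` library (the helper name and the single fallback path are just illustrative):

```python
from urllib.parse import urljoin

import requests


def find_sitemap_url(homepage_url: str) -> str | None:
    """Illustrative helper: prefer robots.txt, then guess /sitemap.xml."""
    try:
        response = requests.get(urljoin(homepage_url, "/robots.txt"), timeout=10)
        if response.ok:
            # robots.txt may declare one or more "Sitemap: <url>" lines.
            for line in response.text.splitlines():
                if line.lower().startswith("sitemap:"):
                    return line.split(":", 1)[1].strip()
    except requests.RequestException:
        pass  # robots.txt is optional, so a failure here is not fatal

    # Fall back to the conventional location.
    guess = urljoin(homepage_url, "/sitemap.xml")
    try:
        if requests.head(guess, timeout=10, allow_redirects=True).ok:
            return guess
    except requests.RequestException:
        pass
    return None
```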

What do you think?

pypt commented 5 years ago

Hey @kienli, thank you for the kind words!

Interesting idea, although I suspect that most websites do post a link to their sitemap in robots.txt. In addition to /sitemap.xml[.gz], some CMSes possibly serve their sitemaps at very predictable paths (think www.joomla-website.com/index.php?module=sitemap or something like that), so we could take those into account too.

However, I think it would make sense to test a few websites before implementing this feature: e.g., a test script could fetch /robots.txt and /sitemap.xml[.gz] for every URL in a sample list and see how many "shadow" sitemaps it is able to find.

Would you be able to run such a test? I can provide you with a sample of 1,000 / 10,000 / 100,000 news website URLs (we at the Media Cloud project work with those).
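Something like this rough sketch, assuming `requests` and a plain-text file with one homepage URL per line (the filename and counter names are illustrative):

```python
from urllib.parse import urljoin

import requests

with open("sample_urls.txt") as f:  # one homepage URL per line
    urls = [line.strip() for line in f if line.strip()]

no_robots = 0
shadow_sitemaps = 0  # reachable sitemap that robots.txt doesn't declare

for homepage in urls:
    try:
        robots = requests.get(urljoin(homepage, "/robots.txt"), timeout=10)
        has_robots = robots.ok
        declares_sitemap = has_robots and "sitemap:" in robots.text.lower()
    except requests.RequestException:
        has_robots = declares_sitemap = False

    if not has_robots:
        no_robots += 1

    if not declares_sitemap:
        # Probe the conventional locations directly.
        for path in ("/sitemap.xml", "/sitemap.xml.gz"):
            try:
                probe = requests.head(urljoin(homepage, path), timeout=10,
                                      allow_redirects=True)
                if probe.ok:
                    shadow_sitemaps += 1
                    break
            except requests.RequestException:
                continue

print(f"{no_robots}/{len(urls)} sites have no robots.txt")
print(f"{shadow_sitemaps} 'shadow' sitemaps found outside robots.txt")
```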

kienli commented 5 years ago

@pypt good idea! I wanted to do the same: scan a big sample of websites to understand how heavily we can/should rely on robots.txt for big/medium/small websites in our project, and to see the percentage of websites without robots.txt or sitemap.xml.

The idea came from an internal test: ultimate_sitemap_parser failed to scan two of our websites (for different reasons, though). One of the reasons was a missing robots.txt on a site that nevertheless had a big sitemap index at a predictable path.

I can run such a test.

pypt commented 5 years ago

Cool, thanks! So, do you need a list of website URLs for your tests or do you have your own?

kienli commented 5 years ago

I would like to start with your list of 100,000 websites, if possible. How can I get it? My email address is in my profile, or you can post a link here.

kienli commented 5 years ago

I did a small test on 7,335 websites from this dataset; during the first round I fetched robots.txt and searched for a sitemap declaration in it.

If a sitemap was present in robots.txt, I saved its path. If it was not present, during the second round I iterated over the saved paths for the given website, hoping to guess the sitemap location, and saved the result.

Here's what I found so far.

Out of 7,335 websites, 6,014 responded with 200 or 403 on the homepage.

19.7% (1,182 websites out of 6,014) don't have robots.txt; 80.3% (4,832 websites) have it.

Of those with robots.txt, 63% (3,043 out of 4,832) don't declare a sitemap in it; 37% (1,789 out of 4,832) do.

In 60% of those cases (1,061 out of 1,789), the sitemap is located at /sitemap.xml

For the second round, I took the 1,182 websites that don't have robots.txt and tried to guess, based on the sitemap locations collected in the first round.

The script guessed 285 sitemaps out of 1,182, i.e. 24%.

/sitemap.xml was used by 43% of these websites (123 out of 285).

Other variations:

- /sitemap_index.xml: 24 times
- /.sitemap.xml: 9 times
- /sitemap: 8 times
- /admin/config/search/xmlsitemap: 8 times
- /sitemap/sitemap-index.xml: 4 times
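Based on these numbers, a fallback could probe the observed variants in order of frequency. A sketch (the list just mirrors the counts above; it's not exhaustive):

```python
from urllib.parse import urljoin

import requests

# Candidate paths ordered by how often each variant appeared in the test above.
CANDIDATE_PATHS = (
    "/sitemap.xml",
    "/sitemap_index.xml",
    "/.sitemap.xml",
    "/sitemap",
    "/admin/config/search/xmlsitemap",
    "/sitemap/sitemap-index.xml",
)


def guess_sitemap(homepage_url: str) -> str | None:
    """Return the first candidate URL that answers with a success status."""
    for path in CANDIDATE_PATHS:
        url = urljoin(homepage_url, path)
        try:
            if requests.head(url, timeout=10, allow_redirects=True).ok:
                return url
        except requests.RequestException:
            continue
    return None
```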

kienli commented 5 years ago

It makes sense to try /sitemap.xml at least once if the sitemap is not declared in robots.txt or robots.txt is missing. That alone increases the chances of finding the sitemap. I found that some websites have a redirect from /sitemap.xml to the real sitemap location; e.g., the Yoast SEO plugin for WordPress does this.
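So the probe should follow redirects and keep the final URL; with `requests`, GET follows redirects by default and `response.url` reports where the request ended up (example.com as a stand-in):

```python
import requests

# Yoast-style setups redirect /sitemap.xml to the real index, e.g.
# /sitemap_index.xml, so follow redirects and record the final URL.
response = requests.get("https://www.example.com/sitemap.xml", timeout=10)
print(response.url)      # final URL after any redirects
print(response.history)  # the redirect chain, if there was one
```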

pypt commented 5 years ago

Very cool, thank you Alex! Yes, I agree that it's worth blindly trying /sitemap.xml (and other similar paths) even when robots.txt isn't present at all.

pypt commented 5 years ago

@kienli, if by any chance you're a university student, you could consider implementing this task as a GSoC 2019 project:

https://docs.google.com/document/d/1GGbGtFOMS07dog4yzglY5hZCDc41ZQjY1RqRKOlW0B4/edit?usp=sharing

https://cyber.harvard.edu/gsoc/MediaCloud

https://summerofcode.withgoogle.com/organizations/5825827049046016/

kienli commented 5 years ago

@pypt thanks for the suggestion. I'm not a student, but I think all the mentioned ideas are cool, and I hope to contribute to some of them if possible.

pypt commented 5 years ago

Thanks again @kienli for your initial research on the issue. I've added support for trying a few extra URL paths to find sitemaps not published in robots.txt:

https://github.com/berkmancenter/mediacloud-ultimate_sitemap_parser/blob/develop/usp/tree.py#L13-L22

Will release an updated version soon.
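Usage stays the same through the package's documented entry point; the extra paths are tried automatically when robots.txt is missing or doesn't declare a sitemap:

```python
from usp.tree import sitemap_tree_for_homepage

# Builds the sitemap tree for a homepage, now also probing the extra
# well-known paths when robots.txt doesn't point to a sitemap.
tree = sitemap_tree_for_homepage("https://www.example.com/")

for page in tree.all_pages():
    print(page.url)
```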

kienli commented 5 years ago

That's awesome. Thanks a lot!

pypt commented 5 years ago

0.3 released.