edgi-govdata-archiving / web-monitoring

Documentation and project-wide issues for the Website Monitoring project (a.k.a. "Scanner")
Creative Commons Attribution Share Alike 4.0 International

Scan sitemap files for pages to monitor? #136

Closed Mr0grog closed 4 years ago

Mr0grog commented 5 years ago

A site’s robots.txt file can list any number of sitemaps, which search engines will generally read and give extra weight to in search results. As we consider updating our URL lists, it might be useful to seed the list with these (or maybe just monitor them and continually update our page list when new things get added).

Not all sites have these (e.g. energy.gov), while others have many (e.g. epa.gov is kind of crazy). A simple example might be https://ferc.gov/robots.txt:

# robots.txt generated at http://www.mcanerin.com
User-agent: *
Disallow: 
Disallow: /cgi-bin/
Sitemap:http://www.ferc.gov/sitemap.xml
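
As a rough sketch of how the `Sitemap:` entries could be pulled out of a file like the one above, something along these lines might work. The function name `sitemap_urls_from_robots` is made up for illustration and is not part of any existing web-monitoring code; it assumes the robots.txt body has already been fetched as text.

```python
def sitemap_urls_from_robots(robots_text):
    """Return the URLs named in `Sitemap:` lines of a robots.txt body.

    Hypothetical helper for illustration only. Field names are matched
    case-insensitively, and the space after the colon is optional (the
    FERC file above omits it).
    """
    urls = []
    for line in robots_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        # (Assumption: sitemap URLs do not contain a literal '#'.)
        line = line.split('#', 1)[0].strip()
        # Split on the first colon only, so the "http://" in the value survives.
        field, sep, value = line.partition(':')
        if sep and field.strip().lower() == 'sitemap':
            value = value.strip()
            if value:
                urls.append(value)
    return urls
```

Run against the FERC example, this should yield a single entry, `http://www.ferc.gov/sitemap.xml`. A site with no `Sitemap:` lines (like energy.gov at the time of writing) would just give an empty list.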

Which leads to the sitemap https://www.ferc.gov/sitemap.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Sitemap File Generated by https://freesitemapgenerator.com/ at Thu, 16 Feb 2017 18:41:09 +0100 -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
                           http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
    <url>
        <loc>http://www.ferc.gov/</loc>
        <lastmod>2017-02-16T18:41:09+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>1.00</priority>
    </url>
    <url>
        <loc>http://www.ferc.gov/CalendarFiles/20170208112439-No%20meeting.pdf</loc>
        <lastmod>1970-01-01T18:41:09+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.31</priority>
    </url>
    <url>
        <loc>http://www.ferc.gov/EventCalendar/EventsList.aspx?View=listview</loc>
        <lastmod>2017-02-16T18:41:09+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.30</priority>
    </url>
    <url>
        <loc>http://www.ferc.gov/contact-us/compliance-help-desk.asp</loc>
        <lastmod>2017-02-16T18:41:09+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.24</priority>
    </url>
    <url>
        <loc>http://www.ferc.gov/whats-new/registration/vegetation-mgt-issues-form.asp</loc>
        <lastmod>2017-02-16T18:41:09+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.24</priority>
    </url>
    <url>
        <loc>http://www.ferc.gov/resources/glossary.asp</loc>
        <lastmod>2017-02-16T18:41:09+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.24</priority>
    </url>
    <url>
        <loc>http://www.ferc.gov/resources/acronyms.asp</loc>
        <lastmod>2017-02-16T18:41:09+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.23</priority>
    </url>
</urlset>

(Note this doesn’t obviate the need for tools like Walk, since these kinds of sitemaps generally don’t list every page, and not all sites have them.)
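
For what it's worth, reading the page URLs out of a sitemap like the FERC one could be sketched roughly as below, using only the Python standard library. The name `pages_from_sitemap` is hypothetical, and the sketch assumes a plain `<urlset>` document that has already been downloaded; it does not follow sitemap index files (`<sitemapindex>`), which sites with many sitemaps may use.

```python
import xml.etree.ElementTree as ET

# Namespace declared on <urlset> in the sitemap protocol.
SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}


def pages_from_sitemap(xml_data):
    """Return a list of (loc, lastmod) tuples from a <urlset> sitemap.

    Hypothetical helper for illustration only. `xml_data` is the raw
    sitemap body; `lastmod` is None when the element is missing.
    """
    root = ET.fromstring(xml_data)
    pages = []
    for url in root.findall('sm:url', SITEMAP_NS):
        loc = url.findtext('sm:loc', default='', namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext('sm:lastmod', default=None, namespaces=SITEMAP_NS)
        if loc:
            pages.append((loc, lastmod))
    return pages
```

The `lastmod` values probably shouldn't be trusted much: in the example above nearly every entry carries the generation timestamp, and one is dated 1970. The `loc` values are likely the only part worth feeding into a URL list.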

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.