GSA / datagov-wptheme

Data.gov WordPress Theme (obsolete)
https://www.data.gov

Update CKAN catalog sitemap.xml to use Sitemap index files #798

Closed philipashlock closed 6 years ago

philipashlock commented 7 years ago

Search engines do not appear to be parsing the new sitemap file created for catalog.data.gov (closed in #769). This is probably because the file exceeds the maximum size for a sitemap. As defined in the specification:

the sitemap file once uncompressed must be no larger than 50MB. If you want to list more than 50,000 URLs, you must create multiple Sitemap files.
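
To see whether a given file trips either limit quoted above, a rough check could look like the sketch below. The function name `check_sitemap_limits` and its parameters are made up for illustration; it assumes the uncompressed sitemap has already been read into memory as bytes.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemap files (sitemaps.org protocol 0.9).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Limits as quoted from the specification above.
MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50MB, uncompressed


def check_sitemap_limits(xml_bytes, max_urls=MAX_URLS, max_bytes=MAX_BYTES):
    """Count <url> entries and measure the uncompressed size of one sitemap.

    Hypothetical helper: returns a small dict so the caller can see
    which limit, if any, is exceeded.
    """
    root = ET.fromstring(xml_bytes)
    url_count = len(root.findall(f"{{{SITEMAP_NS}}}url"))
    size_bytes = len(xml_bytes)
    return {
        "url_count": url_count,
        "size_bytes": size_bytes,
        "within_limits": url_count <= max_urls and size_bytes <= max_bytes,
    }
```

If the file is served gzipped (as `sitemap.xml.gz` is), it would need to be decompressed first, for example with `gzip.decompress`, because the size limit applies to the uncompressed content.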

vasili4 commented 7 years ago

There is also a note after that:

Note: A Sitemap index file can only specify Sitemaps that are found on the same site as the Sitemap index file. For example, http://www.yoursite.com/sitemap_index.xml can include Sitemaps on http://www.yoursite.com but not on http://www.example.com or http://yourhost.yoursite.com. As with Sitemaps, your Sitemap index file must be UTF-8 encoded.

It looks like we're giving a filestore.data.gov URL instead in https://catalog.data.gov/robots.txt:

Sitemap: https://filestore.data.gov/gsa/catalog/sitemap.xml.gz
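
A quick way to spot this kind of mismatch is to pull the `Sitemap:` lines out of robots.txt and compare their hostnames with the site's own host. This is only a sketch: `cross_host_sitemaps` is a hypothetical helper, and it takes the robots.txt content as a string rather than fetching it.

```python
from typing import List
from urllib.parse import urlparse


def cross_host_sitemaps(robots_txt: str, site_host: str) -> List[str]:
    """Return Sitemap URLs in robots.txt whose host differs from site_host."""
    mismatched = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "://" stays intact.
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "sitemap":
            continue
        url = value.strip()
        if urlparse(url).hostname != site_host:
            mismatched.append(url)
    return mismatched


robots = "User-agent: *\nSitemap: https://filestore.data.gov/gsa/catalog/sitemap.xml.gz\n"
print(cross_host_sitemaps(robots, "catalog.data.gov"))
```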

philipashlock commented 7 years ago

@alex-perfilov-reisys Can we add a URL rewrite rule on catalog.data.gov or something to proxy those files over to be served from the same domain?

FuhuXia commented 7 years ago

Thanks for pointing out the limitation on sitemap.xml. I will split it into multiple XML files. I will also provide a sitemap index file so that search engines can find the individual files.
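
The split could be sketched roughly as follows. This is not the actual script used for catalog.data.gov; the function names, the `sitemap-N.xml` file naming, and the `base_url` parameter are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Serialize one <urlset> document containing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


def build_index(sitemap_urls):
    """Serialize a <sitemapindex> document pointing at each child sitemap."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = url
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


def split_into_sitemaps(urls, base_url, size=50_000):
    """Split urls into chunks of at most `size` and build an index for them.

    Returns ({filename: xml_bytes}, index_xml_bytes). File names are
    hypothetical; base_url is wherever the files will be hosted.
    """
    files = {}
    for n, start in enumerate(range(0, len(urls), size), start=1):
        files[f"sitemap-{n}.xml"] = build_sitemap(urls[start:start + size])
    index = build_index([f"{base_url}/{name}" for name in files])
    return files, index
```

Chunking by URL count alone does not guarantee each file stays under the size limit, so a real job would probably also check the byte size of each chunk. `xml_declaration=True` in `ET.tostring` needs Python 3.8 or later.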

Hosting the sitemap on a different subdomain should be fine, as long as we can verify ownership in the Google console. The info alex included is about something different: inside our sitemap, all of our links should point to catalog.data.gov. Links to other domains, such as www.data.gov or inventory.data.gov, are not allowed. That does not apply to our case.

FuhuXia commented 7 years ago

This has been completed. In https://catalog.data.gov/robots.txt the sitemap is defined as https://filestore.data.gov/gsa/catalog/sitemap/sitemap.xml, which is actually a sitemap index file. It contains a list of 50+ sitemap XML files, each with 50K package entries. Let's view it in the Google search console to verify that Google can parse it.
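
Independently of the search console, the index file can be sanity-checked locally. The sketch below assumes the index XML has already been downloaded as bytes; `child_sitemaps` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def child_sitemaps(index_xml: bytes):
    """Return the <loc> of every <sitemap> entry in a sitemap index."""
    root = ET.fromstring(index_xml)
    if root.tag != "{%s}sitemapindex" % NS["sm"]:
        raise ValueError("not a sitemap index: root element is %r" % root.tag)
    return [
        loc.text.strip()
        for loc in root.findall("sm:sitemap/sm:loc", NS)
        if loc.text
    ]
```

Each returned URL could then be fetched and checked against the 50,000-URL limit in turn.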