Closed philipashlock closed 6 years ago
There is also a note after that:
Note: A Sitemap index file can only specify Sitemaps that are found on the same site as the Sitemap index file. For example, http://www.yoursite.com/sitemap_index.xml can include Sitemaps on http://www.yoursite.com but not on http://www.example.com or http://yourhost.yoursite.com. As with Sitemaps, your Sitemap index file must be UTF-8 encoded.
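A quick way to see what that note implies is to compare hostnames. The sketch below is only an illustration of the quoted rule, not part of any data.gov tooling; the helper name `same_site` is made up, and it assumes "same site" means an exact hostname match, as the note's examples suggest.

```python
from urllib.parse import urlsplit


def same_site(index_url, sitemap_url):
    """Return True when both URLs share exactly the same hostname.

    Hypothetical helper: a simplified reading of the quoted note, which
    treats a different subdomain as a different site.
    """
    return urlsplit(index_url).hostname == urlsplit(sitemap_url).hostname


# Examples taken from the note itself.
index = "http://www.yoursite.com/sitemap_index.xml"
print(same_site(index, "http://www.yoursite.com/sitemap1.xml"))
print(same_site(index, "http://yourhost.yoursite.com/sitemap1.xml"))
```

Under this reading, a robots.txt on catalog.data.gov that points at filestore.data.gov would count as a different site.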
Looks like we're giving a filestore.data.gov URL instead in https://catalog.data.gov/robots.txt:
Sitemap: https://filestore.data.gov/gsa/catalog/sitemap.xml.gz
@alex-perfilov-reisys Can we add a URL rewrite rule on catalog.data.gov or something to proxy those files over to be served from the same domain?
Thanks for pointing out the limitation on sitemap.xml. I will split it into multiple XML files. I will also provide a sitemap index file so search engines can find the multiple files.
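For reference, the split could look roughly like the sketch below. It is not the actual catalog.data.gov implementation: the function names are hypothetical, and it assumes the 50,000-URL per-file limit from the sitemaps.org protocol and builds the XML in memory rather than writing files.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit in the sitemaps.org protocol


def chunk(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into consecutive lists of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap(urls):
    """Return one <urlset> document (as a string) listing the given page URLs."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(root, "url"), "loc").text = url
    return ET.tostring(root, encoding="unicode")


def build_index(sitemap_urls):
    """Return a <sitemapindex> document (as a string) listing the sitemap files."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        ET.SubElement(ET.SubElement(root, "sitemap"), "loc").text = url
    return ET.tostring(root, encoding="unicode")
```

Each chunk would be passed to `build_sitemap` and saved as its own file, and the URLs of those files would then be passed to `build_index`. The files still need to be served as UTF-8, which the sketch leaves to whatever writes them out.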
Hosting the sitemap on a different subdomain should be fine, as long as we can verify ownership in the Google console. The info alex included is about a different thing. Inside our sitemap, all of our links should point to catalog.data.gov. Pointing to links on other domains, such as www.data.gov or inventory.data.gov, is not allowed. That does not apply to our case.
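If it helps, the "all links point to catalog.data.gov" condition can be checked mechanically. This is a minimal sketch with a hypothetical helper name, assuming the sitemap is a standard `<urlset>` document in the sitemaps.org namespace that has already been downloaded and decompressed.

```python
from urllib.parse import urlsplit
from xml.etree import ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def off_host_locs(sitemap_xml, expected_host="catalog.data.gov"):
    """Return every <loc> URL in the sitemap whose hostname is not expected_host."""
    root = ET.fromstring(sitemap_xml)
    locs = [el.text.strip() for el in root.findall("sm:url/sm:loc", NS) if el.text]
    return [u for u in locs if urlsplit(u).hostname != expected_host]
```

An empty list means every entry in that file is on the expected host.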
This has been completed. In https://catalog.data.gov/robots.txt the sitemap is defined as https://filestore.data.gov/gsa/catalog/sitemap/sitemap.xml, which is actually a sitemap index file. It contains a list of 50+ sitemap XML files, each with 50K package entries. Let us view it in the Google Search Console to verify that Google can parse it.
Search engines do not appear to be parsing the new sitemap file created for catalog.data.gov (closed in #769). This is probably because the file exceeds the maximum size for a sitemap. As defined in the specification: