Search Console has detected our changes to the sitemap sizes and has begun parsing the larger sitemaps. The sitemap index now contains a total of 38 individual sitemaps, each with roughly 10k records.
Search Console is showing nearly 2.5 million potential pages to be indexed. Data.gov doesn't have this many pages (the sitemap is at ~380k pages). A large portion of these (~1.1M) are marked as 404s.
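For reference on the structure Search Console is parsing: a sitemap index is an XML file that points at the child sitemaps, each of which lists the individual page URLs. A minimal sketch of that shape (the file paths here are hypothetical, not the actual catalog.data.gov layout):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- one <sitemap> entry per child file; our index holds 38 of these -->
  <sitemap>
    <loc>https://catalog.data.gov/sitemap/sitemap-1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://catalog.data.gov/sitemap/sitemap-2.xml</loc>
  </sitemap>
  <!-- ... each child sitemap lists roughly 10k <url> records -->
</sitemapindex>
```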
I do not have write access to 18F/dns. @FuhuXia it looks like you have access. Would you be able to add the following to the data.gov.tf file?
https://github.com/18F/dns/blob/main/terraform/data.gov.tf#L503
```diff
 records = [
   "621df521f1e44ac69a670f325dc86889",
   "v=spf1 ip4:34.193.244.109 include:gsa.gov ~all",
-  "n6fgn8dyh1hhqsmghskdplss7zp7yt7q"
+  "n6fgn8dyh1hhqsmghskdplss7zp7yt7q",
+  "google-site-verification=K1_M1KkxyZYMiqHHAmlUVcXgYxV6myWSNYAyLrUk_PA"
 ]
}
```
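For context on where that list lives: below is a rough sketch of the kind of `aws_route53_record` resource the diff is editing. This is illustrative only; the 18F/dns repo may wrap records in its own module, and the resource name, zone reference, and TTL here are assumptions.

```hcl
# Sketch only: the real 18F/dns config may use a module wrapper.
# The resource name, zone_id reference, and ttl are assumptions.
resource "aws_route53_record" "datagov_txt" {
  zone_id = aws_route53_zone.datagov.zone_id
  name    = "data.gov"
  type    = "TXT"
  ttl     = 300

  # Multiple TXT values live in one record set; Google checks for the
  # google-site-verification value when verifying the domain property.
  records = [
    "621df521f1e44ac69a670f325dc86889",
    "v=spf1 ip4:34.193.244.109 include:gsa.gov ~all",
    "n6fgn8dyh1hhqsmghskdplss7zp7yt7q",
    "google-site-verification=K1_M1KkxyZYMiqHHAmlUVcXgYxV6myWSNYAyLrUk_PA",
  ]
}
```

Once the change is applied, the new value can be confirmed from the command line with `dig TXT data.gov +short` before retrying verification in Search Console.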
Hi @dlennox24, do you think this will be a one-time update, or would it be worth getting you added to the 18F organization?
Hi @btylerburton! One time. This should give us verification for the whole data.gov domain including all subdomains within Google Search Console.
I can push up a PR today then.
Ah yes, I remember now. We have to fork and then PR that into here.
Thanks Tyler! The 18F PR was merged and I verified that the data.gov domain is available in the Search Console. I granted the team the same permissions that were set on the other domain property. This will allow us to capture and monitor any subdomain on data.gov.
Sitemap ingestion also appears stable, and Google is reporting that all sitemaps have been parsed.
Non-indexed pages are still trending slightly down and indexed pages are trending up. The vast majority of the non-indexed pages are 404s or duplicates (Google considers faceted query URLs to be duplicates, so that is what most of these are, e.g. https://catalog.data.gov/dataset?tags=asthma&_organization_limit=0 is a dup of https://catalog.data.gov/dataset?organization=noaa-gov&_tags_limit=0). I believe the harvest process naturally creates some churn, so there will always be some 404s as datasets are removed or their names/URLs change.
I believe we can move this ticket to done and shift to a monitoring state for Search Console.
User Story
Currently, sitemaps in Google Search Console are returning more pages than are actually available in the catalog. We need to determine why this is happening and ensure that sitemaps are being generated correctly and being ingested by Google correctly.
Acceptance Criteria
Sketch