GSA / data.gov

Main repository for the data.gov service
https://data.gov

Optimize sitemaps and improve integration into Google Search Console #4361

Closed dlennox24 closed 2 months ago

dlennox24 commented 1 year ago

User Story

Currently, Google Search Console reports more pages from our sitemaps than are actually available in the catalog. We need to determine why this is happening and ensure the sitemaps are generated correctly and ingested by Google correctly.

Acceptance Criteria

Sketch

dlennox24 commented 1 year ago

Search Console has detected our changes to the sitemap sizes and begun parsing the larger sitemaps. The sitemap index now lists 38 individual sitemaps, each containing roughly 10k records.
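To cross-check the totals Search Console reports against what the sitemaps actually contain, something like the following could be used. It is a minimal sketch, assuming the index and child sitemaps follow the standard sitemaps.org XML schema; the sample XML and the example.com URLs are hypothetical stand-ins, not the real data.gov sitemaps.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locations(index_xml: str) -> list[str]:
    """Return the <loc> of every child sitemap listed in a sitemap index."""
    root = ET.fromstring(index_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:sitemap/sm:loc", NS)
        if loc.text
    ]


def count_urls(sitemap_xml: str) -> int:
    """Count the <url> entries in a single sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return len(root.findall("sm:url", NS))


# Hypothetical sample documents for illustration only.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-0.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap>
</sitemapindex>"""

SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/dataset/a</loc></url>
  <url><loc>https://example.com/dataset/b</loc></url>
  <url><loc>https://example.com/dataset/c</loc></url>
</urlset>"""

if __name__ == "__main__":
    print(sitemap_locations(SAMPLE_INDEX))
    print(count_urls(SAMPLE_SITEMAP))
```

Fetching each location returned by `sitemap_locations` and summing `count_urls` over the results would give a page total to compare with the Search Console figure.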

Image

dlennox24 commented 1 year ago

Image

Image

Search Console is showing nearly 2.5 million potential pages to be indexed. Data.gov doesn't have this many pages (the sitemap is at ~380k pages). A large portion of these are marked as 404s (~1.1M).

dlennox24 commented 11 months ago

Image

Image

dlennox24 commented 10 months ago

I do not have write access to 18F/dns. @FuhuXia it looks like you have access. Would you be able to add the following to the data.gov.tf file?

https://github.com/18F/dns/blob/main/terraform/data.gov.tf#L503

  records = [
    "621df521f1e44ac69a670f325dc86889",
    "v=spf1 ip4:34.193.244.109 include:gsa.gov ~all",
-   "n6fgn8dyh1hhqsmghskdplss7zp7yt7q"
+   "n6fgn8dyh1hhqsmghskdplss7zp7yt7q",
+   "google-site-verification=K1_M1KkxyZYMiqHHAmlUVcXgYxV6myWSNYAyLrUk_PA"
  ]
}

btylerburton commented 10 months ago

Hi @dlennox24, do you think this will be a one-time update, or would it be worth getting you added to the 18F organization?

dlennox24 commented 10 months ago

Hi @btylerburton! One time. This should give us verification for the whole data.gov domain including all subdomains within Google Search Console.
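Once the DNS change is live, the published TXT records can be checked for the verification token. The sketch below assumes the TXT values have already been collected as strings (for example from the output of `dig +short TXT data.gov`); the helper name `verification_tokens` is hypothetical, and the record values are copied from the diff above.

```python
PREFIX = "google-site-verification="


def verification_tokens(txt_records: list[str]) -> list[str]:
    """Extract Google site-verification tokens from TXT record values."""
    tokens = []
    for record in txt_records:
        # dig prints TXT values wrapped in double quotes; strip them if present.
        value = record.strip().strip('"')
        if value.startswith(PREFIX):
            tokens.append(value[len(PREFIX):])
    return tokens


# TXT values as they appear in the proposed data.gov.tf change.
RECORDS = [
    '"621df521f1e44ac69a670f325dc86889"',
    '"v=spf1 ip4:34.193.244.109 include:gsa.gov ~all"',
    '"n6fgn8dyh1hhqsmghskdplss7zp7yt7q"',
    '"google-site-verification=K1_M1KkxyZYMiqHHAmlUVcXgYxV6myWSNYAyLrUk_PA"',
]

if __name__ == "__main__":
    print(verification_tokens(RECORDS))
```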

btylerburton commented 10 months ago

I can push up a PR today then.

btylerburton commented 10 months ago

Ah yes, I remember now. We have to fork and then PR that into here.

dlennox24 commented 10 months ago

Thanks Tyler! The 18F PR was merged and I verified that the data.gov domain is available in Search Console. I gave the team the same permissions as on the other domain. This will allow for the capture and monitoring of any subdomain on data.gov.

The sitemap ingestion also appears stable, and Google is reporting that all sitemaps have been parsed.

Non-indexed pages are still trending down slightly and indexed pages are trending up. The vast majority of the non-indexed pages are 404s or duplicates (Google considers query URLs to be duplicates, so that is what most of those are; e.g., https://catalog.data.gov/dataset?tags=asthma&_organization_limit=0 is a dup of https://catalog.data.gov/dataset?organization=noaa-gov&_tags_limit=0). I believe the harvest process naturally creates some churn, so there will always be some 404s as datasets are removed or their names/URLs change.
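When reviewing an export of non-indexed URLs, it may help to separate faceted query listings like the two above from individual dataset pages. This is a rough heuristic sketch, not a description of how Google groups duplicates: it assumes listings live at the `/dataset` path with a query string, and the dataset URL used at the end is a made-up example.

```python
from urllib.parse import parse_qs, urlsplit


def is_faceted_listing(url: str) -> bool:
    """Return True for /dataset search listings that carry query parameters."""
    parts = urlsplit(url)
    return parts.path.rstrip("/") == "/dataset" and bool(parse_qs(parts.query))


# The two listing URLs come from the comment above; the third is hypothetical.
URLS = [
    "https://catalog.data.gov/dataset?tags=asthma&_organization_limit=0",
    "https://catalog.data.gov/dataset?organization=noaa-gov&_tags_limit=0",
    "https://catalog.data.gov/dataset/example-dataset-name",
]

if __name__ == "__main__":
    for url in URLS:
        print(is_faceted_listing(url), url)
```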

I believe we can move this ticket to done and move to a monitoring state for Search Console.