GSA / datagov-wptheme

Data.gov WordPress Theme (obsolete)

https://www.data.gov

Implement sitemap xml for CKAN data catalog #769

Closed philipashlock closed 6 years ago

philipashlock commented 7 years ago

We've already done this for WordPress (#242) but not for CKAN, and we've seen evidence that it would significantly improve search engine indexing. We can test with the available CKAN sitemap extension, or explore other options if that's not viable.

FuhuXia commented 7 years ago

This extension is more suitable for small CKAN applications. The /sitemap.xml endpoint dynamically loads all packages from the database into memory and iterates through them to generate a full list in XML.

For catalog.data.gov, this means the sitemap URL will take forever to generate, the CKAN app will get bogged down on every request to the sitemap URL, and the generated sitemap, if it ever succeeds, will be a few hundred MB in size.

kvuppala commented 7 years ago

@philipashlock @FuhuXia Is it feasible to generate the sitemap.xml offline and have it available in the Akamai cache? We also need to make sure we block this URL on the admin catalog, so that the page is not hit accidentally.

FuhuXia commented 7 years ago

I think we can run an equivalent Python script as a nightly cron job and store the XML file in S3.
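
To make the offline approach concrete, here is a minimal sketch of what such a script might look like. It is not the code that was eventually deployed. The function name `build_sitemap`, the `/dataset/<name>` URL pattern, and the idea of feeding it a plain list of dataset names are assumptions for illustration. The S3 upload and the cron schedule are left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(dataset_names, base_url="https://catalog.data.gov/dataset/"):
    """Return sitemap XML as bytes, with one <url> entry per dataset name.

    base_url is an assumed URL pattern for dataset pages, not taken from
    the thread.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for name in dataset_names:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = base_url + name
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Hypothetical usage: a nightly job would collect the names, call
# build_sitemap(names), and write the result to a static location.
xml_bytes = build_sitemap(["example-dataset-one", "example-dataset-two"])
```

Because the file is produced ahead of time, a request for the sitemap never touches the CKAN application or its database.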

JJediny commented 7 years ago

I like the idea of caching the resource. IDK if it makes sense to host on S3 or just on Apache as a static file exposed at catalog.data.gov/sitemap.xml, so we don't over-complicate hosting. It makes sense to run this as a cron job/script in the early morning, after the nightly harvests have run and while traffic is low. So long as it's not a dynamic call, whatever is easier.

We should also take the same approach for generating a consolidated data.json file to test re #315

philipashlock commented 7 years ago

Yes, let's prevent this from being generated dynamically by URL and overwrite the URL (or put a redirect) and point to a cached copy generated by a cron job.

FuhuXia commented 6 years ago

Coding for the sitemap is complete and has been deployed to prod. We assemble the XML from Solr instead of the database, so it is much faster. The generated XML file is compressed and pushed to S3. There is no need to stick to the URL /sitemap.xml: search engines can take any URL as long as it is declared in robots.txt. An S3 URL entry has been added to https://catalog.data.gov/robots.txt.

https://github.com/GSA/ckanext-geodatagov/tree/sitemap-xml-s3
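
As a rough illustration of the compression and robots.txt steps, the sketch below gzips a sitemap payload and checks that a `Sitemap:` line pointing at an arbitrary URL is picked up by a standard robots.txt parser. The bucket URL and the helper names are hypothetical. The real S3 location is the one listed in catalog.data.gov/robots.txt, and the deployed code lives in the branch linked above.

```python
import gzip
import urllib.robotparser

# Hypothetical S3 location; the real bucket URL is not given in this thread.
SITEMAP_URL = "https://example-bucket.s3.amazonaws.com/sitemap.xml.gz"


def compress_sitemap(xml_bytes):
    """Gzip the sitemap so the uploaded file is much smaller."""
    return gzip.compress(xml_bytes)


def robots_txt_with_sitemap(sitemap_url):
    """Return a minimal robots.txt body that declares the sitemap URL."""
    return "User-agent: *\nDisallow:\n\nSitemap: " + sitemap_url + "\n"


def sitemaps_declared(robots_txt):
    """Return the sitemap URLs a crawler would discover in robots.txt."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() (Python 3.8+) returns None when no Sitemap line exists.
    return parser.site_maps() or []


declared = sitemaps_declared(robots_txt_with_sitemap(SITEMAP_URL))
```

The sitemap does not have to live on the same host or path as the catalog, as long as the `Sitemap:` line in robots.txt points at it.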