clowder-framework / clowder

A data management system that allows users to share, annotate, organize and analyze large collections of datasets. It provides support for extensible metadata annotation using JSON-LD and a distributed analytics event bus for automatic curation of uploaded data.
https://clowderframework.org/
University of Illinois/NCSA Open Source License
34 stars 17 forks

Create sitemap.xml for all the dataset pages #351

Open MBcode opened 2 years ago

MBcode commented 2 years ago

We want the new schema.org metadata from issue #335 to be findable by https://datasetsearch.research.google.com via a sitemap.xml that lists those pages.

The URL for the sitemap could go to a route that generates it on the fly, or it could be built by a cron job and cached (maybe as a cfg option).
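As a rough illustration of the "generate on the fly vs. cache" trade-off, here is a minimal Python sketch. The class name `CachedSitemap`, the `max_age_seconds` setting, and the injected `build` callable are all hypothetical and not part of Clowder; the point is only that a single max-age value could serve as the cfg option, with `0` meaning "always regenerate".

```python
import time
from typing import Callable, Optional


class CachedSitemap:
    """Serve a sitemap from memory, rebuilding it only when the copy is too old.

    Hypothetical helper: `build` is whatever function produces the XML string.
    """

    def __init__(self, build: Callable[[], str], max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._build = build
        self._max_age = max_age_seconds  # 0 means "generate on every request"
        self._clock = clock              # injectable so the behaviour can be tested
        self._body: Optional[str] = None
        self._built_at = 0.0

    def get(self) -> str:
        now = self._clock()
        stale = self._body is None or (now - self._built_at) >= self._max_age
        if stale:
            # Rebuild and remember when we did it.
            self._body = self._build()
            self._built_at = now
        return self._body
```

A cron-style setup would call `get()` (or the build function directly) on a schedule, while the route would just return the cached body.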

Some places have so much to crawl that their main sitemap has links to sub sitemaps of say 1k links each. We will have to allow for this.
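To make the main-sitemap-plus-sub-sitemaps idea concrete, here is a small Python sketch that splits a list of page URLs into files of 1k links and writes an index pointing at them. The file names (`sitemap-1.xml`, ...) and the function names are assumptions for illustration. The element names and namespace come from the sitemaps.org protocol, which caps a single sitemap file at 50,000 URLs, so 1k per file is well inside the limit.

```python
from typing import Dict, List
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XML_HEADER = '<?xml version="1.0" encoding="UTF-8"?>'


def urlset_xml(page_urls: List[str]) -> str:
    """Render one sub-sitemap: a <urlset> with one <url><loc> entry per page."""
    entries = "".join(f"<url><loc>{escape(u)}</loc></url>" for u in page_urls)
    return f'{XML_HEADER}<urlset xmlns="{SITEMAP_NS}">{entries}</urlset>'


def sitemap_index_xml(sitemap_urls: List[str]) -> str:
    """Render the main sitemap: a <sitemapindex> linking to each sub-sitemap."""
    entries = "".join(f"<sitemap><loc>{escape(u)}</loc></sitemap>" for u in sitemap_urls)
    return f'{XML_HEADER}<sitemapindex xmlns="{SITEMAP_NS}">{entries}</sitemapindex>'


def build_sitemaps(page_urls: List[str], base_url: str, per_file: int = 1000) -> Dict[str, str]:
    """Return {file name: XML body} for the index and all sub-sitemaps."""
    base = base_url.rstrip("/")
    files: Dict[str, str] = {}
    sub_urls: List[str] = []
    for start in range(0, len(page_urls), per_file):
        name = f"sitemap-{start // per_file + 1}.xml"  # hypothetical naming scheme
        files[name] = urlset_xml(page_urls[start:start + per_file])
        sub_urls.append(f"{base}/{name}")
    files["sitemap.xml"] = sitemap_index_xml(sub_urls)
    return files
```

For a small instance this still produces an index with a single sub-sitemap, which keeps the crawl entry point the same regardless of size.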

The main starting place is deciding if the whole endpoint should be made findable for a crawl, or some subset (e.g. a space), so this could also end up as a cfg option at some point down the road.

MBcode commented 2 years ago

I wrote a simple Python requests script to create a usable sitemap.xml as a start; we can iterate on it as much as people like. I have notes on how to do this within Scala, but then it can't be reused in v2.
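This is not the script mentioned above, only a guess at what such an external generator could look like. It uses `urllib` from the standard library in place of `requests` so it runs without extra dependencies. It assumes that `GET {base}/api/datasets` returns a JSON list of objects with an `id` field and that dataset pages live at `{base}/datasets/{id}`; both should be checked against the actual instance.

```python
import json
from typing import Callable, List
from urllib.request import urlopen
from xml.sax.saxutils import escape


def fetch_json(url: str):
    """Fetch a URL and decode the JSON body (public, unauthenticated access assumed)."""
    with urlopen(url) as resp:
        return json.load(resp)


def dataset_page_urls(base_url: str, fetch: Callable[[str], list] = fetch_json) -> List[str]:
    """List dataset page URLs for an instance.

    The API path and the 'id' field are assumptions about the listing endpoint.
    """
    base = base_url.rstrip("/")
    datasets = fetch(f"{base}/api/datasets")
    return [f"{base}/datasets/{d['id']}" for d in datasets]


def sitemap_xml(page_urls: List[str]) -> str:
    """Render a flat sitemap with one <url><loc> entry per dataset page."""
    entries = "".join(f"<url><loc>{escape(u)}</loc></url>" for u in page_urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        f"{entries}</urlset>"
    )
```

Because it only talks to the HTTP API, a script along these lines would not be tied to the Scala code base, which is the reuse argument made above.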

MBcode commented 2 years ago

For very large spaces with very large sets of datasets that don't change all that much (e.g. time series), allow for a sitemap at the dataset or space level, set after or near where you make it public.

To have Google and others harvest it, we will have to call the space a schema:Dataset, but we can have another element that shows that it is actually something more like a DataCatalog.

So I will sketch up something like the other mapping here, so we can get further comment on any new elements.
Then we can add a new to_jsonld method to implement this mapping underneath the spaces' HTML pages.
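One possible shape for such a mapping, written in Python purely for illustration (the real mapping would live in the Scala code). The function names and the choice of `additionalType` as the "other element" pointing at DataCatalog are assumptions, not the mapping proposed in this thread; `hasPart` is likewise just one way to link a space to its datasets.

```python
import json
from typing import Dict, List


def space_to_jsonld(name: str, description: str, space_url: str,
                    dataset_urls: List[str]) -> Dict:
    """Describe a space as a schema.org Dataset that hints it is catalog-like."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",  # typed as Dataset so dataset harvesters pick it up
        # Assumed choice of extra element signalling "more like a DataCatalog".
        "additionalType": "https://schema.org/DataCatalog",
        "name": name,
        "description": description,
        "url": space_url,
        "hasPart": [{"@type": "Dataset", "url": u} for u in dataset_urls],
    }


def jsonld_script_tag(doc: Dict) -> str:
    """Wrap the JSON-LD in the script tag that would sit in the space's HTML page."""
    # Escape "</" so a value cannot close the script element early.
    body = json.dumps(doc).replace("</", "<\\/")
    return f'<script type="application/ld+json">{body}</script>'
```

Whether `additionalType` is the right element is exactly the sort of thing that needs comment before it is implemented.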

UX-wise, alternatively we could always list the space, and just have the radio button offer: private, public, public-w/sitemap.

MBcode commented 2 years ago

The starting mapping for spaces looks like this: spaceLD = Json.obj(

MBcode commented 2 years ago

I will consolidate all the linked notes into one summary that we can OK, so I can move forward on more.

MBcode commented 2 years ago

Most of the changes already crept into the last PR (other class to ld+json scripts), except for actually making the sitemap. That has code but no comments on it, so I'm not sure if I have the OK to finish this?

MBcode commented 1 year ago

I have a sitemap.xml branch to try https://github.com/dfabulich/sitemapgen4j, but we could even start with just looping over datasets and putting them within tags.

MBcode commented 1 year ago

I have a route to get the sitemap.xml and a way to check the cached version.

MBcode commented 1 year ago

Decided to make an add-sitemap-route draft PR on this direct branch, vs. the sitemap.xml fork.