Make it easier for users to determine package id of a record

ageara commented 8 years ago

Issue: As a data catalogue user, I want to easily discover the immutable, system-generated Package ID, e.g. http://catalogue.data.gov.bc.ca/dataset/42f7ca99-e7f3-40f7-93d7-f2500cccc315

so that I can use it to embed or bookmark a record in a way that will not be broken by alteration of the record's Semantic URL (SLUG), e.g. http://catalogue.data.gov.bc.ca/Dataset/bc-data-catalogue-content

Proposed solution: Provide a hyperlink to the Package ID, as a permalink, in the package page's header.module-content.page-header element, so that users can easily find and reference the Package ID.

davidread commented 8 years ago

I can see why this is suggested. However IDs are unhelpful for humans. I'd rather people didn't pass around URLs which are gobbledegook - they should use the readable one in their browser bar.

There's some W3C best practice on URIs:

URIs should be stable and reliable in order to maximize the possibilities of reuse that Linked Data brings to users. There must be a balance between making URIs readable and keeping them more stable by removing descriptive information that will likely change.

Going along with that, in the unlikely event that a dataset changes name (slug), a 301 redirect is put in from the old name.

Distributing links based on the package ID also breaks another principle: it uses a database primary key for a secondary purpose. It just stores up problems for the future.

ageara commented 8 years ago

@davidread unfortunately we have had more than a few instances where datasets have changed names - most commonly because they had originally included a date in a dataset name - and have removed it as they built the dataset into a time series. This presented issues where the SLUG was embedded in other systems - so not actually humans trafficking in gobbledygook directly ;).

Is there any formal system within CKAN for managing 301s - or are people just managing these ad hoc within their reverse proxies?

How do maintainers watch for SLUG changes - or is this done on request?

davidread commented 8 years ago

Yes, we occasionally manually put in a 301 at the server level.

I'd very much like to see this developed into CKAN functionality. I'd appreciate it if you were to contribute to the ideas on this here: https://github.com/ckan/ideas-and-roadmap/issues/43 as so far there's been little interest. And if you want to write something too, that would be even better.

mdunhamwilkie commented 5 years ago

I have a follow-on question for the one raised by @ageara in 2015 in this use case.

Issue: As a data catalogue user, I want to easily discover the immutable, system generated Package ID *e.g. http://catalogue.data.gov.bc.ca/dataset/42f7ca99-e7f3-40f7-93d7-f2500cccc315

so that I can use it to embed or bookmark a record - that will not be broken by alteration of the record's Semantic URL (SLUG) *e.g. http://catalogue.data.gov.bc.ca/Dataset/bc-data-catalogue-content

Proposed Solution: I've noticed that some web sites combine both a permalink and a slug in the same URL. See for example https://discuss.newrelic.com/t/synthetics-for-siebel-open-ui/59566. I'm assuming that the site is using wildcard redirects, as the SLUG can be replaced by anything at all and the URL will still redirect to the same page, e.g. try https://discuss.newrelic.com/t/ANYTHING-AT-ALL/59566.

So it appears that with this strategy we will be able to have

an immutable URL (permalink), not influenced by changes in package/dataset names
an SEO-friendly URL
a reader-friendly URL
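
To make the combined slug-plus-ID idea concrete, here is a minimal sketch in Python. It is not how any of the sites mentioned implement it: the `/dataset/<slug>/<id>` layout, the `DATASETS` table and the `resolve` function are all hypothetical, and pairing the example ID with the example slug from the original issue is only an assumption for illustration. Only the ID is used for lookup; the slug is decorative, and a stale or made-up slug is answered with a 301 to the canonical URL.

```python
import re
from typing import Optional, Tuple

# Hypothetical in-memory catalogue: immutable ID -> current slug.
# The pairing of this ID with this slug is assumed for illustration.
DATASETS = {
    "42f7ca99-e7f3-40f7-93d7-f2500cccc315": "bc-data-catalogue-content",
}

# Hypothetical URL layout: /dataset/<slug>/<id>
PATH_RE = re.compile(r"^/dataset/(?P<slug>[^/]+)/(?P<id>[^/]+)/?$")


def resolve(path: str) -> Tuple[int, Optional[str]]:
    """Return (status, location) for a request path.

    The lookup uses only the ID segment; the slug segment is ignored,
    so any slug resolves as long as the ID is known.
    """
    match = PATH_RE.match(path)
    if match is None:
        return 404, None
    dataset_id = match.group("id")
    current_slug = DATASETS.get(dataset_id)
    if current_slug is None:
        return 404, None
    canonical = f"/dataset/{current_slug}/{dataset_id}"
    if path.rstrip("/") != canonical:
        # Stale or arbitrary slug: redirect permanently to the canonical URL.
        return 301, canonical
    return 200, canonical
```

With this layout, renaming a dataset only changes the value stored in `DATASETS`; every previously shared URL still carries the ID and keeps resolving.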

Does this seem like a more acceptable approach? Do you know of other CKAN sites that have done something similar, or better?

fyi @dkelsey

metaodi commented 5 years ago

Generally I like your idea @mdunhamwilkie of separating the readability/SEO aspect from the unique identifier.

We chose a different approach on https://opendata.swiss: since all of the data published on the portal has some sort of source system, we force the data publishers to provide a unique identifier (which is validated by a custom validator in CKAN).

Then we use this identifier to create a permalink. E.g. unique identifier of https://opendata.swiss/en/dataset/register-ursprungsbezeichnungen-gub-und-geografische-angaben-gga-pflanzliche-produkte is 199c9078-9da7-4a77-b811-b342adc2a116@bundesamt-fur-landwirtschaft-blw and then the permalink is https://opendata.swiss/en/perma/199c9078-9da7-4a77-b811-b342adc2a116%40bundesamt-fur-landwirtschaft-blw
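
As a rough illustration of how such a permalink can be derived from the identifier alone, the sketch below percent-encodes the identifier and appends it to the `/perma/` prefix. It only reproduces the URL shape shown above; the function names are hypothetical and this is not taken from the opendata.swiss code base.

```python
from urllib.parse import quote, unquote

# Prefix taken from the example permalink above.
PERMA_BASE = "https://opendata.swiss/en/perma/"


def build_permalink(identifier: str) -> str:
    """Percent-encode the identifier (so '@' becomes '%40') and append it."""
    return PERMA_BASE + quote(identifier, safe="")


def identifier_from_permalink(url: str) -> str:
    """Inverse of build_permalink: recover the publisher's identifier."""
    if not url.startswith(PERMA_BASE):
        raise ValueError("not a permalink URL")
    return unquote(url[len(PERMA_BASE):])
```

Because the permalink depends only on the publisher's identifier, a publisher can compute it before the dataset exists on the portal.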

There are several reasons to use an identifier from outside CKAN rather than rely on the CKAN internal package_id:

if you use harvesters (which we do), you end up with new package_id for the same dataset, if you clear a harvest source
as a data publisher you don't need CKAN to know the permalink. This is super valuable for data publishers that want to link to datasets on their own website (e.g. prepare a press release before the dataset is even on the portal)
if you want to change the identifier (for whatever reason), you can do it easily

Note: our identifiers in the format "xyz@organisation-slug" are really bad. I wish we would simply use GUIDs for this purpose, but we're not there yet. But the basic principle above applies nonetheless.

davidread commented 5 years ago

I'm not warm to putting IDs in URLs - it's off-putting to users and it's unclear to them what the ID means - all to take care of an edge case. Let's design the best experience for the main case. I still prefer the idea of squirrelling away old dataset names and redirecting from them. The real use case is not getting a permalink, it is that dataset URLs don't break over time, and the widely preferred way to achieve that is to use redirects.
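
A minimal sketch of that "remember old names and redirect" idea, in plain Python rather than CKAN code. The `SlugRegistry` class, its method names and the example slugs are all hypothetical; a real implementation would persist the history in the database and hook into CKAN's routing.

```python
from typing import Dict, Optional, Tuple


class SlugRegistry:
    """Keeps every slug a dataset has ever had, so old URLs can redirect."""

    def __init__(self) -> None:
        self._current: Dict[str, str] = {}  # dataset id -> current slug
        self._owner: Dict[str, str] = {}    # any slug ever used -> dataset id

    def set_slug(self, dataset_id: str, slug: str) -> None:
        """Create a dataset or rename it; earlier slugs stay reserved."""
        owner = self._owner.get(slug)
        if owner is not None and owner != dataset_id:
            raise ValueError(f"slug {slug!r} already belongs to another dataset")
        self._owner[slug] = dataset_id
        self._current[dataset_id] = slug

    def resolve(self, slug: str) -> Tuple[int, Optional[str]]:
        """Return (status, location) for a request to /dataset/<slug>."""
        dataset_id = self._owner.get(slug)
        if dataset_id is None:
            return 404, None
        current = self._current[dataset_id]
        location = f"/dataset/{current}"
        # An old name answers with a permanent redirect to the current one.
        return (200, location) if current == slug else (301, location)
```

For the dated-name case described earlier in the thread, a dataset first published as `air-quality-2015` and later renamed to `air-quality` would keep answering on the old URL with a 301.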

However, if you're not up for that coding work, CKAN master (forthcoming in 2.9) has a feature relevant to this: https://github.com/ckan/ckan/pull/4317 ("Redirect /dataset/[id] -> /dataset/[name]").

So it would be easy to add a permalink /dataset/[id] to your template. You could backport the PR to your CKAN version pretty easily.
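
As a sketch of what that permalink amounts to: a CKAN package dict carries both `id` and `name`, and the `package_show` action of the Action API accepts either one, so the ID-based URL can be built from a dataset's name. The helper names below are hypothetical, and whether `/dataset/[id]` redirects to `/dataset/[name]` depends on the PR above being present in your CKAN version.

```python
from typing import Any, Mapping
from urllib.parse import urlencode


def package_show_url(site_url: str, name_or_id: str) -> str:
    """URL of the Action API call that returns the package dict."""
    query = urlencode({"id": name_or_id})
    return f"{site_url.rstrip('/')}/api/3/action/package_show?{query}"


def permalink_for(site_url: str, package: Mapping[str, Any]) -> str:
    """Build the ID-based dataset URL from a package dict."""
    return f"{site_url.rstrip('/')}/dataset/{package['id']}"
```

In a template, the equivalent is a link whose target is built from the package's `id` instead of its `name`.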

I like @metaodi's idea of using source IDs as the package ID when harvesting. (Where an ID is not available, or you create a dataset in the form, CKAN can still make one up.)

Make it easier for users to determine package id of a record #161