Open rebeccawilliams opened 8 years ago
Agencies sometimes register the constructed URL generated by their Content Management System (CMS): Drupal, WordPress, Django, etc. These systems create constructed URLs dynamically by replacing the spaces in the page title or document name with a dash (-) or an underscore (_).
Example: page-title or page_title, or file_or-MM/DD/YYYY
A URL built from slugTitle or slugDocumentTitle is the long URL, which can break systematically when the title is modified or the file is replaced, updated, or moved. However, most CMSs also have some form of unique ID for each page or file created or posted, and some concept of a permalink: a URL constructed from that unique identifier for the page the system created for the new site, effort, blog, etc. where the data is eventually hosted.
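To make the contrast concrete, here is a minimal sketch in Python. The `slugify` helper, the `example.gov` host, and the `/node/<id>` permalink pattern are illustrative assumptions, not the behavior of any particular CMS. The point is only that a title-derived URL changes when the title changes, while an ID-derived URL does not.

```python
import re

BASE = "https://example.gov"  # hypothetical host, for illustration only


def slugify(title: str, sep: str = "-") -> str:
    """Lowercase the title and replace runs of non-alphanumerics with sep."""
    slug = re.sub(r"[^a-z0-9]+", sep, title.lower())
    return slug.strip(sep)


def slug_url(title: str, sep: str = "-") -> str:
    """Constructed (long) URL derived from the page title."""
    return f"{BASE}/{slugify(title, sep)}"


def permalink(page_id: int) -> str:
    """Assumed permalink pattern built from the CMS's unique page ID."""
    return f"{BASE}/node/{page_id}"


page_id = 1234
old_title = "Page Title"
new_title = "Page Title Updated"

print(slug_url(old_title))       # https://example.gov/page-title
print(slug_url(old_title, "_"))  # https://example.gov/page_title
print(slug_url(new_title))       # https://example.gov/page-title-updated
print(permalink(page_id))        # https://example.gov/node/1234
```

Renaming the page changes the slug URL, so any link registered against the old slug breaks unless the CMS adds a redirect. The permalink depends only on `page_id` and survives the rename.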
Agencies could "crawl the results" or ask to be directed to an export of their broken links as a CSV file. A simple spreadsheet filter looking for a dash (-) or an underscore (_) would then flag the long URLs, and agencies could work with their web teams to swap those out for permalinks to avoid a systemic issue long term...
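The spreadsheet filter could also be scripted. A rough sketch, assuming the export is a CSV with a column named `url`. The actual column names of the broken-links export are not given here, so treat the column name and the sample rows as hypothetical. It checks only the URL path, since hostnames often contain dashes themselves.

```python
import csv
import io
from urllib.parse import urlparse


def looks_like_slug(url: str) -> bool:
    """True if the URL path contains a dash or an underscore."""
    path = urlparse(url).path
    return "-" in path or "_" in path


def flag_slug_urls(csv_file, url_column: str = "url"):
    """Yield rows whose URL looks like a title-derived (constructed) URL."""
    for row in csv.DictReader(csv_file):
        if looks_like_slug(row.get(url_column) or ""):
            yield row


# Hypothetical sample data standing in for an agency's broken-link export.
sample = io.StringIO(
    "dataset,url\n"
    "Budget 2015,http://example.gov/budget-report-2015\n"
    "Permits,http://example.gov/node/1234\n"
    "Roads,http://example.gov/files/road_data.csv\n"
)

flagged = list(flag_slug_urls(sample))
for row in flagged:
    print(row["dataset"], row["url"])
```

This is a heuristic, not proof that a URL is a slug: it only narrows the list that the web team needs to review.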
Closing/Consolidating related issues to centralize the discussion here...
Per #470 #468...
Need to investigate how to handle some of the common issues with link creation and updating from harvest sources. There are likely two types of harvest sources, static and dynamic:
Static sources are sometimes used to publish ISO/FGDC metadata collections and are harvested via Web Accessible Folders (WAF), which CKAN crawls and maps, creating records based on this metadata. Where these records are published there is an added issue: the actual accessURLs to the data sources are not in sync with the published metadata, which can cause broken links and redirects.
Dynamic sources are web services or live catalogs themselves, where CKAN is harvesting a dynamic standard service (e.g. Web Catalog Service) or a custom-created API. For these services the issue is not in the link creation, as any change will be reflected in the latest harvest if done daily... Rather, the issue is how, at a system-to-system level, the resource is determined to be unique (i.e. how CKAN validates whether a previous entry was changed, versus not recognizing that there is a previous entry and instead creating a new entry and orphaning the previous one).
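The uniqueness question can be illustrated with a small sketch. This is not CKAN's harvester logic. It is a hypothetical reconciliation step that assumes each harvested record carries a stable `identifier` field and compares it against what the catalog already holds. When the source regenerates identifiers between harvests, the same function shows how a new entry gets created and the old one orphaned.

```python
def reconcile(existing: dict, harvested: list) -> dict:
    """Classify harvested records against the catalog, keyed by identifier.

    existing:  {identifier: record} currently in the catalog
    harvested: list of records from the latest harvest
    """
    result = {"created": [], "updated": [], "unchanged": [], "orphaned": []}
    seen = set()
    for record in harvested:
        ident = record["identifier"]
        seen.add(ident)
        if ident not in existing:
            result["created"].append(ident)
        elif existing[ident] != record:
            result["updated"].append(ident)
        else:
            result["unchanged"].append(ident)
    # Anything in the catalog that the source no longer reports is orphaned.
    result["orphaned"] = [i for i in existing if i not in seen]
    return result


# Hypothetical catalog contents and harvest results.
catalog = {
    "abc-1": {"identifier": "abc-1", "accessURL": "http://example.gov/a"},
    "abc-2": {"identifier": "abc-2", "accessURL": "http://example.gov/b"},
}

# Same identifier, new accessURL: recognized as an update.
stable = [
    {"identifier": "abc-1", "accessURL": "http://example.gov/a-moved"},
    {"identifier": "abc-2", "accessURL": "http://example.gov/b"},
]
stable_result = reconcile(catalog, stable)
print(stable_result)

# Identifier regenerated by the source: a new entry plus an orphan.
regenerated = [
    {"identifier": "xyz-9", "accessURL": "http://example.gov/a"},
    {"identifier": "abc-2", "accessURL": "http://example.gov/b"},
]
regenerated_result = reconcile(catalog, regenerated)
print(regenerated_result)
```

In the first harvest the changed accessURL is picked up as an update to the existing entry. In the second, the identical dataset arrives under a new identifier, so it is treated as a new record and `abc-1` is left orphaned, which is the failure mode described above.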
Examples of Dynamic Harvest Points
Does this no longer work? https://catalog.data.gov/report/broken-links
At: http://catalog.data.gov/report/broken-links
Can the following be defined on the Data.gov broken link page?