Reasons

Not using the Permalink

Agencies sometimes register the constructed URL generated by their Content Management System (CMS), such as Drupal, WordPress, or Django. These systems create constructed URLs dynamically by replacing spaces in the title or document name with a dash `-` or underscore `_`:

Example:

`page-title` or `page_title`

`file_or-MM/DD/YYYY`

A URL built from slugTitle or slugDocumentTitle is the long URL, which can break systematically when the title is modified or the file is replaced, updated, or moved. However, most CMSs also assign some form of unique ID to each page or file created or posted.

Most CMSs also have some concept of a permalink: the URL constructed from the system's unique identifier for the page it created for that new site, effort, blog, etc., where the data is eventually hosted.
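The difference between the two URL styles can be sketched as follows. This is a minimal illustration, not any particular CMS's implementation: `slug_url`, `permalink_url`, the `example.gov` host, the lowercasing step, and the `/node/<id>` pattern are all assumptions made for the example.

```python
import re


def slug_url(base, title, separator="-"):
    """Build a constructed (long) URL by replacing whitespace in the title."""
    # Assumption: the CMS lowercases the title and collapses runs of spaces.
    slug = re.sub(r"\s+", separator, title.strip().lower())
    return f"{base}/{slug}"


def permalink_url(base, page_id):
    """Build a permalink from the page's unique ID (hypothetical URL pattern)."""
    return f"{base}/node/{page_id}"


base = "https://example.gov"

# Renaming the page changes the slug URL, so the registered link breaks.
before = slug_url(base, "Page Title")
after = slug_url(base, "Page Title Updated")
print(before != after)

# The ID-based permalink does not depend on the title at all.
print(permalink_url(base, 1234))
```

Registering the ID-based form means a later title edit does not invalidate the link that was submitted.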

Agencies could "crawl the results" or ask to be directed to an export of their broken links as a CSV file. A simple spreadsheet filter looking for a dash `-` or underscore `_` would then identify the long URLs, and agencies could work with their web teams to swap those for permalinks to avoid a systemic issue long term.
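The same filter could be scripted instead of done in a spreadsheet. The sketch below assumes the export has a column named `url`; that name, the sample rows, and the helper names are hypothetical, since the actual layout of the broken-links CSV is not described here. It checks only the URL path, so that a dash in a hostname is not flagged.

```python
import csv
import io
from urllib.parse import urlsplit


def looks_constructed(url):
    """Return True if the URL path contains a dash or underscore."""
    path = urlsplit(url).path
    return "-" in path or "_" in path


def filter_constructed(csv_text, url_column="url"):
    """Return the rows whose URL looks like a title-based (long) URL."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if looks_constructed(row[url_column])]


# Hypothetical export: column names and rows are made up for illustration.
sample = """url,status
https://example.gov/data/page-title,404
https://example.gov/node/1234,404
https://example.gov/files/file_2016.csv,404
"""

for row in filter_constructed(sample):
    print(row["url"])
```

A dash or underscore is only a heuristic: some permalinks contain them too, so the filtered list is a starting point for the web team rather than a definitive list.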

JJediny commented 8 years ago

Closing/Consolidating related issues to centralize the discussion here...

#470

#468

JJediny commented 8 years ago

Per #470 #468...

We need to investigate how to handle some of the common issues with link creation and updating from harvest sources. There are likely two types of harvest sources, static and dynamic:

Static Sources:

Static sources are sometimes used to publish ISO/FGDC metadata collections and are harvested via Web Accessible Folders (WAF), which CKAN crawls and maps, creating records based on this metadata. Where these records are published, there is an added issue: the actual accessURLs to the data sources are not kept in sync with the published metadata, which can cause broken links and redirects.

Dynamic Harvest Sources:

Dynamic sources are web services or live catalogs themselves, where CKAN harvests a dynamic standard service (e.g. the OGC Catalogue Service for the Web, CSW) or a custom-built API. For these services the issue is not link creation, since any change will be reflected in the latest harvest if harvesting runs daily. Rather, the issue is how, at a system-to-system level, a resource is determined to be unique: that is, how CKAN validates whether a previous entry was changed, as opposed to not recognizing that a previous entry exists, creating a new entry, and orphaning the previous one.
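The uniqueness question can be illustrated with a small reconciliation sketch. It is not how CKAN's harvester actually decides; it assumes each harvested record carries a stable `identifier` field and that the catalog stores a content hash per identifier. Both assumptions, and the function names, are hypothetical.

```python
import hashlib
import json


def content_hash(record):
    """Hash a record's content in a key-order-independent way."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def reconcile(existing, harvested, id_field="identifier"):
    """Compare harvested records against stored {identifier: hash} entries."""
    actions = {"create": [], "update": [], "unchanged": [], "orphaned": []}
    seen = set()
    for record in harvested:
        record_id = record[id_field]
        seen.add(record_id)
        if record_id not in existing:
            actions["create"].append(record_id)
        elif existing[record_id] != content_hash(record):
            # Same identifier, different content: update in place.
            actions["update"].append(record_id)
        else:
            actions["unchanged"].append(record_id)
    # Stored entries that no longer appear in the source.
    actions["orphaned"] = sorted(set(existing) - seen)
    return actions
```

If the source changes a record's identifier between harvests, this approach reports one `create` and one `orphaned` entry, which is exactly the failure mode described above; matching on a stable identifier is what keeps an edit from being treated as a new record.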

Examples of Dynamic Harvest Points

pyCSW
Geonetwork
Geoportal
A WAF that is dynamically updated by a system, NOT a human (e.g. http://warp.nepanode.anl.gov/waf)

rebeccawilliams commented 4 years ago

Does this no longer work? https://catalog.data.gov/report/broken-links

GSA / datagov-wptheme

broken link reporting #697
