This ticket will probably stay outside of milestones for a while, unless we see a straight path ahead to closing it. The previous ticket of this nature was turned into an action ticket (#60), so this one is now the place to gather ideas and turn them into separate action tickets.
1. Introduction: places to look for sanity checks
Our sanity-checking functionality is currently somewhat distributed (and, in a way, that happened by design):
(1) the main sanity-checker page (small at the time of writing, but it has already been useful)
(2) list of missing formats (IDs mentioned in the recommendations for which no format-description file exists yet), under "Data Deposition Formats"
(3) list of existing format-description files that are not mentioned by any recommendation -- also under "Data Deposition Formats"
(4) list of format-description files that don't mention any file extensions -- under "File Extensions"
(5) list of format-description files that don't mention any media types -- under "Media Types"
(6) indirectly, the Statistics page, which is meant mainly for aggregating and visualizing the data content but might also point to some local insanities ;-), especially because, for now, it contains some meta-statistics that tell us more about the content of the SIS than about the individual centres and formats. Note also that this page has a dedicated discussion issue (#67). Maybe it should later get split into something like "SIS statistics" and "Data Visualization", but let's ignore that in the present issue.
This very ticket is meant for the content of the sanity-checker page.
2. Sanity-checker page
This page should eventually have structured logic and probably repeat some of the distributed information (which, given the modular structure of the SIS, should be trivial).
2.1. What it contains
#  | Context             | Target                     | Check
1. | recommendation list | domain name                | check if set, check if valid
2. | recommendation list | recommendation level       | check if set, check if valid
3. | recommendation list | format ID + domain + level | check if repeated/'similar'
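For illustration only, the three checks listed above could be sketched roughly as follows. This is not the SIS implementation: the row layout (a dict with `format`, `domain` and `level` keys), the two vocabularies and the function name are all assumptions made up for this example, and only exact repetition is detected, not 'similar' entries.

```python
from collections import Counter

# Hypothetical controlled vocabularies; the real values live in the SIS data.
VALID_DOMAINS = {"Text", "Audio"}
VALID_LEVELS = {"recommended", "acceptable", "discouraged"}


def check_recommendations(rows):
    """Return (row_number, message) findings for one recommendation list.

    Each row is assumed to be a dict with 'format', 'domain' and 'level' keys.
    """
    findings = []
    # Checks 1 and 2: domain name and recommendation level are set and valid.
    for number, row in enumerate(rows, start=1):
        for field, valid in (("domain", VALID_DOMAINS), ("level", VALID_LEVELS)):
            value = (row.get(field) or "").strip()
            if not value:
                findings.append((number, f"{field} is not set"))
            elif value not in valid:
                findings.append((number, f"{field} is not valid: {value!r}"))
    # Check 3: the same format ID + domain + level occurs more than once.
    keys = Counter((r.get("format"), r.get("domain"), r.get("level")) for r in rows)
    for key, count in keys.items():
        if count > 1:
            findings.append((None, f"repeated {count} times: {key}"))
    return findings
```

Findings that concern the whole list rather than a single row carry `None` as their row number, so a page could list them separately from the per-row problems.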
2.2. What else it might contain (and/or how it can get arranged)
There are three main hubs of information that may either get edited directly or get 'spoiled' in some way by an external change:
format descriptions under data/formats/
recommendations under data/recommendations/
centre descriptions in data/centres.xml
We thus get three targets for sanity checks, and one should bear in mind that the middle one, recommendations, can be checked for internal coherence but also for whether the associations that it makes (between centres and formats and properties defined inside recommendations) are coherent.
2.2.1. Context: formats
items (4) and (5) from section 1 above (extensions, media types)
format families (this is a separate can of worms that requires at least one separate ticket)
stuff someone might forget to change after using one format description as a template for another:
repeated IDs,
repeated names, abbreviations
extId pointers?
check for similarity of format IDs across the entire set of formats (capitalization, hyphenation, etc.; perhaps also partial matching?)
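A minimal sketch of the repeated-ID and ID-similarity ideas above, assuming the format IDs have already been collected into a plain list of strings. The function names and the normalisation rule (lower-casing and dropping hyphens, underscores, dots and whitespace) are assumptions for illustration, not something the SIS defines.

```python
import re
from collections import Counter, defaultdict


def repeated(values):
    """Return the values that occur more than once (exact repetition)."""
    counts = Counter(values)
    return sorted(value for value, count in counts.items() if count > 1)


def normalise(format_id):
    """Lower-case the ID and drop hyphens, underscores, dots and whitespace."""
    return re.sub(r"[-_.\s]", "", format_id).lower()


def similar_ids(format_ids):
    """Group distinct IDs that collapse to the same normalised form."""
    groups = defaultdict(set)
    for format_id in format_ids:
        groups[normalise(format_id)].add(format_id)
    # Only groups with at least two different spellings are suspicious.
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)
```

The same two helpers would work for names and abbreviations, and for the format IDs used in the recommendations. Partial matching is left out here; the standard library's `difflib` would be one candidate for it.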
2.2.2. Context: recommendations and the associations that they create
Internally to the list of recommendations
much of that is already handled in section 2.1 above (and implemented as of Feb 2022)
2.1. also includes item (3), which is about ascribing properties to formats (or rather to format IDs)
new: check for similarity of format IDs across the entire set of recommendations (capitalization, hyphenation, etc.)
Associations defined by the recommendations
the association with centres is not yet handled (perhaps someone uses a recommendation file for one centre as a template for populating another one?)
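One possible heuristic for the template scenario mentioned above is to flag centres whose recommendation sets are exactly identical. The sketch below assumes the recommendations have been loaded into a mapping from a centre ID to (format ID, domain, level) tuples; that layout and the function name are hypothetical, and identical sets may of course be perfectly legitimate, so the result is a hint for a human rather than an error.

```python
from collections import defaultdict


def identical_recommendation_sets(recommendations):
    """Group centres whose recommendation sets are exactly the same.

    `recommendations` is assumed to map a centre ID to an iterable of
    (format_id, domain, level) tuples.
    """
    by_content = defaultdict(list)
    for centre, rows in recommendations.items():
        by_content[frozenset(rows)].append(centre)
    # Empty sets are skipped: several centres without recommendations
    # are not evidence of a copied template.
    return sorted(
        sorted(centres)
        for content, centres in by_content.items()
        if content and len(centres) > 1
    )
```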
2.2.3. Context: centres
centres are associated with the CLARIN database via links -- should we check if these links are live?
we might want to check for repeated centres or repeated links in different centre elements
we might want to diagnose (somehow) the RI status of a centre
within CLARIN, we might want to check the sanity of status indicators, although part of that should be handled by the schema (separate ticket!)
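The repeated-links idea above could be sketched like this. The element and attribute names (`centre`, `id`, `link`) are purely hypothetical, since the actual layout of data/centres.xml is not described here, and the sample document is made up.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up sample; the real data/centres.xml may use other element names.
SAMPLE = """<centres>
  <centre id="A"><link>https://example.org/centre/1</link></centre>
  <centre id="B"><link>https://example.org/centre/2</link></centre>
  <centre id="C"><link>https://example.org/centre/1</link></centre>
</centres>"""


def repeated_links(xml_text):
    """Return {link: [centre IDs]} for links used by more than one centre."""
    root = ET.fromstring(xml_text)
    owners = defaultdict(list)
    for centre in root.iter("centre"):
        for link in centre.iter("link"):
            url = (link.text or "").strip()
            if url:
                owners[url].append(centre.get("id"))
    return {url: ids for url, ids in owners.items() if len(ids) > 1}
```

Checking whether the links are live would need network access and is deliberately left out of this sketch.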
Please kindly add ideas / comments below.
Please don't assign this ticket, and it probably doesn't make sense to put it into a milestone either.
Other, real action tickets should mention this one if they concern implementing some of the above ideas.
Clicking on the "sanity checker" label should show the current state of issues concerning the mental health of the system.
Here are some recent tickets: