clarin-eric / standards

work space for the Standards and Interoperability Committee
https://www.clarin.eu/content/standards

Sanity checker: general discussion ticket #115

Open bansp opened 2 years ago

bansp commented 2 years ago

This ticket will probably stay outside of milestones for a time, unless we see a straight path to closing it. The previous ticket of this nature was turned into an action ticket (#60), and this one is now the new place to gather ideas and turn them into separate action tickets.

1. Introduction: places to look for sanity checks

Our sanity-checking functionality is now somewhat distributed (and that happened, in a way, by design):

  1. the main sanity-checker page (small at the time of writing, but it's already been useful)
  2. list of missing formats (IDs that the recommendations mention but for which no format-description file exists yet), under "Data Deposition Formats"
  3. list of existing format-description files that are not mentioned by any recommendation -- also under "Data Deposition Formats"
  4. list of format-description files that don't mention any file extensions -- under "File Extensions"
  5. list of format-description files that don't mention any media types -- under "Media Types"
  6. and, indirectly, the Statistics page, which is meant mainly for aggregating and visualizing the data content but may also point to some local insanities ;-), especially because, for now, it contains some meta-statistics that tell us more about the content of the SIS than about the individual centres and formats. Note also that this page has a dedicated discussion issue (#67). Maybe it should later be split into something like "SIS statistics" and "Data Visualization", but let's ignore that in the present issue.
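
To illustrate items 2 and 3 above: both lists amount to a set difference between the format IDs mentioned in recommendations and the IDs that have a format-description file. Below is a minimal sketch in Python, not the SIS implementation; the function name and the sample IDs are hypothetical.

```python
def compare_format_ids(recommended_ids, described_ids):
    """Compare IDs used in recommendations with IDs that have a description file.

    Both arguments are plain iterables of ID strings; how they are read
    from the data files is left out of this sketch.
    """
    recommended = set(recommended_ids)
    described = set(described_ids)
    return {
        # mentioned in a recommendation, but no description file exists
        "missing": sorted(recommended - described),
        # description file exists, but no recommendation mentions it
        "unreferenced": sorted(described - recommended),
    }


# Hypothetical sample IDs, for illustration only.
result = compare_format_ids(["fA", "fB", "fC"], ["fB", "fC", "fD"])
print(result)
```

With the sample data, "fA" ends up under "missing" and "fD" under "unreferenced".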

This very ticket is meant for the content of the sanity-checker page.

2. Sanity-checker page

This page should eventually have structured logic and probably repeat some of the distributed information (which, given the modular structure of the SIS, should be trivial).

2.1. What it contains

| # | Context | Target | Check |
|---|---------|--------|-------|
| 1 | recommendation list | domain name | check if set, check if valid |
| 2 | recommendation list | recommendation level | check if set, check if valid |
| 3 | recommendation list | format ID + domain + level | check if repeated/'similar' |
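
The three checks above could be sketched roughly as follows. This is only an illustration in Python, not the actual sanity-checker code: the dictionary keys, the sets of valid domains and levels, and the sample rows are all hypothetical placeholders. Only exact repeats are detected, because what counts as 'similar' is not defined here.

```python
from collections import Counter

# Hypothetical placeholder values; the real lists of domains and levels
# would come from the SIS data.
VALID_DOMAINS = {"domain-a", "domain-b"}
VALID_LEVELS = {"level-1", "level-2"}


def check_recommendations(rows):
    """Return a list of (row index or None, message) pairs describing problems."""
    problems = []
    for i, row in enumerate(rows):
        # Checks 1 and 2: domain and level must be set and valid.
        for field, valid in (("domain", VALID_DOMAINS), ("level", VALID_LEVELS)):
            value = row.get(field)
            if not value:
                problems.append((i, f"{field} not set"))
            elif value not in valid:
                problems.append((i, f"{field} not valid: {value}"))
    # Check 3: the same format ID + domain + level occurring more than once.
    counts = Counter((r.get("format_id"), r.get("domain"), r.get("level")) for r in rows)
    for key, n in counts.items():
        if n > 1:
            problems.append((None, f"repeated {n} times: {key}"))
    return problems


# Hypothetical sample rows, for illustration only.
rows = [
    {"format_id": "fA", "domain": "domain-a", "level": "level-1"},
    {"format_id": "fA", "domain": "domain-a", "level": "level-1"},
    {"format_id": "fB", "domain": "", "level": "level-9"},
]
problems = check_recommendations(rows)
for where, message in problems:
    print(where, message)
```

Keeping each check as a small rule that appends to one list of problems would make it easy to add further rows to the table later.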

2.2. What else it might contain (and/or how it can get arranged)

There are three main hubs of information that may either get edited directly or get 'spoiled' in some way by an external change:

  1. format descriptions under data/formats/
  2. recommendations under data/recommendations/
  3. centre descriptions under data/centres.xml

We thus get three targets for sanity checks. Bear in mind that the middle one, recommendations, can be checked for internal coherence but also for whether the associations that it makes (between centres, formats, and the properties defined inside recommendations) are coherent.

2.2.1. Context: formats

2.2.2. Context: recommendations and the associations that they create

Internally to the list of recommendations
Associations defined by the recommendations

2.2.3. Context: centres

Please kindly add ideas / comments below. Please don't assign this ticket, and it probably doesn't make sense to put it into a milestone either. Other, real action tickets should mention this one if they concern implementing some of the above ideas.

bansp commented 1 year ago
bansp commented 5 months ago

Clicking on the "sanity checker" label should show the current state of issues concerning the mental health of the system. Here are some recent tickets: