[feature request / refactor] Produce structured data containing human-readable code descriptions (refactor RULES.md)

lauriemerrell commented 1 year ago

Describe the problem

Cal-ITP produces https://reports.calitp.org/, where we report on various aspects of GTFS data quality. One of the things we currently display on the site is a grid of validator notices output for a given feed in a given month. We like to display a human-readable notice description so that the notice can be understood by agencies and the general public, who may not be familiar with validator code names.

Currently, to update those human-readable descriptions, we have to manually scrape the data from RULES.md for each validator version and turn it into a CSV that we can import through our pipeline.

To make the CSV, I:

Regex'd the .md file to extract the code with its simple description
Manually annotated with rule severity, because the current format doesn't actually contain a table with code, description, severity in one place (the severity is just indicated in the title of the table, which makes it harder to scrape)
Manually removed Markdown and HTML (RULES.md uses an inconsistent mixture of both)
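For illustration, the scraping steps above might look roughly like the Python sketch below. The `SAMPLE_RULES_MD` excerpt, the heading pattern, and the row layout are assumptions made for this example and may not match the real RULES.md. Severity is inferred from the most recent table heading, which is the fragile part described above.

```python
import csv
import io
import re

# Hypothetical excerpt; the real RULES.md layout may differ.
SAMPLE_RULES_MD = """\
## Table of errors

| Name | Description |
|------|-------------|
| [`missing_required_file`](#missing_required_file) | A required file is <b>missing</b>. |

## Table of warnings

| Name | Description |
|------|-------------|
| [`unknown_column`](#unknown_column) | A column name is *unknown*. |
"""

# Severity only appears in the table heading, so it has to be tracked across lines.
HEADING_RE = re.compile(r"^##\s+Table of (\w+?)s\s*$")
ROW_RE = re.compile(
    r"^\|\s*\[`(?P<code>\w+)`\]\([^)]*\)\s*\|\s*(?P<desc>.*?)\s*\|\s*$"
)
HTML_TAG_RE = re.compile(r"<[^>]+>")
MARKDOWN_MARK_RE = re.compile(r"[*`]")


def scrape_rules(markdown: str) -> list[dict[str, str]]:
    """Extract code, severity, and a plain-text description from each table row."""
    rows = []
    severity = ""
    for line in markdown.splitlines():
        heading = HEADING_RE.match(line)
        if heading:
            severity = heading.group(1).upper()
            continue
        row = ROW_RE.match(line)
        if row:
            description = HTML_TAG_RE.sub("", row.group("desc"))
            description = MARKDOWN_MARK_RE.sub("", description)
            rows.append(
                {
                    "code": row.group("code"),
                    "severity": severity,
                    "description": description,
                }
            )
    return rows


def to_csv(rows: list[dict[str, str]]) -> str:
    """Serialize the scraped rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["code", "severity", "description"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Any change to the heading wording or table layout silently breaks a scraper like this, which is why structured output from the validator would be preferable.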

This also opens up issues like #1322, because RULES.md is maintained separately as a text file and is not tied to the actual validator code.

It would be nice if the human-readable rule descriptions were available as structured data (CSV or JSON) that the validator itself could output, rather than requiring reference to the RULES.md file (analogous to the new notice_schema.json file that can be output by the JAR).

Proposed solution

Rule descriptions could be attributes within the rule implementation itself, and then RULES.md could be programmatically generated based on those attributes, rather than RULES.md being the source of truth but maintained separately.

Alternatives you've considered

No response

Additional context

It would be really nice to have something like code, severity, short_desc, detailed_desc, formatted_desc where formatted could contain Markdown (RULES.md has a shorter rule description in the tables at the top and then a slightly longer description below.)
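As a rough sketch of what one such record could look like, here is a Python dataclass using the five field names suggested above. The `NoticeDoc` class name, the `to_json` helper, and the example description text are hypothetical and are not taken from the validator.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class NoticeDoc:
    """One rule description, using the field names proposed in this issue."""

    code: str
    severity: str
    short_desc: str
    detailed_desc: str
    formatted_desc: str  # may contain Markdown


def to_json(docs: list[NoticeDoc]) -> str:
    """Serialize the records to a JSON array."""
    return json.dumps([asdict(doc) for doc in docs], indent=2)


# Illustrative values only; the wording is not copied from RULES.md.
EXAMPLE = NoticeDoc(
    code="missing_required_file",
    severity="ERROR",
    short_desc="A required file is missing.",
    detailed_desc="The feed does not contain a file that is required.",
    formatted_desc="The feed does not contain a **required** file.",
)
```

With records like this, a consumer such as our pipeline could load `to_json([EXAMPLE])` directly instead of scraping and hand-annotating.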

lauriemerrell commented 1 year ago

I think that more structured data of this type would also be useful for the web-based validator UI, which currently just links to RULES.md to explain rule meaning. If the rule descriptions were in the rule code, presumably the web UI could display them more easily. But I defer to @KClough et al. on those considerations.

themightychris commented 1 year ago

The index.js file in this PR (click to expand the collapsed-by-default file) contained some improved language describing a lot of errors. It's been in the collective mental backlog to capture those somewhere. If there's an effort to capture the rules into a more machine-readable format, that might be a good moment to gobble those up.

bdferris-v2 commented 1 year ago

I've got two updates here:

PR #1327 has a proof-of-concept demo of extracting notice documentation from source code. I'm not sure if it's exactly what we should do, so...
I wrote up some more thoughts on the potential design options at https://bit.ly/gtfs-validator-notice-documentation. Feedback appreciated!

derhuerst commented 1 year ago

> I wrote up some more thoughts on the potential design options at https://bit.ly/gtfs-validator-notice-documentation. Feedback appreciated!

I'm not familiar with Java tooling at all, but maybe it could also be done the other way around: checked-in JSON files act as the "source of truth" for the notices, since they are machine-readable already anyway; a notice implementation (Java file) would read the corresponding JSON file and use its severity, summary, description, etc.; and the RULES.md file could be generated from the JSON files using a simple script.
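A minimal sketch of the last step, generating RULES.md from JSON, could look like this in Python. The JSON layout, the `summary` key, and the output format are assumptions for illustration. A single inline JSON string stands in for the per-notice files to keep the example self-contained.

```python
import json

# Hypothetical notice definitions; in the proposal these would be checked-in files.
NOTICES_JSON = """
[
  {"code": "unknown_column", "severity": "WARNING",
   "summary": "A column name is unknown."},
  {"code": "missing_required_file", "severity": "ERROR",
   "summary": "A required file is missing."}
]
"""

SEVERITY_ORDER = ("ERROR", "WARNING", "INFO")


def render_rules_md(notices: list[dict[str, str]]) -> str:
    """Render one Markdown table per severity, skipping empty severities."""
    lines = ["# Rules", ""]
    for severity in SEVERITY_ORDER:
        group = sorted(
            (n for n in notices if n["severity"] == severity),
            key=lambda n: n["code"],
        )
        if not group:
            continue
        lines += [f"## {severity}", "", "| Code | Summary |", "|------|---------|"]
        lines += [f"| `{n['code']}` | {n['summary']} |" for n in group]
        lines.append("")
    return "\n".join(lines)


rules_md = render_rules_md(json.loads(NOTICES_JSON))
```

Because severity is a field on every record here, the generated tables and any machine-readable consumer read from the same data.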

bdferris-v2 commented 1 year ago

@derhuerst I'm a little worried about the separation between the documentation and code in that case. Specifically, for the individual fields of the notice, which we'd need to define in two places. If we really did go with a JSON representation, I'd vote to generate source code for the Notices from the JSON itself. However, I wonder if we'll run into roadblocks there (e.g. we need custom Java methods on our notice to assist in type-conversion + construction that are awkward to encode/generate from JSON). My hunch is that it's easier to go from code => JSON, but I'd be interested in hearing other arguments for and against.
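To make the code => JSON direction concrete, here is a small Python analogy. It is not the validator's Java implementation and not the approach in PR #1327, and the class names, the `SEVERITY` attribute, and the `describe` helper are hypothetical. Each notice declares its fields once, and the documentation record is derived from the class by introspection.

```python
import inspect
import json
import re
from dataclasses import dataclass, fields
from typing import ClassVar


@dataclass
class MissingRequiredFileNotice:
    """A required file is missing."""

    SEVERITY: ClassVar[str] = "ERROR"
    filename: str


@dataclass
class UnknownColumnNotice:
    """A column name is unknown."""

    SEVERITY: ClassVar[str] = "WARNING"
    filename: str
    field_name: str
    index: int


def camel_to_snake(name: str) -> str:
    """Convert e.g. 'UnknownColumn' to 'unknown_column'."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def describe(notice_cls: type) -> dict:
    """Derive a documentation record from the notice class itself."""
    return {
        "code": camel_to_snake(notice_cls.__name__.removesuffix("Notice")),
        "severity": notice_cls.SEVERITY,
        "description": inspect.getdoc(notice_cls),
        # ClassVar attributes are not dataclass fields, so only payload fields appear.
        "fields": {
            f.name: getattr(f.type, "__name__", str(f.type))
            for f in fields(notice_cls)
        },
    }


notice_docs_json = json.dumps(
    [describe(cls) for cls in (MissingRequiredFileNotice, UnknownColumnNotice)],
    indent=2,
)
```

The field list cannot drift from the implementation here because it is read from the same class that constructs the notice, which is the concern with defining fields in two places.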

KClough commented 1 year ago

@bdferris-v2 I agree.

derhuerst commented 1 year ago

From my point of view – as someone who wants to use gtfs-validator and interpret the results in an automated way –, as long as there's a reasonably easy way to generate a JSON artefact (or anything machine-readable really) containing the rules, I don't mind. 👍

bdferris-v2 commented 1 year ago

Not actually done yet.

emmambd commented 10 months ago

Resolved in v4.2: https://github.com/MobilityData/gtfs-validator/releases/tag/v4.2.0. All PRs are referenced under the "Generate documentation automatically" heading.
