TIDES-transit / TIDES

Transit ITS Data Exchange Specification for historical transit operations data
https://tides-transit.github.io/TIDES
Apache License 2.0

🚀💻 – Validate sample data against TIDES frictionless specification #40

Closed e-lo closed 1 year ago

e-lo commented 2 years ago

User Stories

Describe the feature you want and how it meets your needs or solves a problem

  1. As a transit agency, I'd like to know how to add my sample data.
  2. As a transit agency, I want to know if my data and the scripts that support them conform to the latest TIDES spec so that I can make any necessary fixes.
  3. As a TIDES contributor, I'd like to know how proposed spec changes affect actual data so I can evaluate their ROI.
  4. As a transit agency or transit technology developer, I'd like to easily review the sample data, scripts, and the context in which they were developed.
  5. As a smaller transit agency or one that isn't yet automatically generating data, I'd like a template folder that can be used for quickly prototyping my TIDES data.
  6. As a TIDES maintainer, I'd like templates to automatically update based on updates to the spec

Proposed Solution

List of Solutions for (relevant user story)

(Checking box indicates consensus achieved on approach)

BONUS (or potentially another issue/PR):

Proposed example directory structure:

/samples
    /agency-name     # Unique agency name
        datapackage.json  # Basic example information such as agency-name, CAD-AVL vendor, spec version, data maintainer (and their GH handle) 
        /TIDES       # Data formatted in TIDES standard (in the future, we could have subfolders for versions if necessary)
        /raw    # Raw input data
        /scripts     # Scripts that turn the raw data into TIDES
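
As a rough illustration of the layout above, the folders could be scaffolded with a few lines of Python. This is only a sketch: the function name `scaffold_sample`, the placeholder agency name, and the minimal `datapackage.json` contents are assumptions, not anything agreed in this issue.

```python
import json
from pathlib import Path


def scaffold_sample(samples_root: Path, agency_name: str) -> Path:
    """Create the proposed folder layout for one sample and return its path."""
    sample_dir = samples_root / agency_name
    # Subfolder names follow the proposed directory structure.
    for sub in ("TIDES", "raw", "scripts"):
        (sample_dir / sub).mkdir(parents=True, exist_ok=True)
    descriptor = sample_dir / "datapackage.json"
    if not descriptor.exists():
        # Placeholder metadata only; the real keys are still under discussion.
        placeholder = {"name": agency_name, "resources": []}
        descriptor.write_text(json.dumps(placeholder, indent=2))
    return sample_dir
```

Calling `scaffold_sample(Path("samples"), "example-agency")` would create `samples/example-agency` with the three subfolders and a stub descriptor, leaving any existing descriptor untouched.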

Consensus Building

General Agreement

To Discuss

1 - Sources

How should data sources be documented?

Context

Options

  1. resource-level: can relate different resources (i.e. fares vs APC) to different sources (preferred by @e-lo)
  2. datapackage-level: simplifying and reducing the data that must be entered and replicated (preferred by @botanize)
  3. allow option for either: potential compromise (I think fine with both @botanize and @e-lo , but is less opinionated)

    Discussed in the unresolved PR comments

Note: this would only affect our documentation and the template datapackage.json (if one is used; see below), since we are not currently developing a datapackage profile that would validate this data.
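
To make option 3 concrete, here is a minimal sketch of how a reader of a datapackage.json might resolve sources when either level is allowed: a resource's own `sources` win, otherwise the package-level `sources` apply. The helper name `effective_sources`, the resource names, and the source titles are all hypothetical; only the `sources`, `resources`, `name`, and `path` keys come from the Data Package format.

```python
def effective_sources(package: dict, resource_name: str) -> list:
    """Return resource-level sources if present, else the package-level ones."""
    for resource in package.get("resources", []):
        if resource.get("name") == resource_name:
            return resource.get("sources") or package.get("sources", [])
    raise KeyError(resource_name)


# Hypothetical descriptor: one package-wide source, one resource-specific override.
package = {
    "name": "example-agency",
    "sources": [{"title": "Example CAD/AVL system"}],
    "resources": [
        {"name": "vehicle_locations", "path": "TIDES/vehicle_locations.csv"},
        {
            "name": "fare_transactions",
            "path": "TIDES/fare_transactions.csv",
            "sources": [{"title": "Example fare collection system"}],
        },
    ],
}
```

With this descriptor, `vehicle_locations` inherits the package-level source while `fare_transactions` reports its own, which is the situation option 1 is meant to handle.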

2 - Template Files

Should we have template files (CSVs and datapackage.json), and if so, is it useful to have code that auto-generates them based on changes to the spec?

Context

  1. As a smaller transit agency or one that isn't yet automatically generating data, I'd like a template folder that can be used for quickly prototyping my TIDES data.
  2. As a TIDES maintainer, I'd like templates to automatically update based on updates to the spec

Options

  1. that are auto-generated from the spec? (currently implemented in PR, preferred by @e-lo )
  2. as static files
  3. datapackage.json documented as static text in the README.md, no csv templates (preferred by @botanize)
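
For reference, option 1 can be sketched in a few lines: read the `fields` list of a table schema and write a header-only CSV. This is not the code in the PR; the function name `csv_template_from_schema` is made up, and the three-field schema is an illustrative subset with assumed types, not the real vehicle_locations schema.

```python
import csv
import io


def csv_template_from_schema(schema: dict) -> str:
    """Return a header-only CSV whose columns follow the schema's field order."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(field["name"] for field in schema["fields"])
    return buffer.getvalue()


# Illustrative subset of fields; names appear in this thread, types are assumed.
example_schema = {
    "fields": [
        {"name": "trip_id_performed", "type": "string"},
        {"name": "latitude", "type": "number"},
        {"name": "longitude", "type": "number"},
    ]
}

template = csv_template_from_schema(example_schema)
```

Because the header is derived from the schema each time, regenerating the template after a spec change keeps the two in step, which is the maintainer user story above.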

botanize commented 1 year ago

Thinking out loud here.

We will probably need an "examples" folder at the top level of the repo. Within examples maybe a folder for each agency, recommend using agency NTDID for folder name?

There could be at least two distinct types:

  1. raw-ish data + processing code
  2. output in TIDES format

Maybe the agency folder could contain:

Suppose a sample dataset fails validation: do we automatically create an issue? If the validation is part of a PR, do we block merging until validation passes, add the failure to the sample's known issues, or create a new issue for the sample + PR?

e-lo commented 1 year ago

I really like your proposal, @botanize - I'm updating the issue description to reflect most of it.

Would you be able to submit a PR on this with the Metro Transit data along with documentation for "adding your data"?

I can work on the validation part once you have the data if you don't have time for that...just LMK

e-lo commented 1 year ago

To consider (reflected in my updated issue description):

botanize commented 1 year ago

I believe we're already using INI for flake8, and yaml for mkdocs, so we should probably use TOML or XML for agency metadata just to round things out? J/K. YAML seems fine.

jlstpaul commented 1 year ago

I like this as well. Can we adjust it slightly so that it isn't limited to transit agencies creating examples? Vendors, researchers, consultants, etc., all might have example datasets and scripts that they want to share, validate, etc.

e-lo commented 1 year ago

not to limit to just transit agencies creating examples. Vendors, researchers, consultants, etc.,

I don't think the current structure is limited to just agencies submitting...but each submittal should have an associated transit service. So long as the top-level folder is unique...

botanize commented 1 year ago

I attempted to validate a small data sample for vehicle_locations.

If I don't specify a schema it's happy.

frictionless validate samples/50027-MetroTransitMN/TIDES/vehicle_locations.csv --limit-rows 2

# -----
# valid: samples/50027-MetroTransitMN/TIDES/vehicle_locations.csv 
# -----

## Summary 

> reached row limit: 2

+--------------+----------------------------------------------------------+
| Name         | Value                                                    |
+==============+==========================================================+
| File Place   | samples/50027-MetroTransitMN/TIDES/vehicle_locations.csv |
+--------------+----------------------------------------------------------+
| File Size    | 16.4 kB                                                  |
+--------------+----------------------------------------------------------+
| Total Time   | 0.061 Seconds                                            |
+--------------+----------------------------------------------------------+
| Rows Checked | 2                                                        |
+--------------+----------------------------------------------------------+

If I do, then it gives me errors for missing and incorrect labels, presumably because I don't have a field for every field in the spec, even though many are optional.

Does frictionless interpret optional to mean that the field must be present in the data table, but that it doesn't need to have values? That's very different from how most people (and GTFS) would interpret optional.

frictionless validate samples/50027-MetroTransitMN/TIDES/vehicle_locations.csv --limit-rows 2 --schema spec/vehicle_locations.schema.json

# -------
# invalid: samples/50027-MetroTransitMN/TIDES/vehicle_locations.csv 
# -------

## Summary 

> reached row limit: 2

+------------------+----------------------------------------------------------+
| Name             | Value                                                    |
+==================+==========================================================+
| File Place       | samples/50027-MetroTransitMN/TIDES/vehicle_locations.csv |
+------------------+----------------------------------------------------------+
| File Size        | 16.4 kB                                                  |
+------------------+----------------------------------------------------------+
| Total Time       | 0.022 Seconds                                            |
+------------------+----------------------------------------------------------+
| Rows Checked     | 2                                                        |
+------------------+----------------------------------------------------------+
| Total Errors     | 35                                                       |
+------------------+----------------------------------------------------------+
| Missing Label    | 6                                                        |
+------------------+----------------------------------------------------------+
| Incorrect Label  | 7                                                        |
+------------------+----------------------------------------------------------+
| Constraint Error | 8                                                        |
+------------------+----------------------------------------------------------+
| Type Error       | 2                                                        |
+------------------+----------------------------------------------------------+
| Missing Cell     | 12                                                       |
+------------------+----------------------------------------------------------+

## Errors 

+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| Row   |   Field | Type             | Message                                                                                    |
+=======+=========+==================+============================================================================================+
|       |      14 | missing-label    | There is a missing label in the header's field "speed" at position "14"                    |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
|       |      15 | missing-label    | There is a missing label in the header's field "odometer" at position "15"                 |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
|       |      16 | missing-label    | There is a missing label in the header's field "schedule_deviation" at position "16"       |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
|       |      17 | missing-label    | There is a missing label in the header's field "headway_deviation" at position "17"        |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
|       |      18 | missing-label    | There is a missing label in the header's field "in_service" at position "18"               |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
|       |      19 | missing-label    | There is a missing label in the header's field "schedule_relationship" at position "19"    |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
|       |       7 | incorrect-label  | Label "latitude" in field device_id at position "7" does not match the field name in the   |
|       |         |                  | schema                                                                                     |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
|       |       8 | incorrect-label  | Label "longitude" in field stop_id at position "8" does not match the field name in the    |
|       |         |                  | schema                                                                                     |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
|       |       9 | incorrect-label  | Label "heading" in field current_status at position "9" does not match the field name in   |
|       |         |                  | the schema                                                                                 |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
|       |      10 | incorrect-label  | Label "speed" in field latitude at position "10" does not match the field name in the      |
|       |         |                  | schema                                                                                     |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
|       |      11 | incorrect-label  | Label "odometer" in field longitude at position "11" does not match the field name in the  |
|       |         |                  | schema                                                                                     |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
|       |      12 | incorrect-label  | Label "schedule_deviation" in field gps_quality at position "12" does not match the field  |
|       |         |                  | name in the schema                                                                         |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
|       |      13 | incorrect-label  | Label "gps_quality" in field heading at position "13" does not match the field name in the |
|       |         |                  | schema                                                                                     |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 2     |       4 | constraint-error | The cell "" in row at position "2" and field "trip_id_performed" at position "4" does not  |
|       |         |                  | conform to a constraint: constraint "required" is "True"                                   |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 2     |       5 | constraint-error | The cell "" in row at position "2" and field "stop_sequence" at position "5" does not      |
|       |         |                  | conform to a constraint: constraint "required" is "True"                                   |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 2     |      12 | constraint-error | The cell "0" in row at position "2" and field "gps_quality" at position "12" does not      |
|       |         |                  | conform to a constraint: constraint "enum" is "['Excellent', 'Good', 'Poor']"              |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 2     |      13 | type-error       | Type error in the cell "Excellent" in row "2" and field "heading" at position "13": type   |
|       |         |                  | is "number/default"                                                                        |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 2     |      14 | missing-cell     | Row at position "2" has a missing cell in field "speed" at position "14"                   |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 2     |      15 | missing-cell     | Row at position "2" has a missing cell in field "odometer" at position "15"                |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 2     |      16 | missing-cell     | Row at position "2" has a missing cell in field "schedule_deviation" at position "16"      |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 2     |      17 | missing-cell     | Row at position "2" has a missing cell in field "headway_deviation" at position "17"       |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 2     |      18 | missing-cell     | Row at position "2" has a missing cell in field "in_service" at position "18"              |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 2     |      19 | missing-cell     | Row at position "2" has a missing cell in field "schedule_relationship" at position "19"   |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3     |       4 | constraint-error | The cell "" in row at position "3" and field "trip_id_performed" at position "4" does not  |
|       |         |                  | conform to a constraint: constraint "required" is "True"                                   |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3     |       5 | constraint-error | The cell "" in row at position "3" and field "stop_sequence" at position "5" does not      |
|       |         |                  | conform to a constraint: constraint "required" is "True"                                   |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3     |       9 | constraint-error | The cell "0" in row at position "3" and field "current_status" at position "9" does not    |
|       |         |                  | conform to a constraint: constraint "enum" is "['Incoming at', 'Stopped at', 'In transit   |
|       |         |                  | to']"                                                                                      |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3     |      11 | constraint-error | The cell "18008" in row at position "3" and field "longitude" at position "11" does not    |
|       |         |                  | conform to a constraint: constraint "maximum" is "180"                                     |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3     |      12 | constraint-error | The cell "0" in row at position "3" and field "gps_quality" at position "12" does not      |
|       |         |                  | conform to a constraint: constraint "enum" is "['Excellent', 'Good', 'Poor']"              |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3     |      13 | type-error       | Type error in the cell "Good" in row "3" and field "heading" at position "13": type is     |
|       |         |                  | "number/default"                                                                           |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3     |      14 | missing-cell     | Row at position "3" has a missing cell in field "speed" at position "14"                   |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3     |      15 | missing-cell     | Row at position "3" has a missing cell in field "odometer" at position "15"                |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3     |      16 | missing-cell     | Row at position "3" has a missing cell in field "schedule_deviation" at position "16"      |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3     |      17 | missing-cell     | Row at position "3" has a missing cell in field "headway_deviation" at position "17"       |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3     |      18 | missing-cell     | Row at position "3" has a missing cell in field "in_service" at position "18"              |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3     |      19 | missing-cell     | Row at position "3" has a missing cell in field "schedule_relationship" at position "19"   |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+

I filed a bug with frictionless framework: https://github.com/frictionlessdata/framework/issues/1258
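
The label errors in the report above are consistent with the header being compared to the schema position by position rather than by name. The sketch below contrasts the two approaches in plain Python. It is a simplified model, not the frictionless implementation; the shortened field list and the assumption that `device_id` is optional are illustrative only.

```python
def positional_label_errors(schema_fields, labels):
    """Compare header labels to schema fields by position (1-based)."""
    errors = []
    for position, field in enumerate(schema_fields, start=1):
        if position > len(labels):
            errors.append((position, "missing-label", field))
        elif labels[position - 1] != field:
            errors.append((position, "incorrect-label", field))
    return errors


def absent_fields_by_name(schema_fields, labels):
    """List schema fields that have no column with the same name."""
    return [field for field in schema_fields if field not in labels]


# Shortened, illustrative schema order; device_id is assumed to be optional.
schema_fields = ["trip_id_performed", "stop_sequence", "device_id", "latitude", "longitude"]
# The data file simply omits the optional column.
labels = ["trip_id_performed", "stop_sequence", "latitude", "longitude"]

by_position = positional_label_errors(schema_fields, labels)
by_name = absent_fields_by_name(schema_fields, labels)
```

In this toy case, positional matching reports two incorrect labels and one missing label because every column after the omitted one shifts left, while name matching reports only that `device_id` is absent. The same shift would also explain type and enum errors such as "Excellent" being checked against `heading`.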

botanize commented 1 year ago

Context

  1. As a smaller transit agency or one that isn't yet automatically generating data, I'd like a template folder that can be used for quickly prototyping my TIDES data.
  2. As a TIDES maintainer, I'd like templates to automatically update based on updates to the spec

Options

  1. that are auto-generated from the spec? (currently implemented in PR, preferred by @e-lo )
  2. as static files
  3. datapackage.json documented as static text in the README.md, no csv templates (preferred by @botanize)

I think there's a fourth option here, related to the second: provide real example files from either Metro Transit or CalITP.

This would match the common practice of providing a working tutorial, test data, or template config files that can be used as-is. They would contain real-world data, providing more insight for potential producers than empty or fake data.

Any significant proposed change to the spec will show up as a validation error in the CI workflow, and could be resolved by the producer as part of the spec change PR, or in a separate PR.
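
A minimal sketch of what such a CI step might run, assuming each sample table is named after its schema file (for example `vehicle_locations.csv` and `spec/vehicle_locations.schema.json`, as in the commands earlier in this thread). The function name `build_validate_commands` is hypothetical, and the actual workflow may be organised differently.

```python
from pathlib import Path


def build_validate_commands(repo_root: Path) -> list:
    """Build one `frictionless validate` command per sample table with a matching schema."""
    commands = []
    for csv_path in sorted(repo_root.glob("samples/*/TIDES/*.csv")):
        # Assumed naming convention: <table>.csv pairs with spec/<table>.schema.json
        schema_path = repo_root / "spec" / f"{csv_path.stem}.schema.json"
        if schema_path.exists():
            commands.append([
                "frictionless",
                "validate",
                csv_path.relative_to(repo_root).as_posix(),
                "--schema",
                schema_path.relative_to(repo_root).as_posix(),
            ])
    return commands
```

Each command could then be run from the repository root with `subprocess.run(command, cwd=repo_root)`. Failing the job on invalid samples relies on the CLI signalling invalid data through its exit status, which is worth confirming before depending on it.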

I suggest dropping the templates from #75 and merging what we already agree on as outlined above, then addressing the need for more documentation in a separate PR that provides example data. I've got a minimal one almost ready to go for Metro Transit; it's currently blocked by the frictionless foreignKey validation in #77, which could be resolved by #81.

botanize commented 1 year ago

Can we split this PR into at least three?

The first two PRs could be merged almost immediately with a few modifications of the contents of this PR:

  1. Documentation of desired samples structure and generation for samples
    • README.md: replace reference to template with Data Package Spec
    • docs/index.md
    • contributors.md
    • docs/samples.md
    • main.py
    • mkdocs.yml
    • .markdownlint.yaml
    • requirements.txt
    • .gitignore
    • samples/README.md: remove template section
  2. GitHub workflow for sample validation
    • .github/workflows/validate-data.yml: maybe change name to validate-samples to match workflow name, update paths to match descriptions in documentation.
  3. Datapackage profile and documentation (#93)

e-lo commented 1 year ago

Deprecating in favor of PR #100 and #101 and issue #93