Closed e-lo closed 1 year ago
Thinking out loud here.
We will probably need an "examples" folder at the top level of the repo. Within examples, maybe a folder for each agency; I'd recommend using the agency's NTD ID as the folder name?
There could be at least two distinct types:
Maybe the agency folder could contain:
Suppose a sample dataset fails validation: do we automatically create an issue? If the validation is part of a PR, do we block merging until validation passes, add the failure to the sample's known issues, or create a new issue for the sample + PR?
I really like your proposal, @botanize - I'm updating the issue description to reflect most of it.
Would you be able to submit a PR on this with the Metro Transit data along with documentation for "adding your data"?
I can work on the validation part once you have the data if you don't have time for that...just LMK
To consider (reflected in my updated issue description):
I believe we're already using INI for flake8, and yaml for mkdocs, so we should probably use TOML or XML for agency metadata just to round things out? J/K. YAML seems fine.
I like this as well. Can we adjust slightly so as not to limit to just transit agencies creating examples. Vendors, researchers, consultants, etc., all might have example data sets and scripts that they want to share, validate, etc.
> not to limit to just transit agencies creating examples. Vendors, researchers, consultants, etc.,
I don't think the current structure is limited to just agencies submitting...but each submittal should have an associated transit service. So long as top-level folder is unique...
I attempted to validate a small data sample for vehicle_locations.
If I don't specify a schema it's happy.
frictionless validate samples/50027-MetroTransitMN/TIDES/vehicle_locations.csv --limit-rows 2
# -----
# valid: samples/50027-MetroTransitMN/TIDES/vehicle_locations.csv
# -----
## Summary
> reached row limit: 2
+--------------+----------------------------------------------------------+
| Name | Value |
+==============+==========================================================+
| File Place | samples/50027-MetroTransitMN/TIDES/vehicle_locations.csv |
+--------------+----------------------------------------------------------+
| File Size | 16.4 kB |
+--------------+----------------------------------------------------------+
| Total Time | 0.061 Seconds |
+--------------+----------------------------------------------------------+
| Rows Checked | 2 |
+--------------+----------------------------------------------------------+
If I do, then it gives me errors for missing and incorrect labels, presumably because I don't have a column for every field in the spec, even though many are optional.
Does frictionless interpret optional to mean that the field must be present in the data table, but that it doesn't need to have values? That's very different from how most people (and GTFS) would interpret optional.
frictionless validate samples/50027-MetroTransitMN/TIDES/vehicle_locations.csv --limit-rows 2 --schema spec/vehicle_locations.schema.json
# -------
# invalid: samples/50027-MetroTransitMN/TIDES/vehicle_locations.csv
# -------
## Summary
> reached row limit: 2
+------------------+----------------------------------------------------------+
| Name | Value |
+==================+==========================================================+
| File Place | samples/50027-MetroTransitMN/TIDES/vehicle_locations.csv |
+------------------+----------------------------------------------------------+
| File Size | 16.4 kB |
+------------------+----------------------------------------------------------+
| Total Time | 0.022 Seconds |
+------------------+----------------------------------------------------------+
| Rows Checked | 2 |
+------------------+----------------------------------------------------------+
| Total Errors | 35 |
+------------------+----------------------------------------------------------+
| Missing Label | 6 |
+------------------+----------------------------------------------------------+
| Incorrect Label | 7 |
+------------------+----------------------------------------------------------+
| Constraint Error | 8 |
+------------------+----------------------------------------------------------+
| Type Error | 2 |
+------------------+----------------------------------------------------------+
| Missing Cell | 12 |
+------------------+----------------------------------------------------------+
## Errors
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| Row | Field | Type | Message |
+=======+=========+==================+============================================================================================+
| | 14 | missing-label | There is a missing label in the header's field "speed" at position "14" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| | 15 | missing-label | There is a missing label in the header's field "odometer" at position "15" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| | 16 | missing-label | There is a missing label in the header's field "schedule_deviation" at position "16" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| | 17 | missing-label | There is a missing label in the header's field "headway_deviation" at position "17" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| | 18 | missing-label | There is a missing label in the header's field "in_service" at position "18" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| | 19 | missing-label | There is a missing label in the header's field "schedule_relationship" at position "19" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| | 7 | incorrect-label | Label "latitude" in field device_id at position "7" does not match the field name in the |
| | | | schema |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| | 8 | incorrect-label | Label "longitude" in field stop_id at position "8" does not match the field name in the |
| | | | schema |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| | 9 | incorrect-label | Label "heading" in field current_status at position "9" does not match the field name in |
| | | | the schema |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| | 10 | incorrect-label | Label "speed" in field latitude at position "10" does not match the field name in the |
| | | | schema |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| | 11 | incorrect-label | Label "odometer" in field longitude at position "11" does not match the field name in the |
| | | | schema |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| | 12 | incorrect-label | Label "schedule_deviation" in field gps_quality at position "12" does not match the field |
| | | | name in the schema |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| | 13 | incorrect-label | Label "gps_quality" in field heading at position "13" does not match the field name in the |
| | | | schema |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 2 | 4 | constraint-error | The cell "" in row at position "2" and field "trip_id_performed" at position "4" does not |
| | | | conform to a constraint: constraint "required" is "True" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 2 | 5 | constraint-error | The cell "" in row at position "2" and field "stop_sequence" at position "5" does not |
| | | | conform to a constraint: constraint "required" is "True" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 2 | 12 | constraint-error | The cell "0" in row at position "2" and field "gps_quality" at position "12" does not |
| | | | conform to a constraint: constraint "enum" is "['Excellent', 'Good', 'Poor']" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 2 | 13 | type-error | Type error in the cell "Excellent" in row "2" and field "heading" at position "13": type |
| | | | is "number/default" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 2 | 14 | missing-cell | Row at position "2" has a missing cell in field "speed" at position "14" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 2 | 15 | missing-cell | Row at position "2" has a missing cell in field "odometer" at position "15" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 2 | 16 | missing-cell | Row at position "2" has a missing cell in field "schedule_deviation" at position "16" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 2 | 17 | missing-cell | Row at position "2" has a missing cell in field "headway_deviation" at position "17" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 2 | 18 | missing-cell | Row at position "2" has a missing cell in field "in_service" at position "18" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 2 | 19 | missing-cell | Row at position "2" has a missing cell in field "schedule_relationship" at position "19" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3 | 4 | constraint-error | The cell "" in row at position "3" and field "trip_id_performed" at position "4" does not |
| | | | conform to a constraint: constraint "required" is "True" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3 | 5 | constraint-error | The cell "" in row at position "3" and field "stop_sequence" at position "5" does not |
| | | | conform to a constraint: constraint "required" is "True" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3 | 9 | constraint-error | The cell "0" in row at position "3" and field "current_status" at position "9" does not |
| | | | conform to a constraint: constraint "enum" is "['Incoming at', 'Stopped at', 'In transit |
| | | | to']" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3 | 11 | constraint-error | The cell "18008" in row at position "3" and field "longitude" at position "11" does not |
| | | | conform to a constraint: constraint "maximum" is "180" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3 | 12 | constraint-error | The cell "0" in row at position "3" and field "gps_quality" at position "12" does not |
| | | | conform to a constraint: constraint "enum" is "['Excellent', 'Good', 'Poor']" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3 | 13 | type-error | Type error in the cell "Good" in row "3" and field "heading" at position "13": type is |
| | | | "number/default" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3 | 14 | missing-cell | Row at position "3" has a missing cell in field "speed" at position "14" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3 | 15 | missing-cell | Row at position "3" has a missing cell in field "odometer" at position "15" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3 | 16 | missing-cell | Row at position "3" has a missing cell in field "schedule_deviation" at position "16" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3 | 17 | missing-cell | Row at position "3" has a missing cell in field "headway_deviation" at position "17" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3 | 18 | missing-cell | Row at position "3" has a missing cell in field "in_service" at position "18" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
| 3 | 19 | missing-cell | Row at position "3" has a missing cell in field "schedule_relationship" at position "19" |
+-------+---------+------------------+--------------------------------------------------------------------------------------------+
I filed a bug with frictionless framework: https://github.com/frictionlessdata/framework/issues/1258
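For anyone following along: the incorrect-label errors above suggest the header is being compared to the schema by position, so one absent optional column shifts everything after it. Here is a rough sketch of a name-based check, using only the standard library. It assumes the schema is Table Schema JSON with a `fields` list and an optional `constraints.required` flag; the helper name `compare_header_to_schema`, the trimmed schema, and the sample row are all hypothetical and are not the real `vehicle_locations` spec.

```python
import csv
import io


def compare_header_to_schema(header, schema):
    """Match CSV labels to Table Schema field names by name, ignoring position."""
    names = [field["name"] for field in schema["fields"]]
    required = {
        field["name"]
        for field in schema["fields"]
        if field.get("constraints", {}).get("required", False)
    }
    present = set(header)
    known = set(names)
    return {
        "missing_required": [n for n in names if n in required and n not in present],
        "missing_optional": [n for n in names if n not in required and n not in present],
        "not_in_schema": [label for label in header if label not in known],
    }


# Hypothetical, heavily trimmed schema for illustration only.
schema = {
    "fields": [
        {"name": "device_id", "constraints": {"required": True}},
        {"name": "stop_id"},
        {"name": "latitude"},
        {"name": "longitude"},
        {"name": "heading"},
        {"name": "speed"},
    ]
}

# Hypothetical sample that omits the optional stop_id and speed columns.
sample = io.StringIO("device_id,latitude,longitude,heading\nD1,44.97,-93.26,180\n")
header = next(csv.reader(sample))

report = compare_header_to_schema(header, schema)
print(report)
```

With a check like this, an absent optional column is reported once as missing and does not make the remaining labels look incorrect.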
Context
- Disagreement on the broad usefulness of user story 5 (and its child, user story 6) as discussed in the unresolved PR comments, but summarized as:
  - As a smaller transit agency or one that isn't yet automatically generating data, I'd like a template folder that can be used for quickly prototyping my TIDES data.
  - As a TIDES maintainer, I'd like templates to automatically update based on updates to the spec
- `datapackage.json` contains a lot of important information about data expectations which is not validated anywhere, since we are currently using a `tabular-data-package` profile rather than developing our own profile which could enforce some of these conventions.
Options
- Template files that are auto-generated from the spec? (currently implemented in PR, preferred by @e-lo)
- Template files as static files
- `datapackage.json` documented as static text in the `README.md`, no csv templates (preferred by @botanize)
I think there's a fourth option here, which is related to the second: provide real example files, from either Metro Transit or CalITP.
This would match the common practice of providing a working tutorial, test data, or template config files that can be used as-is. They would contain real-world data, giving potential producers more insight than empty or fake data.
Any significant proposed change to the spec will show up as a validation error in the CI workflow, and could be resolved by the producer as part of the spec change PR, or in a separate PR.
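A rough sketch of what that CI step might look like, assuming the layout used in the commands earlier in this thread (`samples/<id>/TIDES/<table>.csv` validated against `spec/<table>.schema.json`). The helper names `build_validate_commands` and `run_all` are hypothetical, and `run_all` assumes the `frictionless` CLI is installed and exits non-zero for an invalid file; it is defined here but not called.

```python
import subprocess
import tempfile
from pathlib import Path


def build_validate_commands(samples_root, spec_dir):
    """Pair each sample CSV with the spec schema that shares its table name."""
    commands = []
    for csv_path in sorted(Path(samples_root).glob("*/TIDES/*.csv")):
        schema_path = Path(spec_dir) / f"{csv_path.stem}.schema.json"
        if schema_path.exists():
            commands.append(
                ["frictionless", "validate", str(csv_path), "--schema", str(schema_path)]
            )
    return commands


def run_all(commands):
    """Run every command; True only if all exit with status 0 (assumed to mean valid)."""
    codes = [subprocess.run(cmd).returncode for cmd in commands]
    return all(code == 0 for code in codes)


# Build a throwaway tree that mimics the repo layout, for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sample_dir = root / "samples" / "50027-MetroTransitMN" / "TIDES"
    sample_dir.mkdir(parents=True)
    (sample_dir / "vehicle_locations.csv").write_text("latitude,longitude\n")
    (root / "spec").mkdir()
    (root / "spec" / "vehicle_locations.schema.json").write_text('{"fields": []}')
    commands = build_validate_commands(root / "samples", root / "spec")

for cmd in commands:
    print(" ".join(cmd))
```

Because the schema is looked up by table name on every run, a spec change that breaks a sample would surface as a failing command in the same PR.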
I suggest dropping the templates from #75 and merging what we already agree on as outlined above, then addressing the need for more documentation in a separate PR that provides example data. I've got a minimal one almost ready to go for Metro Transit; it's currently blocked by the frictionless foreignKey validation in #77, which could be resolved by #81.
Can we split this PR into at least three?
The first two PRs could be merged almost immediately with a few modifications of the contents of this PR:
Deprecating in favor of PR #100 and #101 and issue #93
User Stories
Describe the feature you want and how it meets your needs or solves a problem
Proposed Solution
List of Solutions for (relevant user story)
(Checking box indicates consensus achieved on approach)
`datapackage.json`
data with REPLACEME (or similar) `datapackage.json`) which can be used to quickly generate new examples (5)
BONUS (or potentially another issue/PR):
Proposed example directory structure:
Consensus Building
General Agreement
To Discuss
1 - Sources
How should data sources be documented?
Context
- `datapackage.json` allows users to specify where the data came from in the `sources` field.
- `sources` can be specified at the `data-package` or `resource` level.
- `source` listed in `sources`.
Options
allow either option: potential compromise (I think it's fine with both @botanize and @e-lo, but is less opinionated)
Discussed in the unresolved PR comments
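To make the two levels concrete, here is a small sketch of a package descriptor that carries `sources` at both the data-package and the resource level. The package name, the REPLACEME titles, and the `list_sources` helper are hypothetical; the only thing assumed from the Data Package spec is that `sources` is a list of objects with a `title`.

```python
import json

# Hypothetical descriptor: placeholder values only.
package = {
    "name": "example-sample",
    "sources": [{"title": "REPLACEME: package-level source"}],
    "resources": [
        {
            "name": "vehicle_locations",
            "path": "TIDES/vehicle_locations.csv",
            "sources": [{"title": "REPLACEME: resource-level source"}],
        }
    ],
}


def list_sources(package):
    """Collect (level, title) pairs for every source in the descriptor."""
    found = [("data-package", s["title"]) for s in package.get("sources", [])]
    for resource in package.get("resources", []):
        for s in resource.get("sources", []):
            found.append((f"resource:{resource['name']}", s["title"]))
    return found


print(json.dumps(package, indent=2))
print(list_sources(package))
```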
note: this would only affect our documentation and template (if used, see below) `datapackage.json`, since we are not developing (at this time) a datapackage profile which would validate this data.
2 - Template Files
Should we have template files (csvs and `datapackage.json`) and if so, is it useful to have code that auto-generates them based on changes to the spec?
Context
- `datapackage.json` contains a lot of important information about data expectations which is not validated anywhere, since we are currently using a `tabular-data-package` profile rather than developing our own profile which could enforce some of these conventions.
Options
- `datapackage.json` documented as static text in the `README.md`, no csv templates (preferred by @botanize)
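For the auto-generation question, a minimal sketch of what generating a CSV template from a schema could look like. It is not the code in the PR. It assumes the schema is Table Schema JSON with a `fields` list; the `template_csv` helper and the trimmed schema are hypothetical.

```python
import csv
import io


def template_csv(schema):
    """Return a header-only CSV whose columns follow the schema's field order."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([field["name"] for field in schema["fields"]])
    return buffer.getvalue()


# Hypothetical, trimmed schema for illustration only.
schema = {
    "fields": [
        {"name": "device_id"},
        {"name": "latitude"},
        {"name": "longitude"},
    ]
}

template = template_csv(schema)
print(template, end="")
```

Regenerating the template whenever the schema file changes would keep the two in sync, which is the maintenance trade-off being weighed against static files.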