Closed machow closed 3 years ago
Hi @machow ! thanks for the suggestion - do you see that yaml being generated for each different validation report (and system error report)? Or once and for all for each notices?
Hey--sorry for the wait. I was imagining it being generated once, and covering every possible notice. This way people could dump the schema once when setting up a database / dump it as a reference.
Hi @machow! We thought we could build such a schema using SQL (that is more generic) instead of BigQuery; would that meet your needs?
cc @barbeau
@lionel-nj I am not sure what that means--can you say more about what building a schema using SQL would look like?
For example, this is what a schema using SQL looks like...
CREATE TABLE table_name (
column1 datatype,
column2 datatype,
column3 datatype,
....
);
These statements tend to be database / dialect dependent (esp. the types). You could dump it to json using type names that tools made to translate across dialects (e.g. sqlalchemy) use. That would allow people to construct these schemas across databases.
@machow Yes, our internal discussions were about trying to generalize a solution to be usable beyond just BigQuery. We're open to suggestions for how exactly that is implemented to be most useful to users across many different database systems.
Ah, that makes sense! Here are some more thoughts on the risks of a generic SQL-focused solution, and what seem like some of the core factors here. The one thing I'd emphasize is that my proposal attempts to bootstrap off bigquery's solution for representing these types of schemas, but the value IMO comes from having any reasonable JSON representation of the schema!
I think a simple JSON file will be most useful for these reasons:
Key questions for schema:
Example schema, highlighting need to represent sub-records (this is in yaml):
- name: currentDate
type: STRING
- name: feedEndDate
type: STRING
- name: suggestedExpirationDate
type: RECORD
fields:
- name: localDate
type: RECORD
fields:
- name: year
type: INTEGER
- name: month
type: INTEGER
- name: day
type: INTEGER
Feature request
Is your feature request related to a problem? Please describe.
Currently, the validator uses very clear schemas for structuring results internally. However, because it dumps results to a semi-structured JSON file, they're easy to read, but difficult to load into structured data sources (like SQL databases).
Proposed solution
Add an option to the CLI to dump a JSON file of notice schemas. Maybe something like this?:
Where the pattern here is...
Where SCHEMA_ENTRY is just copying BigQuery's schema format...
Describe alternatives you've considered
Currently, I'm just inferring the schema from a bunch of report data!