westurner opened 1 year ago
FWIW, re: data validation these days: pydantic_schemaorg validates with the schema.org schema, which includes QuantitativeValue and Distribution. CSVW (CSV on the Web) is a standard for describing CSV in RDF. RDF has many representations: RDF/XML, Turtle (`.ttl`), JSON-LD (`.json`, `application/ld+json`), and RDFa (RDF-in-(HTML)-Attributes). Some applications, including search engines, work with at least bibliographic linked data, e.g. for subtypes of https://schema.org/CreativeWork such as https://schema.org/ScholarlyArticle, :Dataset, and :DataCatalog.
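As a concrete illustration (a hand-written sketch, not output from any of the tools above; the URL and values are hypothetical), a minimal schema.org `Dataset` description in JSON-LD can be built as a plain dict and serialized:

```python
import json

# Minimal schema.org Dataset in JSON-LD (hand-written example values)
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example measurements",
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/data.csv",  # hypothetical URL
    },
}
print(json.dumps(dataset, indent=2))
```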
Other existing standards for data schemas and/or validation: SDMX (pandaSDMX), W3C Data Cubes (pandas-datacube), JSON Schema (pydantic, react-jsonschema-form), and W3C SHACL (Schema.org).
https://github.com/lexiq-legal/pydantic_schemaorg generates templated pydantic `.py` source files containing validators for every `rdfs:Class` and `rdfs:Property` defined in a release of the https://schema.org/ meta-vocabulary.
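The generated classes are essentially pydantic models; a hand-written sketch in the same spirit (not the actual pydantic_schemaorg output, whose field sets are much larger) for schema.org `QuantitativeValue` might look like:

```python
from typing import Optional
from pydantic import BaseModel

class QuantitativeValue(BaseModel):
    # Simplified subset of schema.org/QuantitativeValue properties
    value: Optional[float] = None
    unitCode: Optional[str] = None  # e.g. a UN/CEFACT code like "KGM"
    unitText: Optional[str] = None

qv = QuantitativeValue(value="3.5", unitText="kg")  # "3.5" is coerced to float
print(qv.value, qv.unitText)
```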
For example: W3C SHACL.
- [ ] https://github.com/pandas-dev/pandas/issues/3402
`DataFrame.attrs` is a dict that anything can modify upon read, transformation, or write; and it may not be persisted by file formats that do not support an auxiliary metadata file.

```python
class DataFrameWithNonAttrsMetadata(pd.DataFrame):
    _metadata = ["additional_attrs", "prov"]
```
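A sketch of how `_metadata` attributes can survive operations that `attrs` might not, using pandas' documented subclassing hooks (`prov` and its value here are illustrative):

```python
import pandas as pd

class DataFrameWithNonAttrsMetadata(pd.DataFrame):
    # Attribute names listed in _metadata are propagated by __finalize__
    _metadata = ["additional_attrs", "prov"]

    @property
    def _constructor(self):
        # Make operations return this subclass so the _metadata
        # attributes are carried over to the result
        return DataFrameWithNonAttrsMetadata

df = DataFrameWithNonAttrsMetadata({"a": [1, 2, 3]})
df.prov = {"prov:wasAttributedTo": "ex:someAgent"}  # hypothetical provenance
print(df.copy().prov)
```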
W3C PROV is a Linked Data specification for describing data provenance: who, what, when, how, etc.
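For reference, a minimal provenance record in the PROV-JSON serialization (the `ex:` names are hypothetical) links an entity to the activity that generated it:

```python
import json

prov_doc = {
    "prefix": {"ex": "https://example.org/"},
    "entity": {"ex:cleaned_dataset": {}},
    "activity": {"ex:cleaning_run": {}},
    # who/what/how: the entity was generated by the activity
    "wasGeneratedBy": {
        "_:gen1": {"prov:entity": "ex:cleaned_dataset",
                   "prov:activity": "ex:cleaning_run"}
    },
}
print(json.dumps(prov_doc, indent=2))
```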
What does that mean for pandas, dataclasses, pyarrow, and optionally pydantic?
What should the API be for working with pandas, pyarrow, and dataclasses and/or pydantic?
Pandas 2.0 supports pyarrow for so many things now (e.g. `pd.read_*(..., dtype_backend="pyarrow")`), and pydantic does data validation with a drop-in `dataclasses.dataclass` replacement at `pydantic.dataclasses.dataclass`.
https://github.com/pydantic/pydantic/blob/main/docs/usage/dataclasses.md
`@pydantic.dataclasses.dataclass`
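A minimal sketch of the drop-in replacement in action (the field names are illustrative):

```python
from pydantic import ValidationError
from pydantic.dataclasses import dataclass

@dataclass
class Measurement:
    value: float
    unit: str

m = Measurement(value="3.5", unit="kg")  # the string is coerced to float
print(m.value)

try:
    Measurement(value="not a number", unit="kg")
except ValidationError:
    print("validation failed")
```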