Open bschifferer opened 1 year ago
Easy way to create a schema file (from existing parquet files) without running NVT
When you load a Parquet dataset with merlin.io.Dataset, it does its best to load an existing schema or infer one for you. That doesn't require running an NVT Workflow, but it also only gives you a bare minimum schema with column names and dtypes (which is all we can infer that way).
Easy way to modify a schema file: it would be great to convert a schema object to JSON and/or load a schema file from JSON.
This is already possible via TensorflowMetadata.to_json()
@karlhigley not sure I understood your comment about "load an existing schema". Where will this schema come from if we don't use an NVT workflow?
When we read a raw parquet file with merlin.io.Dataset, we get basically nothing beyond column names and dtypes.
I guess you mean we can export this minimal schema to disk, so that we will have a schema.pbtxt saved. But then, what about stats like cardinality etc.? We should find an easy way to add them to the schema.pbtxt file, and the tags for sure :)
Where will this schema come from if we dont use NVT workflow?
You can either write one in Python using the Schema/ColumnSchema API or write one in JSON by hand, and save the resulting file next to the Parquet file(s) on disk.
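As a minimal sketch of the "write one in JSON by hand" option: the snippet below builds a small schema as a plain dict and saves it next to the data. The field names loosely follow the TensorFlow Metadata layout that Merlin's schema files are based on, but they are illustrative only; the real on-disk format may differ, and in practice you'd construct this with merlin.schema.Schema/ColumnSchema and let the library serialize it.

```python
import json

# Hand-written minimal schema for two columns. Field names below
# (feature, intDomain, annotation/tag) are illustrative, not a
# guaranteed match for Merlin's actual serialization format.
schema = {
    "feature": [
        {
            "name": "user_id",
            "type": "INT",
            # Cardinality/domain info you already know about your data:
            "intDomain": {"min": 0, "max": 10000, "isCategorical": True},
            "annotation": {"tag": ["categorical", "user_id"]},
        },
        {
            "name": "price",
            "type": "FLOAT",
            "annotation": {"tag": ["continuous"]},
        },
    ]
}

# Save it alongside the Parquet file(s) on disk.
with open("schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```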
when we read a raw parquet file with merlin.io.Dataset we get basically nothing.
Yes, what you've shown above is the most complete schema we can infer from the Parquet files without actually reading the full dataset and running operators that compute stats over it. As above, if you want to add more to that without running a Workflow, you can either annotate the Schema in Python and save it to disk, or save the schema as a JSON file and hand-edit the file.
but then, what about stats like cardinality etc? we should find an easy way to add them in the schema.pbtxt file, and the tags for sure :)
I mean...we have a pretty easy way to add them, which is to run a Workflow that computes stats over the dataset. I don't think there's an easy way to do it that doesn't boil down to "hand-edit a file to include information you already know" or "process the dataset to compute new information to include."
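To make the "hand-edit a file to include information you already know" option concrete, here is a small stdlib-only sketch: it takes an already-saved schema (represented here as an in-memory dict with illustrative field names) and adds a tag plus a cardinality hint to one column, without rerunning any NVT Workflow. Nothing here computes stats from the dataset; the numbers are things the user already knows.

```python
import json

# A previously saved minimal schema (field names are illustrative).
schema = {
    "feature": [
        {"name": "item_id", "type": "INT", "annotation": {"tag": []}},
    ]
}

# Post-hoc annotation: attach a tag and a known cardinality to item_id.
for feature in schema["feature"]:
    if feature["name"] == "item_id":
        feature["annotation"]["tag"].append("categorical")
        # This cardinality is user-supplied knowledge, not computed here.
        feature["intDomain"] = {"min": 0, "max": 49999, "isCategorical": True}

# Write the edited schema back to disk.
with open("schema_edited.json", "w") as f:
    json.dump(schema, f, indent=2)
```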
@karlhigley regarding:
I mean...we have a pretty easy way to add them, which is to run a Workflow that computes stats over the dataset. I don't think there's an easy way to do it that doesn't boil down to "hand-edit a file to include information you already know" or "process the dataset to compute new information to include."
Indeed, we do that with workflow.fit(), but what'll happen in case users do not want to use a workflow at all?
Problem:
Merlin is a framework with many libraries in its ecosystem, and it is designed so that those libraries are well connected. However, as a user, I want to be able to try out a library without many dependencies. For example, I want to try out Merlin Models without NVTabular. This is currently not possible, because I need a schema file, which is not easy to create.
I use NVTabular for feature engineering, but I forgot to add a tag. My dataset is very large, and I do not want to rerun the full pipeline just to add a tag to a column. There should be an easy way to modify a schema file without running the NVT pipeline.
Goal:
New Functionality
Example: it would be great to convert a schema object to JSON and/or load a schema file from JSON.
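The requested round-trip could look something like the sketch below. The Schema/ColumnSchema classes here are hypothetical stand-ins (they only share names with the real merlin.schema classes, whose API may differ); the point is just to show a to_json()/from_json() pair that preserves names, dtypes, and tags.

```python
import json
from dataclasses import dataclass, field
from typing import List

# Hypothetical stand-ins for merlin.schema.ColumnSchema/Schema,
# sketching the proposed JSON round-trip.

@dataclass
class ColumnSchema:
    name: str
    dtype: str
    tags: List[str] = field(default_factory=list)

@dataclass
class Schema:
    columns: List[ColumnSchema]

    def to_json(self) -> str:
        # Serialize each column's fields as a JSON list of objects.
        return json.dumps([vars(c) for c in self.columns])

    @classmethod
    def from_json(cls, text: str) -> "Schema":
        # Rebuild the schema from the same JSON layout.
        return cls([ColumnSchema(**c) for c in json.loads(text)])

schema = Schema([ColumnSchema("user_id", "int64", ["categorical"])])
restored = Schema.from_json(schema.to_json())
```

With an API like this, a user could save the JSON next to their Parquet files, hand-edit tags or stats, and load it back without touching NVTabular.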