Open bschifferer opened 1 year ago
Easy way to create a schema file (from existing parquet files) without running NVT
When you load a Parquet dataset with merlin.io.Dataset, it does its best to load an existing schema or infer one for you. That doesn't require running an NVT Workflow, but it also only gives you a bare minimum schema with column names and dtypes (which is all we can infer that way).
Easy way to modify a schema file: it would be great to convert a schema object to JSON and/or load a schema file from JSON.
This is already possible via TensorflowMetadata.to_json()
@karlhigley not sure I understood your comment about "load an existing schema". Where will this schema come from if we don't use an NVT workflow?
When we read a raw parquet file with merlin.io.Dataset, we get basically nothing beyond column names and dtypes.
I guess you mean we can export this minimal schema to disk, so that we will have a schema.pbtxt saved. But then, what about stats like cardinality etc.? We should find an easy way to add them to the schema.pbtxt file, and the tags for sure :)
Where will this schema come from if we dont use NVT workflow?
You can either write one in Python using the Schema/ColumnSchema API or write one in JSON by hand, and save the resulting file next to the Parquet file(s) on disk.
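As a minimal sketch of the "write one in JSON by hand" option: the snippet below builds a small schema as a plain dict and saves it next to the data. The field names loosely follow the TensorFlow Metadata layout that Merlin's schema files are based on, but they are illustrative only; the real on-disk format may differ, and in practice you'd construct this with merlin.schema.Schema/ColumnSchema and let the library serialize it.

```python
import json

# Hand-written minimal schema for two columns. Field names below
# (feature, intDomain, annotation/tag) are illustrative, not a
# guaranteed match for Merlin's actual serialization format.
schema = {
    "feature": [
        {
            "name": "user_id",
            "type": "INT",
            # Cardinality/domain info you already know about your data:
            "intDomain": {"min": 0, "max": 10000, "isCategorical": True},
            "annotation": {"tag": ["categorical", "user_id"]},
        },
        {
            "name": "price",
            "type": "FLOAT",
            "annotation": {"tag": ["continuous"]},
        },
    ]
}

# Save it alongside the Parquet file(s) on disk.
with open("schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```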
when we read a raw parquet file with merlin.io.Dataset we get basically nothing.
Yes, what you've shown above is the most complete schema we can infer from the Parquet files without actually reading the full dataset and running operators that compute stats over it. As above, if you want to add more to that without running a Workflow, you can either annotate the Schema in Python and save it to disk, or save the schema as a JSON file and hand-edit the file.
but then, what about stats like cardinality etc? we should find an easy way to add them in the schema.pbtxt file, and the tags for sure :)
I mean...we have a pretty easy way to add them, which is to run a Workflow that computes stats over the dataset. I don't think there's an easy way to do it that doesn't boil down to "hand-edit a file to include information you already know" or "process the dataset to compute new information to include."
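To make the "hand-edit a file to include information you already know" option concrete, here is a small stdlib-only sketch: it takes an already-saved schema (represented here as an in-memory dict with illustrative field names) and adds a tag plus a cardinality hint to one column, without rerunning any NVT Workflow. Nothing here computes stats from the dataset; the numbers are things the user already knows.

```python
import json

# A previously saved minimal schema (field names are illustrative).
schema = {
    "feature": [
        {"name": "item_id", "type": "INT", "annotation": {"tag": []}},
    ]
}

# Post-hoc annotation: attach a tag and a known cardinality to item_id.
for feature in schema["feature"]:
    if feature["name"] == "item_id":
        feature["annotation"]["tag"].append("categorical")
        # This cardinality is user-supplied knowledge, not computed here.
        feature["intDomain"] = {"min": 0, "max": 49999, "isCategorical": True}

# Write the edited schema back to disk.
with open("schema_edited.json", "w") as f:
    json.dump(schema, f, indent=2)
```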
@karlhigley regarding:
I mean...we have a pretty easy way to add them, which is to run a Workflow that computes stats over the dataset. I don't think there's an easy way to do it that doesn't boil down to "hand-edit a file to include information you already know" or "process the dataset to compute new information to include."
Indeed, we do that with workflow.fit(), but what'll happen in case users do not want to use a workflow at all?
Problem:
Merlin is a framework with many libraries in its ecosystem, and it is designed so that those libraries are well connected. However, as a user, I want to be able to try out a library without many dependencies. For example, I want to try out Merlin Models without NVTabular. This is currently not possible, because I need a schema file, which is not easy to create.
I use NVTabular for feature engineering, but I forgot to add a tag. My dataset is very large, and I do not want to rerun the full pipeline just to add a tag to a column. There should be an easy way to modify a schema file without running the NVT pipeline.
Goal:
New Functionality
Example: it would be great to convert a schema object to JSON and/or load a schema file from JSON.
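The requested round-trip could look something like the sketch below. The Schema/ColumnSchema classes here are hypothetical stand-ins (they only share names with the real merlin.schema classes, whose API may differ); the point is just to show a to_json()/from_json() pair that preserves names, dtypes, and tags.

```python
import json
from dataclasses import dataclass, field
from typing import List

# Hypothetical stand-ins for merlin.schema.ColumnSchema/Schema,
# sketching the proposed JSON round-trip.

@dataclass
class ColumnSchema:
    name: str
    dtype: str
    tags: List[str] = field(default_factory=list)

@dataclass
class Schema:
    columns: List[ColumnSchema]

    def to_json(self) -> str:
        # Serialize each column's fields as a JSON list of objects.
        return json.dumps([vars(c) for c in self.columns])

    @classmethod
    def from_json(cls, text: str) -> "Schema":
        # Rebuild the schema from the same JSON layout.
        return cls([ColumnSchema(**c) for c in json.loads(text)])

schema = Schema([ColumnSchema("user_id", "int64", ["categorical"])])
restored = Schema.from_json(schema.to_json())
```

With an API like this, a user could save the JSON next to their Parquet files, hand-edit tags or stats, and load it back without touching NVTabular.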