crowemi / target-s3

singer.io target for S3 built with @meltano SDK
https://hub.meltano.com/loaders/target-s3

parquet schema wrongly inferred #23

Closed prakharcode closed 11 months ago

prakharcode commented 1 year ago

If a nested object is sparse, meaning it occurs only infrequently in the data, the schema inferred for that data can be wrong.

e.g. the correct schema is ["attributes"]["teams"]["assigned"] = list<item: struct>>, but if there's no data for the above attribute, the way the schema is currently inferred here turns it into: list, which can be tricky if the files are then loaded into Spark for downstream consumption.

I see there's already a comment about building the schema from the JSON schema rather than inferring it, which I believe is the right way.
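To illustrate the difference, here is a minimal, standard-library-only sketch. It is not the target's actual code: `infer_type` and `type_from_json_schema` are hypothetical helpers, the `id` field is made up, and the type strings only imitate Arrow-style notation for readability. It shows that a data-driven guess has nothing to go on for an empty list, while a schema-driven type keeps the nested struct.

```python
from typing import Any


def infer_type(value: Any) -> str:
    """Naively infer a type string from a single value (data-driven)."""
    if isinstance(value, dict):
        fields = ", ".join(f"{k}: {infer_type(v)}" for k, v in value.items())
        return f"struct<{fields}>"
    if isinstance(value, list):
        # An empty list carries no information about its element type.
        item = infer_type(value[0]) if value else "null"
        return f"list<item: {item}>"
    if isinstance(value, bool):  # bool must be checked before int
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    return "null"


# Assumed mapping from JSON Schema primitive types to type names.
_PRIMITIVES = {
    "string": "string",
    "integer": "int64",
    "number": "double",
    "boolean": "bool",
    "null": "null",
}


def type_from_json_schema(schema: dict) -> str:
    """Derive a type string from a JSON Schema fragment (schema-driven)."""
    json_type = schema.get("type")
    if isinstance(json_type, list):
        # Nullable declarations such as ["null", "array"]: use the non-null type.
        json_type = next((t for t in json_type if t != "null"), "null")
    if json_type == "object":
        props = schema.get("properties", {})
        fields = ", ".join(
            f"{name}: {type_from_json_schema(sub)}" for name, sub in props.items()
        )
        return f"struct<{fields}>"
    if json_type == "array":
        return f"list<item: {type_from_json_schema(schema.get('items', {}))}>"
    return _PRIMITIVES.get(json_type, "null")


# Hypothetical stream schema and a sparse record with no assigned teams.
schema = {
    "type": "object",
    "properties": {
        "assigned": {
            "type": ["null", "array"],
            "items": {"type": "object", "properties": {"id": {"type": "string"}}},
        }
    },
}
sparse_record = {"assigned": []}

inferred = infer_type(sparse_record)
declared = type_from_json_schema(schema)
print(inferred)
print(declared)
```

With the schema-driven approach, every file gets the same column type regardless of whether a given batch happens to contain data for the sparse attribute.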

Just opening this issue so I can pick it up later.

crowemi commented 1 year ago

Hey @prakharcode 👋 -- yes, I believe this is the same as #17. You could check out this branch, which has some starter code for validating and creating the data frame based on inputs. I never quite got it where I wanted it.