parquet schema wrongly inferred

crowemi / target-s3

singer.io target for S3 built with @meltano SDK


If a nested object is sparse, meaning it occurs only infrequently in the data, the schema inferred for it can be wrong.

E.g. the correct schema is ["attributes"]["teams"]["assigned"] = list<item: struct<...>>, but if a batch contains no data for that attribute, the way the schema is currently inferred here reduces it to a bare list with no struct element type. That can be tricky if the files are then loaded into Spark for downstream consumption.

I see there's already a comment about building the schema from the JSON schema rather than inferring it, which I believe is the right approach.

Just opening this issue so I can pick it up later.

parquet schema wrongly inferred #23