If there's a nested object that is sparse in nature (i.e., it occurs infrequently in the data), then the inferred schema for that data will be wrong.
e.g. the correct schema: `["attributes"]["teams"]["assigned"] = list<item: struct>`
But if there's no data for the above attribute, then because of how the schema is currently inferred here, it becomes a plain `list`, which can be tricky if the files are then loaded into Spark for downstream consumption.
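The failure mode above can be sketched with a toy inferrer (hypothetical code, not this project's actual inference logic — just an illustration of why a sparse nested field loses its element type):

```python
# Naive schema inference only looks at the values that are present,
# so an empty (sparse) nested list collapses to a bare "list" with
# no element type.

def infer_type(value):
    """Naively infer a type label from a single JSON value."""
    if isinstance(value, dict):
        return {k: infer_type(v) for k, v in value.items()}
    if isinstance(value, list):
        if not value:
            return "list"  # no items to inspect -> element type is lost
        return f"list<item: {infer_type(value[0])}>"
    return type(value).__name__

# Record where the nested field is populated:
dense = {"attributes": {"teams": {"assigned": [{"id": 1, "role": "admin"}]}}}
# Record where the same field is sparse (empty here):
sparse = {"attributes": {"teams": {"assigned": []}}}

print(infer_type(dense)["attributes"]["teams"]["assigned"])
# -> list<item: {...struct...}>
print(infer_type(sparse)["attributes"]["teams"]["assigned"])
# -> list  (struct element type is gone)
```

Spark then sees two incompatible types for the same column across files, which is exactly the downstream pain described above.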
I see there's already a comment about building the schema from the JSON schema rather than inferring it, which I believe is the right way to go.
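A rough sketch of that approach (hypothetical — the schema snippet and type labels are illustrative, not this project's actual format): derive the type from the declared JSON Schema instead of from the data, so sparse fields keep their element type.

```python
# Map a (simplified) JSON Schema declaration to a type label.
# Because the type comes from the declaration, it does not matter
# whether the data actually contains any items for the field.

def type_from_json_schema(schema):
    t = schema.get("type")
    if t == "object":
        props = schema.get("properties", {})
        return {k: type_from_json_schema(v) for k, v in props.items()}
    if t == "array":
        return f"list<item: {type_from_json_schema(schema['items'])}>"
    return t  # primitive: "string", "integer", ...

assigned_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {"id": {"type": "integer"}, "role": {"type": "string"}},
    },
}

print(type_from_json_schema(assigned_schema))
# element type survives even when the data has no items for this field
```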
Just opening this issue so I can pick it up later.
Hey @prakharcode 👋 -- yes, I believe this is the same as #17. You could check out this branch, which has some starter code for validating and creating the data frame based on inputs. I never quite got it where I wanted it.