aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License
600 stars 151 forks source link

[BUG]: JSON serializer doesn't produce valid json (table) #409

Closed OneCyrus closed 4 months ago

OneCyrus commented 1 year ago

Library Version

4.16.4

OS

Windows 11

OS Architecture

64 bit

How to reproduce?

  1. Deserialize multiple rows
using var parquetReader = await ParquetReader.CreateAsync(stream);
var table = await parquetReader.ReadAsTableAsync();

var json = table.ToString("j");
  1. the JSON returned is basically multiple JSON on each line whioch is not valid. The serializer should return an array with comma delimited rows.

instead of this

{ "a": 1, "b": "test" }
{ "a": 2, "b": "test" }
{ "a": 3, "b": "test" }

we should get this

[
  { "a": 1, "b": "test" },
  { "a": 2, "b": "test" },
  { "a": 3, "b": "test" }
]

Failing test

using var parquetReader = await ParquetReader.CreateAsync(stream);
var table = await parquetReader.ReadAsTableAsync();

var json = table.ToString("j");

var myTable = JsonSerializer.Deserialize<dynamic>(json);
aloneguid commented 1 year ago

It's actually a feature :) It produces output in json-lines format, more commonly used in big data platforms like Apache Spark. Having one document with massive array does not allow for easy streaming of JSON.

OneCyrus commented 1 year ago

isn't that supposed to be the serialization of rows? when i serialize a table I would expect to have a representation of the table and not multiple rows. would be great to have at least an option to switch between the modes.

aloneguid commented 1 year ago

You can call ToString() for each row and append those with comma separator.

OneCyrus commented 1 year ago

that's what we did. basically we did it the same way like you did in the ToString() method but removed the limitation with the level filter.

unfortunately that was just a workaround for deserializing parquet to c# classes. as the parquet deserializer had too many limitations we convert the parquet first to json and use the system.text.json deserializer to get our c# objects.

aloneguid commented 1 year ago

The performance of this will be terrible :( What kind of limitations did you encounter with class serializer by the way?

OneCyrus commented 1 year ago

the most important issues we have are:

aloneguid commented 1 year ago

Fair enough, thanks. Regarding the first, it will change in future. I'm planning to unify code for row api and class serializer, which will make row api more stable, but also allow declaring properties of object or List<object> or Dictionary<string, object> types for exactly this use case - mixing strong typing with quack typing. As for the second, you can mark properties with [JsonPropertyName] attribute. I may add custom resolvers for case-insensitive match, underscore removal and so on. No promises on timing though.