aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License

Issues converting to CSV #475

Open Haeri opened 4 months ago

Haeri commented 4 months ago

Issue description

I am trying to find a good way to convert my parquet file into a CSV file. I found a mention of CSV format in Row.cs, in the ToString method.

By iterating through all rows I was able to get some initial success:

private string ConvertTableToCsv(Table table)
{
    int maxRows = 10000000;

    using (var writer = new StringWriter())
    {
        int i = 0;
        foreach (Row row in table)
        {
            if (i >= maxRows)
                break;
            string csvRow = row.ToString("c", i);
            writer.WriteLine(csvRow);
            i++;
        }
        return writer.ToString();
    }
}

However, there are a lot of edge cases that are not handled properly, such as commas in strings, arrays, structs, etc. I've created a test dataset with different types and exported it once with pandas.to_csv and once with the aforementioned function:

parquet-dotnet: ToString("c")

string,object,int64,int32,int16,int8,uint8,float64,float32,bool,datetime64[ns],timedelta64[ns],category,array,dict
apple,a,1,4,7,7,10,1.1,4.4,True,1672531200000000000,86400000000000,cat,[4.4,5.5,6.6],1,y
ba"na"na,b,2,5,8,8,11,2.2,5.5,False,1675209600000000000,172800000000000,dog,[4.4,5.5,6.6],1,y
orange,orange,c,3,6,9,9,12,3.3,6.6,True,1677628800000000000,259200000000000,bird,[4.4,5.5,6.6],1,y

pandas: to_csv

string,object,int64,int32,int16,int8,uint8,float64,float32,bool,datetime64[ns],timedelta64[ns],category,array,dict
apple,a,1,4,7,7,10,1.1,4.4,True,2023-01-01,1 days,cat,[4.4 5.5 6.6],"{'x': 1, 'y': 'y'}"
"ba""na""na",b,2,5,8,8,11,2.2,5.5,False,2023-02-01,2 days,dog,[4.4 5.5 6.6],"{'x': 1, 'y': 'y'}"
"orange,orange",c,3,6,9,9,12,3.3,6.6,True,2023-03-01,3 days,bird,[4.4 5.5 6.6],"{'x': 1, 'y': 'y'}"

Issues

  1. Quotation marks inside a cell are detected and escaped: ba"na"na -> "ba""na""na"
  2. Commas inside a cell are detected and the entire cell is put into quotation marks: orange,orange -> "orange,orange"
  3. Arrays are not comma-separated but instead separated with a space: [4.4,5.5,6.6] -> [4.4 5.5 6.6]
  4. Dicts don't seem to work at all
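(For reference, rules 1 and 2 above are standard RFC 4180 quoting, which Python's csv module, the machinery behind the pandas output being compared here, implements out of the box. A minimal sketch reproducing the two cells in question:)

```python
import csv
import io

def to_csv_row(cells):
    """Write one row with RFC 4180-style quoting and return it as a string."""
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerow(cells)
    return buf.getvalue().rstrip("\r\n")

# Quotes inside a cell are doubled and the whole cell is wrapped in quotes:
print(to_csv_row(['ba"na"na', "b"]))       # "ba""na""na",b

# A comma inside a cell forces the whole cell into quotes:
print(to_csv_row(["orange,orange", "c"]))  # "orange,orange",c
```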

Would appreciate any advice if there is a better way to convert to CSV without me having to manually preprocess the data.

aloneguid commented 4 months ago

This library does not convert to CSV, as that is outside of its scope. But as a simple example, have a look at https://github.com/aloneguid/parquet-dotnet/blob/master/src/Parquet.Floor/Controllers/ParquetToCsvConverter.cs. As you mentioned, CSV has a lot of edge cases, and you would need an external library like CsvHelper to handle them safely.

aloneguid commented 4 months ago

Also see https://aloneguid.github.io/parquet-dotnet/utilities.html#converting-to-flat-format

Haeri commented 4 months ago

Ok, thanks a lot! The CSV looks a lot better now. However, the dict/struct type is now being omitted completely. Not sure why this is happening.

Using pandas to write parquet file with the following data:

data = {
    'string':["apple", "ba\"na\"na", "orange,orange"],
    'object': ['a', 'b', 'c'],
    'int64': np.array([1, 2, 3], dtype='int64'),
    'int32': np.array([4, 5, 6], dtype='int32'),
    'int16': np.array([7, 8, 9], dtype='int16'),
    'int8': np.array([7, 8, 9], dtype='int8'),
    'uint8': np.array([10, 11, 12], dtype='uint8'),
    'float64': np.array([1.1, 2.2, 3.3], dtype='float64'),
    'float32': np.array([4.4, 5.5, 6.6], dtype='float32'),
    'bool': [True, False, True],
    'datetime64[ns]': pd.to_datetime(['2023-01-01', '2023-02-01', '2023-03-01']),
    'timedelta64[ns]': pd.to_timedelta(['1 days', '2 days', '3 days']),
    'category': pd.Categorical(['cat', 'dog', 'bird']),
    'array[float]': [[4.4, 5.5, 6.6], [4.4, 5.5, 6.6], [4.4, 5.5, 6.6]],
    'array[string]': [["hello", "world", "helloworld"], ["peter parker", "peter,parker", "peter\"parker\""], ["", ",", " "]],
    'dict': [{'x': 1, 'y': 'y'}, {'x': 1, 'y': 'y'}, {'x': 1, 'y': 'y'}],
}

Here is the parquet file in the viewer: (screenshot)

And this is the output when converting to CSV

string,object,int64,int32,int16,int8,uint8,float64,float32,bool,datetime64[ns],timedelta64[ns],category,array[float],array[string],dict
apple,a,1,4,7,7,10,1.1,4.4,True,1672531200000000000,86400000000000,cat,4.4;5.5;6.6,hello;world;helloworld
"ba""na""na",b,2,5,8,8,11,2.2,5.5,False,1675209600000000000,172800000000000,dog,4.4;5.5;6.6,"peter parker;peter,parker;peter""parker"""
"orange,orange",c,3,6,9,9,12,3.3,6.6,True,1677628800000000000,259200000000000,bird,4.4;5.5;6.6,";,; "

As you can see, the "dict" header is created, but no data is written.

aloneguid commented 4 months ago

This is expected, as CSV does not support complex types. There is no standard that can encode dictionaries to text. CSV is very dumb and plain, so that's probably the best we can do without some pre-processing.
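(As a sketch of such pre-processing, in Python to mirror the pandas side of the comparison above: serializing complex values to JSON strings before handing rows to a CSV writer keeps each of them in a single, properly quoted cell. The helper names here are illustrative, not part of any library:)

```python
import csv
import io
import json

def flatten_cell(value):
    """Render lists and dicts as JSON strings; pass scalars through unchanged."""
    if isinstance(value, (list, dict)):
        return json.dumps(value)
    return value

def rows_to_csv(rows):
    """Write rows to CSV text, JSON-encoding any complex cells first."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow([flatten_cell(c) for c in row])
    return buf.getvalue()

print(rows_to_csv([["apple", [4.4, 5.5, 6.6], {"x": 1, "y": "y"}]]))
# apple,"[4.4, 5.5, 6.6]","{""x"": 1, ""y"": ""y""}"
```

The same idea carries over to the C# side: run struct/list values through a JSON serializer first, then hand the resulting strings to whatever CSV writer (e.g. CsvHelper) does the quoting.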

Haeri commented 4 months ago

Ok, I see, thanks for the info! Still a bit surprising to learn that complex types just disappear rather than at least being converted to a JSON string. Could you give me some hints on how I could do the JSON conversion for complex types on my own?

On a related note: I am also reading the parquet file and sending it to my frontend application to be displayed in a table, and I am having a similar issue there. Right now I am manually reading everything and putting it into a Dictionary. Is there maybe an easier and more robust way? The types don't have to match; having everything as a flattened string would work fine, since I only have to display it in an HTML table.

Here is my code:

string datasetPath = "x.parquet";
if (!File.Exists(datasetPath))
{
    throw new Exception("Resource not found.");
}

using Stream fs = File.OpenRead(datasetPath);
using ParquetReader reader = await ParquetReader.CreateAsync(fs);
long totalRowsRead = 0;

Dictionary<string, ColumnData<object>> structuredData = [];

for (int i = 0; i < reader.RowGroupCount; i++)
{

    using (ParquetRowGroupReader rowGroupReader = reader.OpenRowGroupReader(i))
    {

        if (totalRowsRead > 200) break;

        CustomMetadata? customMetadata = JsonConvert.DeserializeObject<CustomMetadata>(reader.CustomMetadata["pandas"]);
        Dictionary<string, string> columnNumpyTypes = customMetadata?.columns.ToDictionary(col => col.name, col => col.numpy_type) ?? [];

        // Fall back to "{}" so a missing "column_statistics" entry deserializes to an empty dictionary instead of null
        Dictionary<string, ColumnStatistics> columnMetadata = JsonConvert.DeserializeObject<Dictionary<string, ColumnStatistics>>(reader.CustomMetadata.GetValueOrDefault("column_statistics", "{}")) ?? [];

        foreach (DataField df in reader.Schema.GetDataFields())
        {
            DataColumn dataColumn = await rowGroupReader.ReadColumnAsync(df);

            List<object> columnList = new(dataColumn.Data.Cast<object>());
            if ((totalRowsRead + columnList.Count) > 200)
            {
                // Keep only as many rows of this column as still fit under the 200-row cap
                int allowed = (int)(200 - totalRowsRead);
                columnList.RemoveRange(allowed, columnList.Count - allowed);
            }

            ColumnData<object> columnData = new()
            {
                Name = df.Path.ToString(),
                TypeName = columnNumpyTypes.GetValueOrDefault(df.Name, "undefined"),
                Statistics = new Statistics()
                {
                    DistinctCount = columnMetadata?.GetValueOrDefault(df.Name, null)?.unique_count ?? 0,
                    MaxValue = columnMetadata?.GetValueOrDefault(df.Name, null)?.max_value,
                    MinValue = columnMetadata?.GetValueOrDefault(df.Name, null)?.min_value,
                    NullCount = columnMetadata?.GetValueOrDefault(df.Name, null)?.num_nulls ?? 0,
                },
                Data = columnList
            };

            structuredData.Add(df.Path.ToString(), columnData);
        }

        totalRowsRead += rowGroupReader.RowCount;
    }
}

return structuredData;

But this gives me an incorrect number of columns: 2D arrays are flattened and therefore contain more rows than expected, structs are separated into multiple columns, etc. (screenshot)

aloneguid commented 1 month ago

Hi, sorry I missed this. You can already deserialize to untyped dictionaries. As an example, please look at the included Floor utility (a desktop parquet viewer), which uses this approach to read any file with an unknown schema.