aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License

[BUG]: Big Endian Guid problem #496

Closed: anatoliy-savchak closed this 2 months ago

anatoliy-savchak commented 3 months ago

Library Version

4.23.4

OS

Windows

OS Architecture

64 bit

How to reproduce?

  1. Have a parquet file that contains Guid fields generated by other means, e.g. SQL Server CETAS. These are big-endian GUIDs.
  2. Deserialize it. The Guid values come back incorrect.
  3. A fix could be to add a new option, e.g. UseBigEndianForGuids.
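For context, the mismatch can be demonstrated in plain .NET, independent of Parquet (a sketch using the GUID from the repro below): `Guid.ToByteArray()` emits the first three fields little-endian, while the Parquet UUID logical type stores all 16 bytes big-endian, so a reader that ignores the difference scrambles the first eight bytes.

```csharp
using System;

var guid = new Guid("15A2501E-4899-4FF8-AF51-A1805FE0718F");

// ToByteArray() is mixed-endian: Data1/Data2/Data3 come out little-endian.
byte[] mixed = guid.ToByteArray();
Console.WriteLine(BitConverter.ToString(mixed, 0, 4)); // 1E-50-A2-15, not 15-A2-50-1E

// Simulate a big-endian (spec-compliant) byte sequence being fed to the
// mixed-endian Guid constructor, as happens when deserializing this file.
byte[] bigEndian = guid.ToByteArray();
Array.Reverse(bigEndian, 0, 4); // Data1 (int)
Array.Reverse(bigEndian, 4, 2); // Data2 (short)
Array.Reverse(bigEndian, 6, 2); // Data3 (short)
Console.WriteLine(new Guid(bigEndian)); // 1e50a215-9948-f84f-af51-a1805fe0718f
```

The last line shows the kind of scrambled value the deserializer returns: the first three field groups are byte-swapped, while the final eight bytes survive intact.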

Failing test

C# script in Polyglot Notebooks, but it behaves the same in a C# console app:
#r "nuget:Parquet.Net" // remove for C# Console app

var msg = await new System.Net.Http.HttpClient().GetAsync("https://github.com/anatoliy-savchak/parquet-dotnet-guids/raw/main/cetas4.parquet");
Parquet.Rows.Table tab = await Parquet.ParquetReader.ReadTableFromStreamAsync(await msg.Content.ReadAsStreamAsync());
Console.WriteLine("{'Id': 15A2501E-4899-4FF8-AF51-A1805FE0718F} - Expected, but found:");
Console.WriteLine(tab.ToString());
anatoliy-savchak commented 3 months ago

Adding the CETAS script for anybody interested:

drop table if exists #mytable;
create table #mytable(Id uniqueidentifier);
insert into #mytable(Id) values ('15A2501E-4899-4FF8-AF51-A1805FE0718F');
--drop external table cetas4;
create external table cetas4
with (
    location = 'cetas4/',
    data_source = DataSourceDatalake,
    file_format = ParquetUncompressed
)
as
select top 1 * from #mytable;
aloneguid commented 3 months ago

Can you provide an example of this parquet file?

anatoliy-savchak commented 3 months ago

> Can you provide an example of this parquet file?

Yes, sure: https://github.com/anatoliy-savchak/parquet-dotnet-guids/raw/main/cetas4.parquet

aloneguid commented 3 months ago

Thanks, attaching here for convenience: cetas4.zip

anatoliy-savchak commented 3 months ago

My current workaround:

    public async Task WriteDataToParquetFile<T>(string relativePath, string fileName, IEnumerable<T> data)
    {
        if (BitConverter.IsLittleEndian)
        {
            // Byte-swap every Guid/Guid? property in place, so that after the
            // library writes the value little-endian, the file ends up holding
            // the big-endian bytes that PolyBase expects.
            var guidProperties = typeof(T).GetProperties()
                .Where(p => p.PropertyType == typeof(Guid) || p.PropertyType == typeof(Guid?));
            foreach (var prop in guidProperties)
            {
                foreach (var item in data)
                {
                    var value = (Guid?)prop.GetValue(item);
                    if (value != null)
                    {
                        var guidBytes = value.Value.ToByteArray();
                        Array.Reverse(guidBytes, 0, 4); // Data1 (int)
                        Array.Reverse(guidBytes, 4, 2); // Data2 (short)
                        Array.Reverse(guidBytes, 6, 2); // Data3 (short)
                        prop.SetValue(item, new Guid(guidBytes));
                    }
                }
            }
        }

        // Serialize data to parquet file
        using var memoryStream = new MemoryStream();
        await ParquetSerializer.SerializeAsync(data, memoryStream);
        memoryStream.Position = 0;

In addition, .NET 8 has a Guid constructor that takes a bigEndian argument.

image
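To illustrate (a sketch, not part of the workaround above): the .NET 8 overloads make the manual `Array.Reverse` shuffling unnecessary, since both the constructor and `ToByteArray` accept a `bigEndian` flag.

```csharp
using System;

var guid = new Guid("15A2501E-4899-4FF8-AF51-A1805FE0718F");

// Get the spec-compliant big-endian byte sequence directly (.NET 8+).
byte[] big = guid.ToByteArray(bigEndian: true);

// And parse big-endian bytes (e.g. read from a Parquet UUID column)
// back into a Guid without any manual byte swapping.
Guid roundTripped = new Guid(big, bigEndian: true);

Console.WriteLine(roundTripped == guid); // True
```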

aloneguid commented 3 months ago

Spark 3.2 can't even read this file:

Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: FIXED_LEN_BYTE_ARRAY (UUID)

although it was generated using a recent version of parquet-cpp:

image

We can fix it here, but you might have issues with other tools. Is it worth converting the GUID to a string in mssql for export as a workaround until this is fixed?

anatoliy-savchak commented 3 months ago

Good point, thanks @aloneguid !

We typically save data from the C# backend into parquet in Azure Data Lake Storage, and then read it from SQL via an external data source or OPENROWSET. SQL Server / Azure SQL DB / Azure SQL Managed Instance / SQL Server DW / Synapse serverless SQL pool etc. all use the PolyBase engine internally, which loads and saves GUIDs big-endian.

We are moving storage into ADS, and converting all the DTOs' Guids into strings is somewhat tedious, plus there is the additional overhead of converting varchar into uniqueidentifier on the SQL side.

Currently we are going with the workaround above, but I hope a fix via an option is easy enough that we can keep using your awesome library! :)

anatoliy-savchak commented 3 months ago

Btw, the Spark issue was supposed to be fixed: https://github.com/apache/iceberg/issues/4038

aloneguid commented 3 months ago

Thanks. Actually, the specification explicitly says UUIDs should be encoded as big-endian, so it's a bug here.

aloneguid commented 3 months ago

@anatoliy-savchak does this look correct?

image

aloneguid commented 3 months ago

The fix has just been released in 4.23.5; hopefully it works for you.

anatoliy-savchak commented 3 months ago

@aloneguid amazing, thank you!!! :)

aloneguid commented 3 months ago

I'll pass it on to my Copilot (GitHub). No worries.