anatoliy-savchak closed this issue 2 months ago
Adding a CETAS example for anybody interested:

```sql
drop table if exists #mytable;
create table #mytable (Id uniqueidentifier);
insert into #mytable (Id) values ('15A2501E-4899-4FF8-AF51-A1805FE0718F');

--drop external table cetas4;
create external table cetas4
with (
    location = 'cetas4/',
    data_source = DataSourceDatalake,
    file_format = ParquetUncompressed
)
as
select top 1 * from #mytable;
```
Can you provide an example of this parquet file?
Yes, sure: https://github.com/anatoliy-savchak/parquet-dotnet-guids/raw/main/cetas4.parquet
Thanks, attaching cetas4.zip here for convenience.
My current workaround:

```csharp
public async Task WriteDataToParquetFile<T>(string relativePath, string fileName, IEnumerable<T> data)
{
    if (BitConverter.IsLittleEndian)
    {
        // Rewrite every Guid/Guid? property in place so that the first three
        // fields are stored big-endian, as the Parquet UUID logical type expects.
        var guidProperties = typeof(T).GetProperties()
            .Where(p => p.PropertyType == typeof(Guid) || p.PropertyType == typeof(Guid?));
        foreach (var prop in guidProperties)
        {
            foreach (var item in data)
            {
                var value = (Guid?)prop.GetValue(item);
                if (value != null)
                {
                    // Guid.ToByteArray() emits the first three fields little-endian;
                    // reverse each of them to get big-endian byte order.
                    var guidBytes = value.Value.ToByteArray();
                    Array.Reverse(guidBytes, 0, 4);
                    Array.Reverse(guidBytes, 4, 2);
                    Array.Reverse(guidBytes, 6, 2);
                    var newValue = new Guid(guidBytes);
                    prop.SetValue(item, newValue);
                }
            }
        }
    }

    // Serialize data to a parquet file.
    using var memoryStream = new MemoryStream();
    await ParquetSerializer.SerializeAsync(data, memoryStream);
    memoryStream.Position = 0;
    // ... (remainder of the method writes the stream to relativePath/fileName)
}
```
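For context, a minimal call site for this workaround might look like the sketch below. `OrderDto` and `writer` are illustrative names, not from the thread:

```csharp
// Hypothetical DTO: the workaround rewrites its Guid/Guid? properties in place.
public class OrderDto
{
    public Guid Id { get; set; }
    public Guid? CustomerId { get; set; }
    public string? Name { get; set; }
}

var rows = new List<OrderDto>
{
    new() { Id = Guid.NewGuid(), CustomerId = null, Name = "sample" },
};

// 'writer' is whatever service exposes the WriteDataToParquetFile method above.
await writer.WriteDataToParquetFile<OrderDto>("exports/orders", "orders.parquet", rows);
```

Note the in-place mutation: after the call, the `Guid` properties on the original DTOs hold the byte-swapped values, so callers should not reuse them as normal GUIDs afterwards.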
In addition, .NET 8 has a Guid constructor with a bigEndian argument.
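For illustration, a short sketch of the .NET 8 big-endian overloads (`Guid.ToByteArray(bool bigEndian)` and the matching constructor), which make the manual `Array.Reverse` calls in the workaround unnecessary; this assumes .NET 8 or later:

```csharp
using System;

var guid = new Guid("15A2501E-4899-4FF8-AF51-A1805FE0718F");

// Default .NET layout: the first three fields are little-endian.
byte[] mixedEndian = guid.ToByteArray();

// .NET 8: emit all 16 bytes in big-endian (RFC 4122) order, which is what
// the Parquet UUID logical type expects.
byte[] bigEndian = guid.ToByteArray(bigEndian: true);

// Round-trip the big-endian bytes back into a Guid.
var roundTripped = new Guid(bigEndian, bigEndian: true);
Console.WriteLine(roundTripped == guid); // True
```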
Spark 3.2 can't even read this file:

```
Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: FIXED_LEN_BYTE_ARRAY (UUID)
```

although it was generated using a recent version of parquet-cpp:
We can fix it here, but you might have issues with other tools. Is it worth converting GUID to string in mssql for export as a workaround until this is fixed?
Good point, thanks @aloneguid !
We typically save data from a C# backend into parquet in Azure Data Lake Storage, and then read it from SQL via an external data source or openrowset. SQL Server / Azure SQL DB / Azure SQL Managed Instance / SQL Server DW / serverless SQL pool (Synapse) etc. all use the PolyBase engine inside, which loads/saves GUIDs big-endian.
We are moving storage into ADLS, and converting all DTOs' Guids into strings is somewhat tedious, plus there is the additional overhead of converting varchar into uniqueidentifier on the SQL side.
Currently we are going to use the workaround above, but I hope the fix with Options is easy enough that we can keep using your awesome library! :)
Btw, the Spark issue was supposed to be fixed: https://github.com/apache/iceberg/issues/4038
Thanks. Actually, the specification explicitly says it should be encoded as big-endian, so it's a bug here.
@anatoliy-savchak does this look correct?
The fix was just released in 4.23.5; hopefully it works for you.
@aloneguid amazing, thank you!!! :)
I'll pass it on to my Copilot (GitHub). No worries.
Library Version: 4.23.4
OS: Windows
OS Architecture: 64 bit
How to reproduce? Failing test