elastacloud / parquet-usql

A custom extractor designed to read parquet for Azure Data Lake Analytics
MIT License

metadata corruption due to nullable/optional columns #30

Closed alexdeng closed 6 years ago

alexdeng commented 6 years ago

Expected Behaviour

Columns can be nullable.

Actual Behaviour

In the current implementation, `_ds` is reset every time `FlushDataSet` is called. Because the hasNull info in the column statistics used when the metadata is written to the footer comes only from the last batch (the last `_ds`), it can differ from the row groups written before, corrupting the metadata.

Steps to reproduce the behaviour

Use RowGroupSize = k where the first k values contain nulls and the last k values do not.
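The divergence described above can be illustrated with a plain C# sketch (this is not the Parquet.Net API; the arrays and the `HasNulls` helper are purely illustrative):

```csharp
// Hypothetical repro sketch: with RowGroupSize = k, the first flushed batch
// contains nulls and the last one does not, so the per-batch "has nulls"
// statistics disagree, and the footer only sees the last batch.
using System;
using System.Linq;

bool HasNulls(int?[] values) => values.Any(v => v == null);

var firstBatch = new int?[] { 1, null, 3 };  // flushed first
var lastBatch  = new int?[] { 4, 5, 6 };     // flushed last; becomes the footer's view

Console.WriteLine(HasNulls(firstBatch));  // True
Console.WriteLine(HasNulls(lastBatch));   // False -> footer claims "no nulls", but the file has them
```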

aloneguid commented 6 years ago

@alexdeng it's not an issue anymore with the latest parquet.net library; we need to update the reference.

aloneguid commented 6 years ago

all fixed

alexdeng commented 6 years ago

I don't think this fixed the issue. In the code, the dataset is manually flushed and reset to null.

  public void Add(IRow row, Stream targetStream)
  {
     if (_writer == null) BuildWriter(row.Schema, targetStream);

     if(_ds == null) _ds = new DataSet(_schema);

     _ds.Add(ToRow(row));

     if (_ds.Count >= RowGroupSize)
     {
        FlushDataSet();
     }
  }

  private void FlushDataSet()
  {
     if (_ds == null) return;

     _writer.Write(_ds);

     _ds = null;
  }

But the metadata is written in the footer when the writer is disposed, so the metadata is solely determined by the last batch of `_ds`, right?

Code from parquet.net:

  public void Dispose()
  {
     if (!_dataWritten) return;

     //finalize file
     long size = ThriftStream.Write(_meta.ThriftMeta);

     //metadata size
     Writer.Write((int)size);  //4 bytes

     //end magic
     WriteMagic();              //4 bytes

     Writer.Flush();
     Stream.Flush();
  }
aloneguid commented 6 years ago

The issue is fixed in the core library, not this repo. Please check that you have re-registered the assemblies and recompiled the U-SQL script.

alexdeng commented 6 years ago

Hi,

I might be missing something. First, I'm not using U-SQL :( — I ported your code into a custom outputter I use in Scope, a Microsoft-internal predecessor of U-SQL. I have already updated my parquet.net to the bleeding-edge alpha release. What I am reporting is specific to this repo, not the core library. The current logic of this repo writes the dataset piece by piece, but the metadata written at the end of the output file comes only from the last batch, potentially causing a mismatch between the metadata at the end of the file and the row groups written before it.

When I changed the logic to call FlushDataSet only once in Dispose, the issue went away. But that forces the outputter to hold the whole dataset in memory, potentially causing performance issues.

alexdeng commented 6 years ago

Just to be sure my understanding is correct: the metadata is stored in the SchemaElement, which is part of the DataSet object. The ParquetWriter reads things like whether a column has nulls directly from the column statistics in the flattened SchemaElements. So when we set `_ds = null`, we have to preserve that information somewhere?

aloneguid commented 6 years ago

hi @alexdeng, this was mostly happening because parquet.net tries to guess schema nullability from the values passed when writing, i.e. when you had no nulls, the native parquet schema is generated slightly differently to save disk space. This leads to some misunderstandings on the interface side and also to problems like the one you've described. Therefore the latest alpha forces you to specify whether you need nulls or not. This makes the library both simpler and easier for developers to understand. See more explanation here: https://github.com/elastacloud/parquet-dotnet/blob/master/doc/schema.md#null-values
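For illustration, declaring nullability explicitly as described above might look roughly like this (a hedged sketch following the linked schema doc; the exact type names varied across parquet.net alpha versions):

```csharp
// Sketch: nullability is declared in the schema up front instead of being
// guessed from the data. A nullable CLR type marks the column as optional.
var required = new DataField<int>("id");      // int  -> required column, no nulls allowed
var optional = new DataField<int?>("score");  // int? -> optional column, nulls allowed
```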

alexdeng commented 6 years ago

Great! This answered my concern. Thanks for the explanation!


alexdeng commented 6 years ago

@aloneguid, in the above doc I think it would be better to also mention that string is always treated as nullable/optional.
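That point can be shown in one line (again a hedged sketch in the style of the linked schema doc, not verbatim library code):

```csharp
// Sketch: string is a reference type, so there is no "string?" distinction here;
// a string column ends up optional/nullable without any extra annotation.
var name = new DataField<string>("name");  // always treated as optional
```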

aloneguid commented 6 years ago

Thanks @alexdeng, will add it to the docs.