aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License

Issue with certain stream sources #477

Open totalgit74 opened 4 months ago

totalgit74 commented 4 months ago

I think you may have missed the point of issue #476 and closed it in haste.

I wait for the entire stream from the API to complete, so as far as the code is concerned it should be no different from a file stream. To all intents and purposes it is a file download, and it has completed by that point.
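To illustrate what "waiting for the entire stream" amounts to, here is a minimal sketch in Python using only the standard library. It collects chunks from a non-seekable source into a seekable in-memory buffer and then checks the Parquet framing (the `PAR1` magic at both ends and the 4-byte little-endian footer length before the trailing magic). The helper names `buffer_stream` and `footer_length` are hypothetical and are not part of pandas or Parquet.net.

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"  # 4-byte marker at the start and end of a Parquet file


def buffer_stream(chunks):
    """Collect an iterable of byte chunks into a seekable in-memory buffer."""
    buf = io.BytesIO()
    for chunk in chunks:
        buf.write(chunk)
    buf.seek(0)  # rewind so a reader starts from the beginning
    return buf


def footer_length(buf):
    """Return the footer metadata length if the framing looks like Parquet, else None."""
    size = buf.seek(0, io.SEEK_END)
    if size < 12:  # too small for leading magic + footer length + trailing magic
        return None
    buf.seek(0)
    head = buf.read(4)
    buf.seek(-8, io.SEEK_END)
    tail = buf.read(8)  # 4-byte little-endian footer length, then the magic
    buf.seek(0)  # leave the buffer rewound for the next reader
    if head != PARQUET_MAGIC or tail[4:] != PARQUET_MAGIC:
        return None
    (length,) = struct.unpack("<I", tail[:4])
    return length
```

Once the download is fully buffered like this, a reader can seek to the footer exactly as it would on a file, which is why the source of the bytes should not matter at that point.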

  1. Pandas can read the feed from the web API directly and decodes all columns correctly.
  2. Parquet.net can also read it (I wait for the entire stream to be present, so it is no different from a file stream at that point). It handles the stream correctly and decodes all other data, but it has an issue with the datetime64 column.
  3. Parquet.net reads the file without issue if pandas persists it first.

It’s as if the stream.ReadParquetAsDataFrameAsync method needs to accept ParquetOptions so that TreatLargeIntegersAsDates can be set to true, which may well solve the problem. I believe ParquetReader accepts such a parameter when reading into its own table type, but this extension method, which returns a Microsoft.Data.Analysis.DataFrame, does not.
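For context on why a datetime64 column is sensitive to how integers are interpreted: pandas stores datetime64[ns] values as 64-bit integer counts of nanoseconds since the Unix epoch. The sketch below, in Python with only the standard library, shows how such a raw integer maps back to a timestamp. It assumes the column surfaces as raw epoch integers, which this issue does not confirm, and the unit actually written to the file may be milliseconds or microseconds rather than nanoseconds. The `from_epoch` helper is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Ticks per second for each unit a timestamp column might use.
TICKS_PER_SECOND = {"ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}


def from_epoch(value, unit="ns"):
    """Convert an integer count of `unit` since the Unix epoch to a UTC datetime.

    Python datetimes have microsecond resolution, so any sub-microsecond
    part of a nanosecond value is truncated.
    """
    per_second = TICKS_PER_SECOND[unit]
    seconds, remainder = divmod(value, per_second)
    micros = remainder * 1_000_000 // per_second
    return EPOCH + timedelta(seconds=seconds, microseconds=micros)
```

If the integers come back with the wrong unit assumed, the dates will be off by factors of a thousand, so it is worth checking one known value before converting a whole column.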

Originally posted by @totalgit74 in https://github.com/aloneguid/parquet-dotnet/issues/476#issuecomment-1956350077