aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License

Issue reading parquet from a stream #476

Open totalgit74 opened 4 months ago

totalgit74 commented 4 months ago

Issue description

Issue

When reading parquet from a web API stream where the data is served from Python, I am getting numeric values for dates. I have outlined two scenarios below: one where the data is returned as the incorrect type, and one where it is the correct type. I'm wondering whether this is uniquely a web-stream vs file-stream issue or whether there is something else at play.

The end-point is an internal corporate one so there's no chance of replication outside the network.

The following packages are being used: Flurl (4.0.0), Flurl.Http (4.0.2), Parquet.Net (4.23.4).

The project is C# on .NET Framework 4.8.1, running on Windows 10 Enterprise 10.0.19044.

Scenarios

Reading a parquet stream from the web -> incorrect types

If I perform the following in Python

pd.read_parquet(url, storage_options={"Accept": "stream/parquet"})

I get a dataframe as expected with the first column as dates of type datetime64[ns]

If I get a stream in C# using Flurl (a fluent wrapper around System.Net.Http) with

var df = url
    .WithHeader("accept", "stream/parquet")
    .GetStreamAsync(HttpCompletionOption.ResponseContentRead)
    .Result
    .ReadParquetAsDataFrameAsync()
    .Result;

I end up with the first column as dates of type System.Int64.
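For context, those Int64 values look like raw TIMESTAMP(NANOS) counts, i.e. nanoseconds since the Unix epoch (the schema dump further down in this thread is consistent with that). A minimal, hypothetical conversion sketch (the class and method names are mine, not part of Parquet.Net):

```csharp
using System;

static class UnixNanos
{
    // Assumption: the Int64 values are nanoseconds since the Unix epoch,
    // matching the INT64 / TIMESTAMP(NANOS) logical type reported later
    // in this thread.
    private static readonly DateTime Epoch =
        new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc); // usable on .NET Framework 4.8

    public static DateTime ToDateTime(long nanos) =>
        Epoch.AddTicks(nanos / 100); // 1 tick = 100 ns; sub-tick precision is truncated
}
```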

Reading a file stream -> correct types

However, if I store the streamed parquet as a parquet file using Python

result = pd.read_parquet(url, storage_options={"Accept": "stream/parquet"})
result.to_parquet('./temp.parquet')

then read it back in using .NET

using (var stream = File.OpenRead("<< path to temp.parquet >>"))
{
    var table = stream.ReadParquetAsDataFrameAsync().Result;
}

I get the first column as type System.DateTime as expected.

aloneguid commented 4 months ago

The Parquet format is not streamable by design. You need to have the entire file available for random access.
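(Parquet readers need to seek to the footer first, which is why random access matters. One common workaround, sketched here under the assumption that you are happy to hold the whole file in memory, is to buffer the HTTP response into a seekable MemoryStream before handing it to the reader; the method name is illustrative, not from this thread.)

```csharp
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

static class ParquetDownload
{
    // Download the full response body into a seekable in-memory stream,
    // so a Parquet reader can seek to the footer and row groups freely.
    public static async Task<MemoryStream> DownloadSeekableAsync(HttpClient client, string url)
    {
        using (HttpResponseMessage response = await client.GetAsync(
            url, HttpCompletionOption.ResponseContentRead))
        {
            response.EnsureSuccessStatusCode();
            var buffer = new MemoryStream();
            await response.Content.CopyToAsync(buffer);
            buffer.Position = 0; // rewind so the reader starts from the beginning
            return buffer;
        }
    }
}
```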

totalgit74 commented 4 months ago

I think you've skimmed this and missed the point. Since I wait for the entire stream to complete from the API, as far as the code is concerned it should be no different from a file stream; to all intents and purposes, it is a file download.

  1. Pandas can latch onto the feed from the web API correctly.
  2. Parquet.Net can also do so, since I wait for the entire stream to be present, which makes it no different from a file stream at that point. It handles the stream correctly but has an issue with the datetime64 column; all other data is decoded correctly.
  3. Parquet.Net does not have an issue if pandas persists the file for it.

It's as if the stream.ReadParquetAsDataFrameAsync method needs to be able to accept a ParquetOptions instance so that TreatLargeIntegersAsDates can be set to true, which may well solve the problem. I believe ParquetReader can accept such a parameter when reading into its own table type, but this extension method does not.
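For illustration, here is roughly what reading via ParquetReader with explicit options looks like. Caveats: the option name below is the one mentioned in this thread and should be checked against the ParquetOptions class in your Parquet.Net version, and the method is only a sketch of the approach, not code from the library's docs.

```csharp
using System.IO;
using System.Threading.Tasks;
using Parquet;

static class ParquetOptionsSketch
{
    // Sketch: bypass the DataFrame extension method and use ParquetReader
    // directly, so that ParquetOptions can be supplied.
    public static async Task ReadWithOptionsAsync(Stream seekableStream)
    {
        // Property name taken from this thread; verify it exists in your
        // Parquet.Net version before relying on it.
        var options = new ParquetOptions { TreatLargeIntegersAsDates = true };

        using (ParquetReader reader = await ParquetReader.CreateAsync(seekableStream, options))
        {
            for (int i = 0; i < reader.RowGroupCount; i++)
            {
                using (ParquetRowGroupReader groupReader = reader.OpenRowGroupReader(i))
                {
                    // Read columns here, e.g. via groupReader.ReadColumnAsync(...).
                }
            }
        }
    }
}
```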

totalgit74 commented 4 months ago

As some extra information, I've debugged into the Parquet.Net code and found the following in Parquet.Meta.FileMetaData, located here: https://github.com/aloneguid/parquet-dotnet/blob/d8febde144ae4f2424947caa90a625c4c9d7ab49/src/Parquet/ParquetReader.cs#L39-L47. Under the Schema field at index [1], the type is "INT64", and the LogicalType has a value against the TIMESTAMP entry of Parquet.Meta.TimestampType with unit NANOS.

Let me know if there is any other such data that may be of use to you.

Parquet.Meta.FileMetaData KeyValueMetaData field with key "pandas"

{
    "index_columns": [
        {
            "kind": "range",
            "name": null,
            "start": 0,
            "stop": 175296,
            "step": 1
        }
    ],
    "column_indexes": [
        {
            "name": null,
            "field_name": null,
            "pandas_type": "unicode",
            "numpy_type": "object",
            "metadata": {
                "encoding": "UTF-8"
            }
        }
    ],
    "columns": [
        {
            "name": "ts",
            "field_name": "ts",
            "pandas_type": "datetime",
            "numpy_type": "datetime64[ns]",
            "metadata": null
        },
        ...
    ],
    "creator": {
        "library": "pyarrow",
        "version": "15.0.0"
    },
    "pandas_version": "2.2.0"
}
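The "pandas" key/value blob above can also be dumped from C# for debugging. This sketch assumes ParquetReader exposes the file's key/value metadata as a string dictionary named CustomMetadata (true in recent Parquet.Net versions, but verify against the version you are using):

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using Parquet;

static class PandasMetadataDump
{
    // Print the raw JSON that pyarrow/pandas wrote into the file's
    // key/value metadata under the "pandas" key, if present.
    public static async Task DumpAsync(string path)
    {
        using (Stream stream = File.OpenRead(path))
        using (ParquetReader reader = await ParquetReader.CreateAsync(stream))
        {
            // Assumption: CustomMetadata is the key/value metadata dictionary.
            if (reader.CustomMetadata != null &&
                reader.CustomMetadata.TryGetValue("pandas", out string json))
            {
                Console.WriteLine(json);
            }
        }
    }
}
```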