Open totalgit74 opened 4 months ago
The Parquet format is not streamable by design. You need to have the entire file available for random access.
I think you’ve skimmed this and missed the point. Since I wait for the entire stream from the API to complete, as far as the code is concerned it should be no different from a file stream; to all intents and purposes, it is a file download.
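To make the "wait for the whole download, then treat it like a file" idea concrete, here is a minimal Python sketch. It assumes only that the response body is an object with a read() method; the helper names buffer_stream and looks_like_complete_parquet are made up for illustration and are not part of any library. It copies a forward-only stream into a seekable in-memory buffer and checks for the 4-byte PAR1 magic that a Parquet file carries at both its start and its end.

```python
import io

PARQUET_MAGIC = b"PAR1"  # Parquet files begin and end with these 4 bytes


def buffer_stream(stream, chunk_size=64 * 1024):
    """Read a forward-only stream to the end and return a seekable BytesIO."""
    buffer = io.BytesIO()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty read means the stream is exhausted
            break
        buffer.write(chunk)
    buffer.seek(0)  # rewind so a reader starts at the beginning
    return buffer


def looks_like_complete_parquet(buffer):
    """Cheap sanity check: leading and trailing magic bytes are both present."""
    size = buffer.seek(0, io.SEEK_END)
    if size < 12:  # magic (4) + footer length (4) + magic (4)
        return False
    buffer.seek(0)
    head = buffer.read(4)
    buffer.seek(-4, io.SEEK_END)
    tail = buffer.read(4)
    buffer.seek(0)  # leave the buffer rewound for the caller
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


if __name__ == "__main__":
    # Stand-in payload, not a real Parquet file: only the magic bytes are valid.
    fake_download = io.BytesIO(PARQUET_MAGIC + b"\x00" * 16 + PARQUET_MAGIC)
    buffered = buffer_stream(fake_download)
    print(looks_like_complete_parquet(buffered))
```

Once the bytes are fully buffered like this, a reader can seek to the footer exactly as it would with a file on disk, so any difference in the resulting column types would have to come from somewhere other than streaming itself.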
It’s as if the stream.ReadParquetAsDataFrameAsync method needs to accept a ParquetOptions argument so that TreatLargeIntegersAsDates can be set to true, which may well solve the problem. I believe ParquetReader can accept such a parameter when reading into its own table type, but this extension method does not.
As some extra information, I've debugged through into the Parquet.Net code and found the following in the Parquet.Meta.FileMetaData located here:
https://github.com/aloneguid/parquet-dotnet/blob/d8febde144ae4f2424947caa90a625c4c9d7ab49/src/Parquet/ParquetReader.cs#L39-L47
Under the Schema field at index [1], the type is "INT64" and the LogicalType has a value against the TIMESTAMP entry of Parquet.Meta.TimestampType with unit NANOS.
Let me know if there is any other such data that may be of use to you.
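For context on what the raw numbers mean: an INT64 column annotated as TIMESTAMP with unit NANOS holds nanoseconds since the Unix epoch. The sketch below (plain Python, with helper names invented for illustration) shows the arithmetic for turning such a value into a datetime, and into a .NET-style tick count, on the assumption that a manual conversion is acceptable as a stopgap. Python datetimes only have microsecond resolution and .NET ticks are 100 ns, so sub-microsecond or sub-tick digits are dropped.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Number of 100 ns ticks between 0001-01-01 and 1970-01-01 (the .NET Unix epoch).
DOTNET_UNIX_EPOCH_TICKS = 621_355_968_000_000_000


def nanos_to_datetime(nanos: int) -> datetime:
    """Convert nanoseconds since the Unix epoch to a UTC datetime."""
    # Floor-divide to microseconds: datetime cannot represent nanoseconds.
    return UNIX_EPOCH + timedelta(microseconds=nanos // 1_000)


def nanos_to_dotnet_ticks(nanos: int) -> int:
    """Convert nanoseconds since the Unix epoch to .NET DateTime ticks."""
    # One tick is 100 ns, so divide by 100 and shift to the 0001-01-01 origin.
    return DOTNET_UNIX_EPOCH_TICKS + nanos // 100


if __name__ == "__main__":
    raw = 1_700_000_000_000_000_000  # example INT64 value, chosen for illustration
    print(nanos_to_datetime(raw).isoformat())  # 2023-11-14T22:13:20+00:00
    print(nanos_to_dotnet_ticks(raw))
```

The same arithmetic would apply on the C# side to the System.Int64 values coming back from the stream, although that is a workaround rather than a fix for the type mapping.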
The Parquet.Meta.FileMetaData KeyValueMetaData field with key "pandas":
{
  "index_columns": [
    {
      "kind": "range",
      "name": null,
      "start": 0,
      "stop": 175296,
      "step": 1
    }
  ],
  "column_indexes": [
    {
      "name": null,
      "field_name": null,
      "pandas_type": "unicode",
      "numpy_type": "object",
      "metadata": {
        "encoding": "UTF-8"
      }
    }
  ],
  "columns": [
    {
      "name": "ts",
      "field_name": "ts",
      "pandas_type": "datetime",
      "numpy_type": "datetime64[ns]",
      "metadata": null
    },
    ...
  ],
  "creator": {
    "library": "pyarrow",
    "version": "15.0.0"
  },
  "pandas_version": "2.2.0"
}
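As a rough illustration of how this metadata could be used, the Python sketch below parses a trimmed copy of the "pandas" value and lists the columns that pandas recorded as datetimes. The function name datetime_columns is made up, and the second column ("value") is a hypothetical stand-in for the entries elided above; only the "ts" entry is taken from the real metadata.

```python
import json

# Trimmed sample of the "pandas" key-value metadata. The "value" column is a
# hypothetical placeholder for the elided entries; "ts" is copied from above.
PANDAS_METADATA = """
{
  "columns": [
    {"name": "ts", "field_name": "ts", "pandas_type": "datetime",
     "numpy_type": "datetime64[ns]", "metadata": null},
    {"name": "value", "field_name": "value", "pandas_type": "float64",
     "numpy_type": "float64", "metadata": null}
  ]
}
"""


def datetime_columns(metadata_json: str) -> list:
    """Return field names whose pandas_type marks them as datetimes."""
    metadata = json.loads(metadata_json)
    return [
        column["field_name"]
        for column in metadata.get("columns", [])
        # "datetimetz" is the tag pandas uses for timezone-aware columns.
        if column.get("pandas_type") in ("datetime", "datetimetz")
    ]


if __name__ == "__main__":
    print(datetime_columns(PANDAS_METADATA))  # ['ts']
```

This only reports what the writer intended; it says nothing about how a given reader maps the underlying INT64 physical type.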
Issue description
Issue
When reading Parquet from a web API stream where the data is served from Python, I am getting numeric values for dates. I have outlined two scenarios below: one where the data is returned as the incorrect type and one where it is returned as the correct type. I'm wondering whether this is purely a web-stream vs file-stream issue or whether there is something else at play.
The end-point is an internal corporate one so there's no chance of replication outside the network.
The following packages are being used: Flurl (4.0.0), Flurl.Http (4.0.2), Parquet.Net (4.23.4)
The project is C# on .NET Framework 4.8.1; the OS is Windows 10 Enterprise 10.0.19044
Scenarios
Reading a parquet stream from the web -> incorrect types
If I perform the following in Python
I get a dataframe as expected with the first column as dates of type datetime64[ns].
If I get a stream in C# using Flurl (a fluent wrapper around System.Net.Http) with
I end up with the first column as dates of type System.Int64.
Reading a file stream -> correct types
However, if I store the streamed Parquet as a Parquet file using Python
and then read it in using .NET
I get the first column as type System.DateTime, as expected.