domoritz / arrow-tools

A collection of handy CLI tools to convert CSV and JSON to Apache Arrow and Parquet
Apache License 2.0
160 stars 9 forks source link

Timestamp mapping #95

Open matasello opened 5 months ago

matasello commented 5 months ago

Not sure I am doing this right, but I am trying to convert a CSV containing some timestamp to a parquet file.

Sample CSV

072e4a64-2ffb-437c-9458-4953abaa7a20,1,2023-01-18 23:05:10,104,-1,0
072e4a64-2ffb-437c-9458-4953abaa7a20,2,2023-01-18 23:05:10,104,-1,0
072e4a64-2ffb-437c-9458-4953abaa7a20,4,2023-01-18 23:05:10,104,-1,0
  1. First, the schema is generated with the csv2parquet --max-read-records 5 -p option. It correctly infers the timestamp field
    {
      "name": "ts",
      "data_type": {
        "Timestamp": [
          "Second",
          null
        ]
      },
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    },
  1. Then I do the actual conversion

csv2parquet --header false --schema-file mt_status.json /dev/stdin mt_status.parquet

  1. Then I try to open the table using duckdb, and I can see all the records, but the timestamp field shows as Int64
┌──────────────────────────────────────┬───────┬────────────┬──────────┬────────┬───────────┐
│                 guid                 │  st   │     ts     │ tsmillis │ result │ synthetic │
│               varchar                │ int16 │   int64    │  int16   │ int16  │   int16   │
├──────────────────────────────────────┼───────┼────────────┼──────────┼────────┼───────────┤
│ 072e4a64-2ffb-437c-9458-4953abaa7a20 │     1 │ 1674083110 │      104 │     -1 │         0 │
│ 072e4a64-2ffb-437c-9458-4953abaa7a20 │     2 │ 1674083110 │      104 │     -1 │         0 │
│ 072e4a64-2ffb-437c-9458-4953abaa7a20 │     4 │ 1674083110 │      104 │     -1 │         0 │
  1. And the parquet schema also shows the field as a Int64

│ mt_status.parquet │ ts │ INT64 │ │ REQUIRED │ │ │ │ │ │ │

Any hint ? Thanks

loicalleyne commented 1 month ago

@domoritz I've run into the json2parquet uses arrow::json::reader::infer_json_schema_from_seekable

It doesn't look like arrow-rs Arrow-jsoncollect_field_types_from_object` does any kind of timestamp inference at all. https://github.com/apache/arrow-rs/blob/master/arrow-json/src/reader/schema.rs#L88

The arrow-rs arrow-json has a low-level decoder that seems has some kind of support for coercing types to timestamp however I'm not sure how that would work and whether enabling timestamp detection to schema inference would need to be done in json2parquet or in arrow-rs. https://github.com/apache/arrow-rs/blob/master/arrow-json/src/reader/mod.rs

domoritz commented 1 month ago

Hmm, thanks for looking into this. I won't have time to look into this deeply anytime soon but I'd be more than happy to review a pull request.