CrunchyData / pg_parquet

Copy to/from Parquet in S3 from within PostgreSQL
Other
358 stars 11 forks source link

Importing Array(String) from parquet file does not work with text[] column #79

Open aseigo opened 19 hours ago

aseigo commented 19 hours ago

Using the file at s3://fsq-os-places-us-east-1/release/dt=2024-11-19/places/parquet/places-00000.snappy.parquet (part of https://opensource.foursquare.com/os-places/ .. ~434MB in size), pg_parquet shows the following schema:

fspoi=# select * from parquet.schema('places-00000.snappy.parquet');
                               uri                                |        name         | type_name  | type_length | repetition_type | num_children | converted_type | scale | precision | field_id | logical_type
------------------------------------------------------------------+---------------------+------------+-------------+-----------------+--------------+----------------+-------+-----------+----------+--------------
 places-00000.snappy.parquet | arrow_schema        |            |             |                 |           23 |                |       |           |          |
 places-00000.snappy.parquet | fsq_place_id        | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | name                | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | latitude            | DOUBLE     |             | OPTIONAL        |              |                |       |           |          |
 places-00000.snappy.parquet | longitude           | DOUBLE     |             | OPTIONAL        |              |                |       |           |          |
 places-00000.snappy.parquet | address             | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | locality            | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | region              | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | postcode            | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | admin_region        | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | post_town           | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | po_box              | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | country             | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | date_created        | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | date_refreshed      | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | date_closed         | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | tel                 | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | website             | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | email               | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | facebook_id         | INT64      |             | OPTIONAL        |              |                |       |           |          |
 places-00000.snappy.parquet | instagram           | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | twitter             | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | fsq_category_ids    |            |             | OPTIONAL        |            1 | LIST           |       |           |          | LIST
 places-00000.snappy.parquet | list                |            |             | REPEATED        |            1 |                |       |           |          |
 places-00000.snappy.parquet | element             | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | fsq_category_labels |            |             | OPTIONAL        |            1 | LIST           |       |           |          | LIST
 places-00000.snappy.parquet | list                |            |             | REPEATED        |            1 |                |       |           |          |
 places-00000.snappy.parquet | element             | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
(28 rows)

and creating a table in postgresql as:

create table places (
  fsq_place_id        text primary key,
  name                text,
  latitude            double precision,
  longitude           double precision, 
  address             text,
  locality            text,
  region              text,
  postcode            text,
  admin_region        text,
  post_town           text,
  po_box              text,
  date_created        text,
  country             text, 
  date_refreshed      text,
  date_closed         text, 
  tel                 text, 
  website             text,
  email               text,
  facebook_id         bigint,
  instagram           text,
  twitter             text,
  fsq_category_ids     text[],
  fsq_category_labels text[]
);

then attempting to copy the file into the table with:

COPY places FROM 'places-00000.snappy.parquet'  WITH (format 'parquet');

results in the following error:

ERROR:  type mismatch for column "fsq_category_ids" between table and parquet file. table expected "List(Field { name: "fsq_category_ids", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"PARQUET:field_id": "22"} })" but file had "List(Field { name: "element", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })"

I also tried with a custom type such as CREATE TYPE element AS (elment text[]); but that ends up creating a list of records, which also does not match the parquet file.

I'm not sure if I'm doing something wrong here (the documentation is a bit light on how arrays are intended to work?), or if this is a bug in pg_parquet.

aseigo commented 19 hours ago

Idle wondering: is this because the names do not match? Is this another symptom of #39 ?

aykut-bozkurt commented 19 hours ago

Field names seem correct but COPY FROM is currently not interop well (requires exact match) with parquet files written by other tools. Hopefully, this PR should resolve it by allowing more relaxed schema match.

I also plan to introduce match_by_position option for COPY FROM to resolve #39.

EDIT: looks like field name also mismatch for the list element. So yes, another symptom of #39.

aseigo commented 19 hours ago

looks like field name also mismatch for the list element. So yes, another symptom of #39.

To make it even more tricky, I'm not even sure how to make the field name for the list element match. If the field in the table is changed from fsq_category_ids to element, then the error becomes that there is no element field in the parquet file (which is correct, of course!).

It is nice to match on name where possible ... perhaps a smaller useful change here would be to just ignore the name of the field on list elements, as they aren't nameable (afaik, anyways) in postgresql.

p.s. thank-you for this tool. Being able to drag data into pgsql from parquet files is very snazzy.

aykut-bozkurt commented 18 hours ago

thanks for the feedback, hopefully we will improve COPY FROM experience in a few weeks. There will be a few more PRs coming after the #39. The issue will also be fixed.

aykut-bozkurt commented 18 hours ago

you can try checkout to #39 btw. It has good chance to resolve the element name mismatch (cast will allow it), which will merged the next week.