CityofSantaMonica / mds-provider

Python tools for working with MDS Provider data
https://github.com/openmobilityfoundation/mobility-data-specification
MIT License
18 stars 20 forks source link

column of type enum[] is not casted correctly in db/load.py #66

Closed yoimnik closed 5 years ago

yoimnik commented 5 years ago

i get the error copy/pasted below when i try to run an ingest on mds-provider-services

i see that in mds-provider-services, the status_changes table is created and there's a column created like this: propulsion_type propulsion_types[] NOT NULL,

propulsion_types is an enum that's also created, that looks like this:

CREATE TYPE public.propulsion_types AS ENUM (
    'human',
    'electric_assist',
    'electric',
    'combustion'
);

now it seems like there's no handling for this on the sql alchemy side. it looks like it cannot automatically cast a text[] type into propulsion_types[]

i don't know how to fix this, i've tried pandas categorical types but that didn't work either. does pandas just not handle enum arrays in sql?

Loading status_changes from Bird
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1170, in _execute_context
    context)
  File "/opt/conda/lib/python3.6/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 683, in do_executemany
    cursor.executemany(statement, parameters)
psycopg2.ProgrammingError: column "propulsion_type" is of type propulsion_types[] but expression is of type text[]
LINE 1: ...es": [-118.49533833333335, 34.01016666666667]}}', ARRAY['ele... ARRAY['electric']
                                                             ^
HINT:  You will need to rewrite or cast the expression.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "main.py", line 362, in <module>
    ingest(mds.STATUS_CHANGES, client=client, **kwargs)
  File "main.py", line 285, in ingest
    load_data(datasource, record_type, **kwargs)
  File "/home/jovyan/work/mds/load.py", line 76, in load_data
    db.load_status_changes(src)
  File "/home/jovyan/work/mds/src/mds-provider/mds/db/load.py", line 322, in load_status_changes
    self.load_from_source(source, mds.STATUS_CHANGES, table, before_load=__before_load, **kwargs)
  File "/home/jovyan/work/mds/src/mds-provider/mds/db/load.py", line 253, in load_from_source
    self.load_from_source(page, record_type, table, **kwargs)
  File "/home/jovyan/work/mds/src/mds-provider/mds/db/load.py", line 248, in load_from_source
    self.load_from_records(records, record_type, table, **kwargs)
  File "/home/jovyan/work/mds/src/mds-provider/mds/db/load.py", line 200, in load_from_records
    self.load_from_df(df, record_type, table, **kwargs)
  File "/home/jovyan/work/mds/src/mds-provider/mds/db/load.py", line 145, in load_from_df
    df.to_sql(table, self.engine, if_exists="append", index=False)
  File "/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py", line 2130, in to_sql
    dtype=dtype)
  File "/opt/conda/lib/python3.6/site-packages/pandas/io/sql.py", line 450, in to_sql
    chunksize=chunksize, dtype=dtype)
  File "/opt/conda/lib/python3.6/site-packages/pandas/io/sql.py", line 1127, in to_sql
    table.insert(chunksize)
  File "/opt/conda/lib/python3.6/site-packages/pandas/io/sql.py", line 641, in insert
    self._execute_insert(conn, keys, chunk_iter)
  File "/opt/conda/lib/python3.6/site-packages/pandas/io/sql.py", line 616, in _execute_insert
    conn.execute(self.insert_statement(), data)
  File "/opt/conda/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 948, in execute
    return meth(self, multiparams, params)
  File "/opt/conda/lib/python3.6/site-packages/sqlalchemy/sql/elements.py", line 269, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/opt/conda/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1060, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/opt/conda/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1200, in _execute_context
    context)
  File "/opt/conda/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1413, in _handle_dbapi_exception
    exc_info
  File "/opt/conda/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 265, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/opt/conda/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 248, in reraise
    raise value.with_traceback(tb)
  File "/opt/conda/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1170, in _execute_context
    context)
  File "/opt/conda/lib/python3.6/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 683, in do_executemany
    cursor.executemany(statement, parameters)
sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) column "propulsion_type" is of type propulsion_types[] but expression is of type text[]
LINE 1: ...es": [-118.49533833333335, 34.01016666666667]}}', ARRAY['ele...
                                                             ^
HINT:  You will need to rewrite or cast the expression.
yoimnik commented 5 years ago

i changed the type of that column from propulsion_types[] to text[] , moved past that error, but now i ran into another casting issue:

psycopg2.ProgrammingError: column "event_time" is of type timestamp with time zone but expression is of type numeric
LINE 1: ....015159746663656]}}', 'user_pick_up', 'reserved', 1547268299...

did we change something with how we interface with the mds-provider-services db via sqlalchemy or was there a version change on sqlalchemy's end?

yoimnik commented 5 years ago

nvm i think i found the issue --

we were running this with --stage_first 0... which causes the code block https://github.com/CityofSantaMonica/mds-provider/blob/master/mds/db/load.py#L143

if we don't specify --stage_first then the code doesn't work and gives us this error:

Traceback (most recent call last):
  File "main.py", line 362, in <module>
    ingest(mds.STATUS_CHANGES, client=client, **kwargs)
  File "main.py", line 285, in ingest
    load_data(datasource, record_type, **kwargs)
  File "/home/jovyan/work/mds/load.py", line 60, in load_data
    stage_first = int(kwargs.get("stage_first", False))
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
yoimnik commented 5 years ago

we need to correctly handle default stage_first

thekaveman commented 5 years ago

The issue here was with a bad stage_first default as noted in CityofSantaMonica/mds-provider-services#37.

The code in mds-provider is behaving as expected (without staging first, data is dumped directly to the target table and assumes the target table can handle it).

I'm closing this and we'll get the fix in the linked PR merged ASAP.