Open hunterowens opened 4 years ago
It's funny because the initial reason for staging was precisely because incoming data didn't match the final destination table schema. The tmp tables were "dump the raw incoming data here" targets, with transformations applied at the database level to convert to expected types.
Consider all the timestamp fields for example: the incoming type int
epoch milliseconds doesn't match the final destination type timestamptz
(assuming you're using Postgres and making that conversion vs. storing the raw value). It doesn't matter if pandas decides to infer a tmp timestamp column as int
or float
since we perform the cast/conversion explicitly when copying from tmp to final.
Maybe another option here is to ditch staging entirely, and enforce the schema via pandas/Dataframe semantics vs. the INSERT INTO... SELECT...FROM
dance. My pandas-fu is lacking in how to deal with e.g. our enumerated types.
The specific problem we were seeing is with the associated_trips
field. Since it is null
much of the time, Pandas seems to be defaulting to nullable-double
, which is not correct, and not castable to UUID.
Perhaps a way to satisfy both constraints here is to force all the columns of the temp table to be a TEXT
type (since they are coming from JSON text, typically). I think that should be castable to everything we care about, right?
With regards to Pandas -- it's type system is somewhat less expressive than PostgreSQL, though it has recently gained the ability to extend the type system via extension arrays. I don't think that is likely worth our time, though.
An aside, extension arrays have just landed in GeoPandas, which is very exciting!
@ian-r-rose null associated_trips
is why we originally introduced some of that preprocessing around adding missing nullable columns to the temp dataframe, and dtype-casting the associated_trips
column to object
. See #28 and the fix in #29. You're still getting errors related to this though?
Yes, the error looks very similar to #28 (though with uuid
instead of uuid[]
, natch). The fix from #29 should no longer be relevant, right?
The fix from #29 evolved but is still relevant, see https://github.com/CityofSantaMonica/mds-provider/blob/master/mds/db/db.py#L234 and specifically for this discussion https://github.com/CityofSantaMonica/mds-provider/blob/master/mds/db/db.py#L255
Is there a reason that you have cast it to 'object'
and not str
(at least in the case of MDS > 0.3.0)?
@ian-r-rose I think that is probably a left-over from MDS 0.2.x days when this was an array field. Would a cast to str
here fix your issue you think?
if using
stage_first
, tmp table schema is inferred from the dataframe.Rather, the tmp table schema should be copied from the table in which you are going to upsert from.