CityofSantaMonica / mds-provider

Python tools for working with MDS Provider data
https://github.com/openmobilityfoundation/mobility-data-specification
MIT License
18 stars 20 forks source link

don't infer schema on tmp tables #83

Open hunterowens opened 4 years ago

hunterowens commented 4 years ago

if using stage_first, tmp table schema is inferred from the dataframe.

Rather, the tmp table schema should be copied from the table in which you are going to upsert from.

thekaveman commented 4 years ago

It's funny because the initial reason for staging was precisely because incoming data didn't match the final destination table schema. The tmp tables were "dump the raw incoming data here" targets, with transformations applied at the database level to convert to expected types.

Consider all the timestamp fields for example: the incoming type int epoch milliseconds doesn't match the final destination type timestamptz (assuming you're using Postgres and making that conversion vs. storing the raw value). It doesn't matter if pandas decides to infer a tmp timestamp column as int or float since we perform the cast/conversion explicitly when copying from tmp to final.

Maybe another option here is to ditch staging entirely, and enforce the schema via pandas/Dataframe semantics vs. the INSERT INTO... SELECT...FROM dance. My pandas-fu is lacking in how to deal with e.g. our enumerated types.

ian-r-rose commented 4 years ago

The specific problem we were seeing is with the associated_trips field. Since it is null much of the time, Pandas seems to be defaulting to nullable-double, which is not correct, and not castable to UUID.

Perhaps a way to satisfy both constraints here is to force all the columns of the temp table to be a TEXT type (since they are coming from JSON text, typically). I think that should be castable to everything we care about, right?

ian-r-rose commented 4 years ago

With regards to Pandas -- it's type system is somewhat less expressive than PostgreSQL, though it has recently gained the ability to extend the type system via extension arrays. I don't think that is likely worth our time, though.

An aside, extension arrays have just landed in GeoPandas, which is very exciting!

thekaveman commented 4 years ago

@ian-r-rose null associated_trips is why we originally introduced some of that preprocessing around adding missing nullable columns to the temp dataframe, and dtype-casting the associated_trips column to object. See #28 and the fix in #29. You're still getting errors related to this though?

ian-r-rose commented 4 years ago

Yes, the error looks very similar to #28 (though with uuid instead of uuid[], natch). The fix from #29 should no longer be relevant, right?

thekaveman commented 4 years ago

The fix from #29 evolved but is still relevant, see https://github.com/CityofSantaMonica/mds-provider/blob/master/mds/db/db.py#L234 and specifically for this discussion https://github.com/CityofSantaMonica/mds-provider/blob/master/mds/db/db.py#L255

ian-r-rose commented 4 years ago

Is there a reason that you have cast it to 'object' and not str (at least in the case of MDS > 0.3.0)?

thekaveman commented 4 years ago

@ian-r-rose I think that is probably a left-over from MDS 0.2.x days when this was an array field. Would a cast to str here fix your issue you think?