Datatamer / tamr-client

Programmatically interact with Tamr
https://tamr-client.readthedocs.io
Apache License 2.0

Doc suggestion of using `astype(str)` on DataFrames before upload to Tamr is bad for `NaN`s #426

Closed: skalish closed this issue 4 years ago

skalish commented 4 years ago

The pandas method `astype(str)` casts every value in a DataFrame to a string, which allows the DataFrame to be uploaded to Tamr successfully. However, special values such as `NaN` and Python's `None` are converted to strings too (e.g. "nan" and "None"), introducing non-standard nulls into your Tamr dataset.

An alternative is casting with astype(object), which will preserve these special values. I'm unsure if there is a downside to this, but I think it is probably a better practice overall.
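
A small sketch of the difference, using a made-up two-column DataFrame (the column names and values are illustrative only). It counts how many cells pandas still recognizes as missing after each cast. On the pandas versions current when this issue was filed, `astype(str)` leaves no recognizable nulls; newer releases with a dedicated string dtype may behave differently, so the `astype(str)` count is printed rather than asserted.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value per column.
df = pd.DataFrame({"name": ["a", None], "score": [1.5, np.nan]})

as_str = df.astype(str)     # every cell is cast to a string
as_obj = df.astype(object)  # cells keep their original Python values

# Count the cells pandas still treats as missing after each cast.
nulls_after_str = int(as_str.isna().sum().sum())
nulls_after_obj = int(as_obj.isna().sum().sum())

print("nulls after astype(str):   ", nulls_after_str)
print("nulls after astype(object):", nulls_after_obj)
```

Note that `astype(object)` keeps non-null values in their original types as well (here, `1.5` stays a float), so it does not turn the data into strings.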

Related to #323, maybe #373

olivito commented 4 years ago

I tested casting to object after creating the DataFrame, and it failed on the upsert to Tamr.

This works for me:

    import pandas as pd
    # Read every column as object so values stay as the strings found in the file.
    df = pd.read_csv("my_file.csv", dtype=object)
    dataset.upsert_from_dataframe(df, "my_pk")

This doesn't work:

    # Let pandas infer column types, then cast to object afterwards.
    df = pd.read_csv("my_file.csv")
    df = df.astype(object)
    dataset.upsert_from_dataframe(df, "my_pk")

I'm not sure what the best solution is for an existing DataFrame, though.
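
One possible approach for an existing DataFrame, offered as an untested-against-Tamr sketch rather than a confirmed fix: cast each non-null cell to a string and leave missing cells as `None`. The helper name `stringify_preserving_nulls` and the sample data are hypothetical, and the sketch assumes unique column names.

```python
import numpy as np
import pandas as pd


def stringify_preserving_nulls(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with non-null cells cast to str and nulls set to None."""
    columns = {
        # pd.isna catches NaN, None, and NaT for scalar values.
        col: [None if pd.isna(value) else str(value) for value in df[col]]
        for col in df.columns
    }
    # dtype=object keeps None as None instead of coercing it to NaN.
    return pd.DataFrame(columns, index=df.index, dtype=object)


# Hypothetical data with inferred int, float, and string columns.
df = pd.DataFrame(
    {"id": [1, 2, 3], "score": [1.5, np.nan, 3.0], "name": ["a", None, "c"]}
)
result = stringify_preserving_nulls(df)
print(result)
```

One caveat: an integer column that contains a missing value is inferred as float by `read_csv`, so its values would be stringified as "1.0" rather than "1". Reading with `dtype=object` avoids that because the values are never parsed as numbers.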