diskontinuum opened this issue 4 years ago
Attention: When reading files from Parquet back into a Pandas dataframe, nullable values are implicitly type-converted from int to float (read here), which will make an equivalence assertion fail.
To make sure the dtypes are identical even for the new NaN columns, cast them explicitly, e.g. with `df[x] = df[x].astype(ref_df[x].dtype)`.
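A minimal sketch of the pitfall (the file name is illustrative): a nullable int64 Parquet column comes back as float64, and only an explicit cast restores a matching dtype.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A Parquet file whose column is nullable int64 (contains a null).
table = pa.table({"x": pa.array([1, 2, None], type=pa.int64())})
pq.write_table(table, "tmp.parquet")

df = pd.read_parquet("tmp.parquet")
assert df["x"].dtype == "float64"  # implicitly converted from int to float

# Cast explicitly; pandas' nullable "Int64" dtype tolerates the NaN,
# while a plain "int64" cast would raise here.
df["x"] = df["x"].astype("Int64")
```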
Attention: Do not use the `index_col=0` parameter in `pandas_df = pd.read_csv(tmp_source, index_col=0)` when importing from CSV into Pandas dataframes, because the first column will be used as a row label and cease to exist as a column.
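A small self-contained sketch of the difference (column names are made up, apart from TableNumber):

```python
import io
import pandas as pd

csv = io.StringIO("TableNumber,Area\n1,10\n2,20\n")

bad = pd.read_csv(csv, index_col=0)   # first column becomes the row index
assert "TableNumber" not in bad.columns

csv.seek(0)
good = pd.read_csv(csv)               # every column stays a column
assert "TableNumber" in good.columns
```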
totally agree with adding value-based tests... our current system of only checking shape is very dangerous!
I am adding a documentation snippet that was previously in `write.py` (removed in #130):
""" --------- code snippets for testing code ---------
# -------- pandas dataframe alignment ----------------
# (note: missing columns are added with same name and type
# as in ref_dataframe, but containing NaN values.)
dataframe, ref_dataframe_new = dataframe.align(ref_dataframe, join="right", axis=1)
# assert that the reference table has not been modified by the alignment.
assert ref_dataframe_new.equals(ref_dataframe)
# --------- identical pandas schemata-----------------
assert dataframe.dtypes.equals(ref_dataframe.dtypes)
# --------- identical pyarrow schemata----------------
# (note: use "==" for pyarrrow schema comparisons, not "is")
table = pyarrow.Table.from_pandas(dataframe)
assert (table.schema.types == writers_dict[name]["schema"].types)
assert (table.schema.names == writers_dict[name]["schema"].names)
"""
Current tests only check the size (but not the content) of the concatenated tables (for both `--parquet` and `--sqlite`). However, the tables have been modified: a `TableNumber` column is added and prefixed in `get_and_modify_df` and `write_to_disk()`. To compare the written and chopped files with the original ones, value-based tests would have to modify the original tables accordingly.
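As a hedged sketch of what such a value-based test could look like, `apply_ingest_modifications` below is a hypothetical stand-in for whatever `get_and_modify_df` and `write_to_disk()` actually do to the tables:

```python
import pandas as pd

def apply_ingest_modifications(df: pd.DataFrame, table_number: int) -> pd.DataFrame:
    # Hypothetical stand-in: mimic the ingest-time modifications
    # (here, only the added TableNumber column) so the reference
    # can be compared value by value with the written output.
    out = df.copy()
    out.insert(0, "TableNumber", table_number)
    return out

def test_written_values(written_path, original_df, table_number):
    written = pd.read_parquet(written_path)
    expected = apply_ingest_modifications(original_df, table_number)
    # Value-based check instead of a shape-only check; check_dtype=False
    # sidesteps the int/float Parquet issue noted above.
    pd.testing.assert_frame_equal(written, expected, check_dtype=False)
```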