Add value-based tests for table ingestion

diskontinuum commented 4 years ago

Current tests only check the size of (but not the content of) the concatenated tables (for both --parquet and --sqlite). However, the tables have been modified:

Added TableNumber column and added prefixes in get_and_modify_df()
Alignment with reference dataframe in write_to_disk().

To compare the written and chopped files with the original ones, value-based tests would have to modify the original tables accordingly.
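A minimal sketch of such a value-based test, assuming the ingestion step only adds a TableNumber column and prefixes the remaining column names. The helper name `expected_table`, the `Cells` prefix, the integer table number and the column names are hypothetical and only illustrate the idea. The real prefix format and TableNumber values in the code base may differ.

```python
import pandas as pd


def expected_table(original, table_number, prefix):
    """Hypothetical helper: apply the ingestion changes to an original table."""
    # Prefix every original column name, e.g. "Area" -> "Cells_Area".
    expected = original.add_prefix(prefix + "_")
    # Add the TableNumber column in front (assumed position and value).
    expected.insert(0, "TableNumber", table_number)
    return expected


original = pd.DataFrame({"ObjectNumber": [1, 2], "Area": [10.5, 20.25]})

# Stand-in for the table that ingestion wrote and the test read back.
written = pd.DataFrame(
    {
        "TableNumber": [0, 0],
        "Cells_ObjectNumber": [1, 2],
        "Cells_Area": [10.5, 20.25],
    }
)

expected = expected_table(original, table_number=0, prefix="Cells")

# Compares values, column names and dtypes, not just the shape.
pd.testing.assert_frame_equal(written, expected)
```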

diskontinuum commented 4 years ago

Attention: When reading the files from Parquet back into a Pandas dataframe, integer columns that contain null values are implicitly type-converted from int to float, which will throw an error when asserting equivalence with the original table.

To make sure the types are identical even for the new NaN columns, cast them explicitly (with df[x] = df[x].astype(ref_df[x].dtype)).
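The explicit cast could look roughly like the sketch below. It does not read a Parquet file. Instead, it simulates the read-back table with float columns so the example stays pandas-only. The helper name `cast_like` and the column names are hypothetical.

```python
import pandas as pd


def cast_like(df, ref_df):
    """Hypothetical helper: cast each column of df to the dtype used in ref_df."""
    out = df.copy()
    for col in ref_df.columns:
        out[col] = out[col].astype(ref_df[col].dtype)
    return out


ref_df = pd.DataFrame({"ImageNumber": [1, 2], "Count": [3, 4]})

# Simulates a table whose integer columns came back as float.
read_back = pd.DataFrame({"ImageNumber": [1.0, 2.0], "Count": [3.0, 4.0]})

# The values match, but the dtypes do not, so a strict comparison would fail.
assert not read_back.dtypes.equals(ref_df.dtypes)

fixed = cast_like(read_back, ref_df)
pd.testing.assert_frame_equal(fixed, ref_df)
```

Note that casting to a plain integer dtype raises an error if the column still contains NaN values. In that case the reference column would need a float, object or nullable integer dtype.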

diskontinuum commented 4 years ago

Attention: Do not use the index_col=0 parameter in pandas_df = pd.read_csv(tmp_source, index_col=0) when importing from CSV to Pandas dataframes, because the first column will be used as the row label and cease to exist as a column.
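A small illustration of the effect, using an in-memory CSV with made-up column names instead of tmp_source:

```python
import io

import pandas as pd

csv_text = "ImageNumber,Count\n1,3\n2,4\n"

# With index_col=0 the first column becomes the index and is no longer a column.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without it, every CSV column stays a regular column.
without_index = pd.read_csv(io.StringIO(csv_text))

print(list(with_index.columns))     # ['Count']
print(list(without_index.columns))  # ['ImageNumber', 'Count']
```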

gwaybio commented 4 years ago

totally agree with adding value-based tests... our current system of only checking shape is very dangerous!

gwaybio commented 4 years ago

I am adding a documentation snippet previously in write.py (removed in #130)

""" --------- code snippets for testing code ---------
# -------- pandas dataframe alignment ----------------
# (note: missing columns are added with same name and type
#  as in ref_dataframe, but containing NaN values.)
dataframe, ref_dataframe_new = dataframe.align(ref_dataframe, join="right", axis=1)
# assert that the reference table has not been modified by the alignment.
assert ref_dataframe_new.equals(ref_dataframe)
# --------- identical pandas schemata-----------------
assert dataframe.dtypes.equals(ref_dataframe.dtypes)
# --------- identical pyarrow schemata----------------
# (note: use "==" for pyarrow schema comparisons, not "is")
table = pyarrow.Table.from_pandas(dataframe)
assert (table.schema.types == writers_dict[name]["schema"].types)
assert (table.schema.names == writers_dict[name]["schema"].names)
"""
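For reference, the alignment part of the snippet can be tried on its own with toy data. The dataframes below are made up. The pyarrow checks are left out because they depend on writers_dict from write.py.

```python
import pandas as pd

ref_dataframe = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
dataframe = pd.DataFrame({"a": [3, 4]})  # column "b" is missing here

# join="right" keeps exactly the columns of ref_dataframe.
aligned, ref_dataframe_new = dataframe.align(ref_dataframe, join="right", axis=1)

# The missing column is added and filled with NaN values.
assert list(aligned.columns) == ["a", "b"]
assert aligned["b"].isna().all()

# The reference table has not been modified by the alignment.
assert ref_dataframe_new.equals(ref_dataframe)
```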

cytomining / cytominer-database

Add value-based tests for table ingestion #129