Closed: JulienPeloton closed this issue 7 months ago
Well, if we have some sensible default value for these missing fields (do we?..) then the following approach works:
```python
dict_ = {'a': {'one': 1, 'two': 2}, 'b': {'two': 2}}
pd.DataFrame.from_dict(dict_).fillna({'b': 0}).astype({'b': int})
```
```
     a  b
one  1  0
two  2  2
```
The good side is that it may be specified just for the columns you need, and in a way that can be generalized and stored somewhere. Actually, we do have the concept of a schema when reading from the database, yes? Probably it could be extended to also include default values. Then it could be done in a centralized and generic way, like we already do for type conversion in `format_hbase_output`.
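A minimal sketch of what such a centralized schema with default values could look like. The `SCHEMA` mapping and `apply_schema` helper are hypothetical illustrations, not the actual fink-broker schema:

```python
import pandas as pd

# Hypothetical centralized schema: column name -> (dtype, default value)
SCHEMA = {
    'a': (int, 0),
    'b': (int, 0),
}

def apply_schema(pdf: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Fill missing entries with per-column defaults and cast to the declared type."""
    for col, (dtype, default) in schema.items():
        if col not in pdf.columns:
            # Column entirely absent from the query result
            pdf[col] = default
        pdf[col] = pdf[col].fillna(default).astype(dtype)
    return pdf

dict_ = {'a': {'one': 1, 'two': 2}, 'b': {'two': 2}}
pdf = apply_schema(pd.DataFrame.from_dict(dict_), SCHEMA)
print(pdf.dtypes)  # both columns are int64
```

The point of the mapping is that defaults live in one place next to the dtypes, so the same dictionary can drive both the NaN filling and the type conversion.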
Yes, this is a good approach. Actually, the schema is defined in the fink-broker repository, but to be truly useful (one does not want to install fink-broker!), it should be moved to fink-utils.
But we also have access to some schema directly from the database? Probably that is where it should reside.
Both should be used eventually (the manually defined schema and the schema from the database). The schema from the database is always inferred from the pushed data, so it does not prevent wrong types from being pushed. The manually defined schema, on the other hand, can be used to detect inconsistencies.
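A sketch of the inconsistency check described above, comparing the dtypes Pandas inferred from the pushed data against a manually declared schema. The `EXPECTED` mapping and `check_schema` function are hypothetical, for illustration only:

```python
import pandas as pd

# Hypothetical manually defined schema: column name -> expected dtype name
EXPECTED = {'a': 'int64', 'b': 'int64'}

def check_schema(pdf: pd.DataFrame, expected: dict) -> list:
    """Return a list of (column, expected dtype, actual dtype) inconsistencies."""
    issues = []
    for col, dtype in expected.items():
        actual = str(pdf[col].dtype) if col in pdf.columns else 'missing'
        if actual != dtype:
            issues.append((col, dtype, actual))
    return issues

pdf = pd.DataFrame.from_dict({'a': {'one': 1, 'two': 2}, 'b': {'two': 2}})
print(check_schema(pdf, EXPECTED))  # 'b' was inferred as float64
```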
For the record, Pandas proposes a built-in type for missing values in integer columns (https://pandas.pydata.org/pandas-docs/version/1.3/user_guide/missing_data.html#integer-dtypes-and-missing-data):
```python
dict_ = {'a': {'one': 1, 'two': 2}, 'b': {'two': 2}}
pdf = pd.DataFrame.from_dict(dict_)
```
```
     a    b
one  1  NaN
two  2  2.0
```
```python
pdf['b'] = pdf['b'].astype(pd.Int64Dtype())
```
```
     a     b
one  1  <NA>
two  2     2
```
I recently (https://github.com/astrolabsoftware/fink-broker/pull/717) factorized the code that defines the columns to be used for the index tables. Before, it was defined manually (with the risk of forgetting columns), while now the code calls the function `fink_broker.hbaseUtils.load_ztf_index_cols`. Nothing wrong with this, and rather a good practice. But the new set of columns contains columns previously not taken, with type `integer`. For the sake of this discussion, let's assume there is a new column called `col` with data type `integer`. HBase being schemaless, when we issue a query to get the data across multiple dates, some of the entries will contain information on `col`, and some will not. HBase being non-relational, it does not care. But when we then format the data into a Pandas DataFrame, which implicitly assumes the same data structure for all rows, the magic operates: entries without information on `col` suddenly get filled with NaN. It is actually easy to reproduce:
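A minimal, self-contained reproduction (the column names `objectId` and `mag` are hypothetical, standing in for real index-table columns):

```python
import pandas as pd

# Two rows queried across dates: only the second one carries `col`
rows = [
    {'objectId': 'ZTF1', 'mag': 17.2},
    {'objectId': 'ZTF2', 'mag': 18.1, 'col': 4},
]

pdf = pd.DataFrame(rows)
print(pdf)
#   objectId   mag  col
# 0     ZTF1  17.2  NaN
# 1     ZTF2  18.1  4.0
print(pdf['col'].dtype)  # float64, not int
```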
Not only does it fill the missing entries with NaN, it also casts the entire column to `float` due to the presence of the NaN...

Action item: