astrolabsoftware / fink-science-portal

Fink Science Portal
https://fink-portal.org
Apache License 2.0
7 stars 4 forks source link

Inconsistency in index tables #580

Closed JulienPeloton closed 7 months ago

JulienPeloton commented 7 months ago

I recently (https://github.com/astrolabsoftware/fink-broker/pull/717) factorized the code that defines the columns to be used for index table. Before, it was defined manually (with the risk of forgetting columns), while now the code calls the function fink_broker.hbaseUtils.load_ztf_index_cols. Nothing wrong with this, and rather a good practice.

But, the new set of columns contains columns previously not taken with a type integer. For the sake of this discussion, let's assume there is a new column called col with data type integer. HBase being schemaless, when we issue a query to get the data across multiple dates, some of the entries will contain information on col, and some entries will not. HBase being non-relational, it does not care. But when we then format the data into a Pandas DataFrame, which implicitly assumes the same data structure for all rows, the magic operates -- entries without information on col gets suddenly filled with NaN:

d:nalerthist
<class 'int'>
Blazar_2460341.6173264_ZTF18adnclbq      2
Blazar_2460341.633206_ZTF18abwqdtl       6
Blazar_2460341.6564815_ZTF21acqryod      5
Blazar_2460341.6785185_ZTF18adaiwre     10
Blazar_2460341.6196875_ZTF18acbvgss     10
Blazar_2460341.715_ZTF18acewtco        829
Blazar_2460341.6712037_ZTF21acqryod      6
Blazar_2460341.7433218_ZTF18adaiwre     11
Blazar_2460340.882963_ZTF18aawwazx     NaN
Blazar_2460340.8867824_ZTF18aahnkft    NaN
Name: d:nalerthist, dtype: object

It is actually easy to reproduce:

dict_ = {'a': {'one': 1, 'two': 2}, 'b': {'two': 2}}
pd.DataFrame.from_dict(dict_)
     a    b
one  1  NaN
two  2  2.0

Not only it fills the missing entry with NaN, but it casts the entire column to float due to the presence of the NaN...

Action item:

karpov-sv commented 7 months ago

Well, if we have some sensible default value for these missing fields (do we?..) then the following approach works:

dict_ = {'a': {'one': 1, 'two': 2}, 'b': {'two': 2}}
pd.DataFrame.from_dict(dict_).fillna({'b':0}).astype({'b':int})

        a   b
one 1   0
two 2   2

Good side is what it may be specified just for the columns you need, and in a way that may be generalized and stored somewhere. Actually - we do have a concept of the schema when reading from the database, yes? Probably it may be extended to also include default values?.. Then it may be done in a centralized and generic way, like we already do for type conversion in format_hbase_output

JulienPeloton commented 7 months ago

Yes this is a good approach. Actually the schema is defined in the fink-broker repository

https://github.com/astrolabsoftware/fink-broker/blob/fe1aaca1c814a97b8c93d00bda9232ea17fec850/fink_broker/hbaseUtils.py#L99-L238

but to be truly useful (one does not want to install fink-broker!), this should be moved to fink-utils.

karpov-sv commented 7 months ago

But we also have access to some schema directly from the database? Probably it is the place where it should reside

JulienPeloton commented 7 months ago

Both should be used eventually (manually defined schema & schema from the database). The schema from the database is always inferred from the pushed data, but it does not prevent wrong types to be pushed. While the manually defined schema can be used to detect inconsistencies.

For the record, Pandas proposes a built-in type for missing values in integer columns (https://pandas.pydata.org/pandas-docs/version/1.3/user_guide/missing_data.html#integer-dtypes-and-missing-data):

dict_ = {'a': {'one': 1, 'two': 2}, 'b': {'two': 2}}

pdf = pd.DataFrame.from_dict(dict_)
     a    b
one  1  NaN
two  2  2.0

pdf['b'] = pdf['b'].astype(pd.Int64Dtype())
     a     b
one  1  <NA>
two  2     2