aeturrell / skimpy

skimpy is a lightweight tool that provides summary statistics about variables in data frames within the console.
https://aeturrell.github.io/skimpy/

`skim` throws exception if the first column contains purely None's #885

Open vojtech-filipec opened 4 days ago

vojtech-filipec commented 4 days ago

When the first column of your dataset contains only nulls, skim() raises an UnboundLocalError.

I think this is very relevant in practice: if you download a sample from a database, the first column may by chance contain only None.

Example code:

from skimpy import skim, generate_test_data

df = generate_test_data()

df_with_nulls = df.copy()
df_with_nulls['rnd'] = None
print("\n skim(df_with_nulls[['width','rnd']])` after replacing `rnd` (the second column) with Nones: \n")
skim(df_with_nulls[['width', 'rnd']])

print("\n `skim(df_with_nulls[['rnd','width']])` after replacing `rnd` (now the first column) with solely Nones: \n")
skim(df_with_nulls[['rnd', 'width',]])

Output:


 skim(df_with_nulls[['width','rnd']])` after replacing `rnd` (the second column) with Nones: 

/Users/vojtechfilipec/Documents/repos/streamlit-dashboards/.venv/lib/python3.12/site-packages/numpy/lib/histograms.py:885: RuntimeWarning: invalid value encountered in divide
  return n/db/n.sum(), bin_edges
╭────────────────────────────────────────────── skimpy summary ──────────────────────────────────────────────╮
│          Data Summary                Data Types                                                            │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                     │
│ ┃ dataframe         ┃ Values ┃ ┃ Column Type ┃ Count ┃                                                     │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                     │
│ │ Number of rows    │ 1000   │ │ float64     │ 2     │                                                     │
│ │ Number of columns │ 2      │ └─────────────┴───────┘                                                     │
│ └───────────────────┴────────┘                                                                             │
│                                                  number                                                    │
│ ┏━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━┓  │
│ ┃ column_name   ┃ NA    ┃ NA %  ┃ mean   ┃ sd     ┃ p0        ┃ p25    ┃ p50   ┃ p75   ┃ p100  ┃ hist   ┃  │
│ ┡━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━┩  │
│ │ width         │     0 │     0 │  2.037 │  1.929 │  0.002057 │  0.603 │ 1.468 │ 2.953 │ 13.91 │  ▇▃▁   │  │
│ │ rnd           │  1000 │   100 │    nan │    nan │       nan │    nan │   nan │   nan │   nan │        │  │
│ └───────────────┴───────┴───────┴────────┴────────┴───────────┴────────┴───────┴───────┴───────┴────────┘  │
╰─────────────────────────────────────────────────── End ────────────────────────────────────────────────────╯

 `skim(df_with_nulls[['rnd','width']])` after replacing `rnd` (now the first column) with solely Nones: 

Traceback (most recent call last):
  File "/Users/vojtechfilipec/Documents/repos/streamlit-dashboards/test_skimmer.py", line 11, in <module>
    skim(df_with_nulls[['rnd', 'width',]])
  File "/Users/vojtechfilipec/Documents/repos/streamlit-dashboards/.venv/lib/python3.12/site-packages/skimpy/__init__.py", line 698, in skim
    grid, json_data = _skim_computation(df_out)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vojtechfilipec/Documents/repos/streamlit-dashboards/.venv/lib/python3.12/site-packages/skimpy/__init__.py", line 579, in _skim_computation
    df = _infer_datatypes(df)
         ^^^^^^^^^^^^^^^^^^^^
  File "/Users/vojtechfilipec/Documents/repos/streamlit-dashboards/.venv/lib/python3.12/site-packages/skimpy/__init__.py", line 130, in _infer_datatypes
    df[col[0]] = df[col[0]].astype(data_type)
                                   ^^^^^^^^^
UnboundLocalError: cannot access local variable 'data_type' where it is not associated with a value

skimpy==0.0.15 Python 3.9.6

aeturrell commented 2 days ago

Thank you for raising this!

I think there are two potential approaches to take here. One would be to ignore columns that consist only of None. Another would be to give columns of type object (the type a None-only column has) their own special treatment, much as every other data type gets. I'm interested in your thoughts on whether it would be useful to have a section on None-only columns.
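
As a rough illustration of the first approach, here is a minimal user-side sketch that separates None-only columns from the rest before summarising. The helper name `split_all_null_columns` is hypothetical and not part of skimpy; only standard pandas calls are used.

```python
import pandas as pd


def split_all_null_columns(df: pd.DataFrame):
    """Hypothetical helper (not part of skimpy).

    Returns the frame without all-null columns, plus the names of the
    columns that were removed.
    """
    all_null = [col for col in df.columns if df[col].isna().all()]
    return df.drop(columns=all_null), all_null


# Small stand-in for the frame in the report: `rnd` holds only None.
df = pd.DataFrame({"rnd": [None, None, None], "width": [0.5, 1.5, 2.5]})
clean, dropped = split_all_null_columns(df)
print(dropped)
print(list(clean.columns))
```

The cleaned frame could then be passed to `skim(clean)`, and the list of dropped names reported separately. Whether skimpy itself should do something like this internally is the open question above.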

Note to self: this error happens because the if/else chain in _infer_datatypes() has no clause to catch a column of None: such a column has type object, which is not currently handled.
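
To make the mechanism concrete, here is a minimal pure-Python sketch of the pattern. It is a simplified stand-in written for illustration, not the actual skimpy source: the function `infer_types` and the string type labels are assumptions. It shows why the position of the None-only column could matter: a variable assigned only inside if/elif branches is unbound on the first loop iteration, but on later iterations it still holds the value from the previous column. That would be consistent with the first output above, where `rnd` is counted as float64.

```python
def infer_types(columns):
    """Hypothetical stand-in for an if/elif chain with no fallback branch."""
    inferred = {}
    for name, kind in columns:
        if kind == "float":
            data_type = "float64"
        elif kind == "int":
            data_type = "int64"
        # No else branch: an unrecognised kind leaves data_type untouched,
        # so it is either unbound or left over from the previous column.
        inferred[name] = data_type
    return inferred


# Unrecognised column second: it silently inherits the previous type.
second = infer_types([("width", "float"), ("rnd", "object")])
print(second)

# Unrecognised column first: data_type was never assigned.
try:
    infer_types([("rnd", "object"), ("width", "float")])
    raised = False
except UnboundLocalError:
    raised = True
print(raised)
```

If this is indeed the cause, adding an explicit branch (or a final else) for object columns would address both the exception and the silent reuse of the previous column's type.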