adamerose / PandasGUI

A GUI for Pandas DataFrames
MIT No Attribution
3.17k stars 229 forks source link

non-dates are often converted to datetime #146

Closed rjsdotorg closed 3 years ago

rjsdotorg commented 3 years ago

Described
issue where non-dates are often converted to datetime.

Environment OS: Windows 10 Python: 3.8.2 IDE: Scite

Package versions
pandasgui==0.2.11 pandas==1.1.4


I changed utility.py line  to

`            return pd.to_datetime(s, format="%b %d %Y")`
and now I only see
_PandasGUI WARNING — pandasgui.utility — In metadata_df, converted columns to datetime: RecordingDate_
which is desired.

It does convert everything else to a string (like float columns), but that is not a big issue for me, visually.
Here are the first two lines and header which are a result of `df.to_csv('metadata_df.csv')` which I do just before `show(metadata_df)`
[metadata_df.txt](https://github.com/adamerose/PandasGUI/files/6625776/metadata_df.txt)

Firstly, when I upgraded to 0.2.11 it seems to need a new module
`>pip install pandasgui==0.2.11`
I got `ModuleNotFoundError: No module named 'qtstylish'`

python Python 3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 23:03:10) [MSC v.1916 64 bit (AMD64)] on win32

import pandas pandas.version '1.1.4' import pandasgui Traceback (most recent call last): File "", line 1, in File "C:\Python38\lib\site-packages\pandasgui__init.py", line 15, in from pandasgui.gui import show File "C:\Python38\lib\site-packages\pandasgui\gui.py", line 20, in import qtstylish ModuleNotFoundError: No module named 'qtstylish' import pandasgui pandasgui.version__ '0.2.11'

but it did install fine: `pip install qtstylish`, so now I have

python Python 3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 23:03:10) [MSC v.1916 64 bit (AMD64)] on win32 import pandas pandas.version '1.1.4' import pandasgui pandasgui.version '0.2.11'

I did see the simple call in utility.py parse_dates_series(), so it appears to me now it is a pandas "thing", but maybe the question was why is pandasgui trying to recast columns at all?

So, it is likely that reading from CSV is not a proper test here. The data was strings read from report forms, and the df types as created are

...
Side                    object
Location                object
NIHSScore              float64
TICIScore               object
IV_tPA_timing           object
EVT_timing              object
RecordingDate           object
RecordingStartTime      object
RecordingStopTime       object
Gender                  object
Age                     object
SystolicBP              object
DiastolicBP             object

my calling code saves that df to csv, then opens pandasgui on the dataframe for a look

        metadata_df.to_csv('metadata_df.csv')
        try:
            show(metadata_df)
        except ValueError:
            pass

When I call show() I get: a

I might need to generate a pickle of metadata_df for you to test, if you'd like; if I "filter" the df through CSV, I get a result very similar to your screen cap, but with floats instead of ints b

Small data pkl in a zip: head.zip

rjsdotorg commented 3 years ago

When I run the cleaner:

>>> from pandasgui.utility import clean_dataframe
>>> df = clean_dataframe(df)
PandasGUI WARNING — pandasgui.utility — In DataFrame, converted column names to string: SubjectID, Recording, Sequenze,
Status, StrokeType, Side, Location, NIHSScore, TICIScore, IV_tPA_timing, EVT_timing, RecordingDate, RecordingStartTime,
RecordingStopTime, Gender, Age, SystolicBP, DiastolicBP, HeartRate, Height, Weight, Consciousness, NIHSS_Cons, NIHSS_Con
sQuest, NIHSS_ConsComm, NIHSS_Gaze, NIHSS_VisualFields, NIHSS_FacialParesis, NIHSS_MotorArmLeft, NIHSS_MotorArmRight, NI
HSS_MotorArmTotal, NIHSS_MotorLegLeft, NIHSS_MotorLegRight, NIHSS_MotorLegTotal, NIHSS_LimbAtaxia, NIHSS_Sensory, NIHSS_
Language, NIHSS_Dysarthria, NIHSS_Neglect, HeadsetSN, Coordinator
PandasGUI WARNING — pandasgui.utility — In DataFrame, converted columns to datetime: RecordingDate, Age, SystolicBP, Dia
stolicBP, HeartRate, Height, Weight, NIHSS_Cons, NIHSS_ConsQuest, NIHSS_ConsComm, NIHSS_Gaze, NIHSS_VisualFields, NIHSS_
FacialParesis, NIHSS_MotorArmLeft, NIHSS_MotorArmRight, NIHSS_MotorArmTotal, NIHSS_MotorLegLeft, NIHSS_MotorLegRight, NI
HSS_MotorLegTotal, NIHSS_LimbAtaxia, NIHSS_Sensory, NIHSS_Language, NIHSS_Dysarthria, NIHSS_Neglect, HeadsetSN

View:

> dfh = df.head()
> dfh
   UID  Subject  SiteID SubjectID  ... NIHSS_Dysarthria NIHSS_Neglect                     HeadsetSN      Coordinator
0    0        1       1    01-001  ...       1970-01-01    1970-01-01 1970-01-01 00:00:00.000101109  Matthew Mannion
1    0        1       1    01-001  ...       1970-01-01    1970-01-01 1970-01-01 00:00:00.000101109  Matthew Mannion
2    0        2       1    01-002  ...       1970-01-01    1970-01-01 1970-01-01 00:00:00.000101109      Lena Tabbaa
3    0        2       1    01-002  ...       1970-01-01    1970-01-01 1970-01-01 00:00:00.000101109      Lena Tabbaa
4    0        2       1    01-002  ...       1970-01-01    1970-01-01 1970-01-01 00:00:00.000101109      Lena Tabbaa
adamerose commented 3 years ago

Can you share the file for metadata_df.csv? I can't reproduce this, and your pickle seems to be the object after columns get converted to dates not before. PandasGUI should only convert columns to dates if they are str and if pd.to_datetime succeeds on the column.

rjsdotorg commented 3 years ago

Argg, sorry, that was a pkl with the converted columns. The columns being converted with warnings are object type in the pkl, rather than the newer StringDtype. I might need to be explicit in the upstream script that saves the pkl; if I convert to string as below, all is well.

Here is the head data properly: head.zip

Here is some CMD testing (df is the whole, long file dataframe, dfh is the head):

>>> dfh
   Unnamed: 0  UID  Subject  SiteID  ... NIHSS_Dysarthria NIHSS_Neglect HeadsetSN      Coordinator
0           0    0        1       1  ...              0.0           0.0  101109.0  Matthew Mannion
1           1    0        1       1  ...              0.0           0.0  101109.0  Matthew Mannion
2           2    0        2       1  ...              0.0           0.0  101109.0      Lena Tabbaa
3           3    0        2       1  ...              0.0           0.0  101109.0      Lena Tabbaa
4           4    0        2       1  ...              0.0           0.0  101109.0      Lena Tabbaa

[5 rows x 45 columns]
>>> dfh['Consciousness']
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: Consciousness, dtype: object
>>> dfh.to_pickle(r'E:\projects\Jan Medical\software\quality\data\head.pkl')
>>> show(dfh)
PandasGUI INFO — pandasgui.gui — Opening PandasGUI...
PandasGUI WARNING — pandasgui.utility — In dfh, converted columns to datetime: TICIScore, IV_tPA_timing, EVT_timing, Rec
ordingDate, RecordingStartTime, RecordingStopTime, Consciousness
<pandasgui.gui.PandasGui object at 0x00000000223FF040>
>>>
>>> dfh.dtypes
Unnamed: 0               int64
UID                      int64
Subject                  int64
SiteID                   int64
SubjectID               object
Recording               object
Sequenze                object
Status                  object
StrokeType              object
Side                    object
Location                object
NIHSScore              float64
TICIScore               object
IV_tPA_timing           object
EVT_timing              object
RecordingDate           object
RecordingStartTime      object
RecordingStopTime       object
Gender                  object
Age                    float64
SystolicBP             float64
DiastolicBP            float64
HeartRate              float64
Height                 float64
Weight                 float64
Consciousness           object
NIHSS_Cons             float64
NIHSS_ConsQuest        float64
NIHSS_ConsComm         float64
NIHSS_Gaze             float64
NIHSS_VisualFields     float64
NIHSS_FacialParesis    float64
NIHSS_MotorArmLeft     float64
NIHSS_MotorArmRight    float64
NIHSS_MotorArmTotal    float64
NIHSS_MotorLegLeft     float64
NIHSS_MotorLegRight    float64
NIHSS_MotorLegTotal    float64
NIHSS_LimbAtaxia       float64
NIHSS_Sensory          float64
NIHSS_Language         float64
NIHSS_Dysarthria       float64
NIHSS_Neglect          float64
HeadsetSN              float64
Coordinator             object
dtype: object

If I convert to string, and do show(dfh) - no warnings

>>> dfh = dfh.astype("string")
>>> dfh.dtypes
Unnamed: 0             string
UID                    string
Subject                string
SiteID                 string
SubjectID              string
Recording              string
Sequenze               string
Status                 string
StrokeType             string
Side                   string
Location               string
NIHSScore              string
TICIScore              string
IV_tPA_timing          string
EVT_timing             string
RecordingDate          string
RecordingStartTime     string
RecordingStopTime      string
Gender                 string
Age                    string
SystolicBP             string
DiastolicBP            string
HeartRate              string
Height                 string
Weight                 string
Consciousness          string
NIHSS_Cons             string
NIHSS_ConsQuest        string
NIHSS_ConsComm         string
NIHSS_Gaze             string
NIHSS_VisualFields     string
NIHSS_FacialParesis    string
NIHSS_MotorArmLeft     string
NIHSS_MotorArmRight    string
NIHSS_MotorArmTotal    string
NIHSS_MotorLegLeft     string
NIHSS_MotorLegRight    string
NIHSS_MotorLegTotal    string
NIHSS_LimbAtaxia       string
NIHSS_Sensory          string
NIHSS_Language         string
NIHSS_Dysarthria       string
NIHSS_Neglect          string
HeadsetSN              string
Coordinator            string
dtype: object
>>> dfh['Consciousness']
0    <NA>
1    <NA>
2    <NA>
3    <NA>
4    <NA>
Name: Consciousness, dtype: string
>>> show(dfh)
PandasGUI INFO — pandasgui.gui — Opening PandasGUI...
<pandasgui.gui.PandasGui object at 0x00000000223F9F70>
>>>

However it does not convert RecordingDate string to datetime...

adamerose commented 3 years ago

The columns being converted with warnings are object type in the pkl, rather than the newer StringDtype

Ah yeah you're right, I convert columns to date if they're object type because that's often the type of columns containing strings, and I believe there's no way to differentiate between typical string columns and other types of object column like dict or whatnot, so I just try converting every object column to a date. StringDtype is actually unaccounted for, I'll fix that.

I loaded your pickle and ran parse_dates and these are the columns that got turned into dates on my end.

# This is good
RecordingDate          datetime64[ns]
RecordingStartTime     datetime64[ns]
RecordingStopTime      datetime64[ns]
# This is wrong. These `object` columns are all NaN and pd.to_datetime still converts them to datetime
TICIScore              datetime64[ns]
IV_tPA_timing          datetime64[ns]
EVT_timing             datetime64[ns]
Consciousness          datetime64[ns]
> pd.to_datetime(pd.Series([pd.np.nan, pd.np.nan, pd.np.nan, pd.np.nan]))
Out[3]: 
0   NaT
1   NaT
2   NaT
3   NaT
dtype: datetime64[ns]

So here are the issues I see:

  1. parse_dates is too aggressive and sometimes converts things to datetime when not desired. SOLUTION - I added an if-statement to check if the whole column is NaN and skip it in that case
  2. StringDtype doesn't get parsed into a date SOLUTION - Check for this as well as object
  3. Any logic for identifying if a column should be a date will not be perfect SOLUTION - I'm removing parse_dates from the automatic DataFrame cleaning on load, and instead adding it to the Edit menu as an optional action. I've also added this option to the context menu for column actions that pops up on right clicking headers
rjsdotorg commented 3 years ago

Very nice, thanks!