ToucanToco / fastexcel

A Python wrapper around calamine
http://fastexcel.toucantoco.dev/
MIT License
121 stars 6 forks source link

Columns containing missing values will render entire column as `NaN` #169

Closed danielcs88 closed 9 months ago

danielcs88 commented 9 months ago
(
    fastexcel.read_excel(file)
    .load_sheet(idx_or_name=0, header_row=4)
    .to_pandas()
    .iloc[:, 1:]
)

When reading an Excel file, if a column has missing values, it will render the entire column as NaN (even if it is not missing).

image

Upon running .info()

(
    fastexcel.read_excel(file)
    .load_sheet(idx_or_name=0, header_row=4)
    .to_pandas()
    .iloc[:, 1:]
    .info()
)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9438 entries, 0 to 9437
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Date              9435 non-null   object 
 1   Transaction Type  9435 non-null   object 
 2   Num               9435 non-null   float64
 3   Name              9435 non-null   object 
 4   Memo/Description  0 non-null      object 
 5   Due Date          9435 non-null   object 
 6   Amount            9435 non-null   float64
 7   Open Balance      9435 non-null   float64
 8   Jurisdiction      0 non-null      object 
 9   Project Manager   0 non-null      object 
 10  A/R Paid          9435 non-null   object 
 11  Project Address   0 non-null      object 
 12  Permit No         0 non-null      object 
 13  Sales Rep         0 non-null      object 
dtypes: float64(3), object(11)
memory usage: 1.0+ MB
lukapeschke commented 9 months ago

I believe this will be fixed by https://github.com/ToucanToco/fastexcel/pull/164 , which should hopefully get released this week :slightly_smiling_face:

It would be great if you could provide a small xlsx file reproducing this specific bug, so we can check if this is the same bug as #158 .

danielcs88 commented 9 months ago

Hi @lukapeschke

I created this sample spreadsheet

image

When using the same code as initially

(
    fastexcel.read_excel("/Users/daniel/Downloads/Sample spreadsheet.xlsx")
    .load_sheet(idx_or_name=0, header_row=4)
    .to_pandas()
    .iloc[:, 1:]   
)
         Date Transaction Type     Num                 Name  ... A/R Paid          Project Address  Permit No  Sales Rep
0  2024-03-11          Invoice  4718.0        Clayton Moses  ...     Paid  7988 Cobblestone Street       None       None
1  2024-03-20          Invoice  5391.0           Zayne Mays  ...     Paid      99 Westminster Lane       None       None
2  2024-03-27          Invoice  1082.0      Terrence Bishop  ...     Paid       9 South Forest Dr.       None       None
3  2024-05-27          Invoice  7518.0            Dane Lowe  ...     Paid     9981 High Ridge Lane       None       None
4  2024-09-16          Invoice  4605.0         Jaylon Moyer  ...     Paid          187 Theatre St.       None       None
5  2024-10-07          Invoice  2228.0         Kiana Carson  ...     Paid                     None       None       None
6  2024-10-25          Invoice  8367.0          Ean Higgins  ...     Paid          7521 Beach Lane       None       None
7  2024-12-05          Invoice  1759.0           Cora Smith  ...     Paid      96 Stonybrook Court       None       None
8  2024-12-10          Invoice  5933.0  Allison Fitzpatrick  ...     Paid           574 Willow St.       None       None
9  2024-12-23          Invoice  6176.0        Giada Daniels  ...     Paid        62 NW. Park Drive       None       None
10        NaT             None     NaN                 None  ...     None                     None       None       None
11        NaT             None     NaN                 None  ...     None                     None       None       None
12        NaT             None     NaN                 None  ...     None                     None       None       None

[13 rows x 14 columns]

As last time, the columns in which the data is sparse, e.g., Permit No and Sales Rep which have some missing values are rendered into columns that only have NaN, specifically None.

danielcs88 commented 9 months ago

It looks like my issue is the same as #160

PrettyWood commented 9 months ago

We'll try to make a new release this week. I just came back from holidays and will review the fix tomorrow

PrettyWood commented 9 months ago

indeed duplicate of #160