AFM-SPM / TopoStats

An AFM image analysis program to batch process data and obtain statistics from images
https://afm-spm.github.io/TopoStats/
GNU Lesser General Public License v3.0
60 stars 11 forks source link

Fixes `pd.concat` error halting processing #972

Closed MaxGamill-Sheffield closed 1 month ago

MaxGamill-Sheffield commented 1 month ago

closes #969

TLDR: PR adds the following:

Think I got it. The cause according to the docs is that pandas doesn't like concatenating objects with whole columns (or rows) as NaN's or None's (we had a mix which is why it was hard to find).

Not only that, but for some reason the "warning" also halts processing.

The initial fix I tried was to stop the warning using the below code which seemed to work, so maybe a 🔥hotfix🔥 idea for something in the future:

import warnings
...
warnings.filterwarnings("ignore")

Then I thought... WWND (what would @ns-rse do) 👼, and so I solved the root issue and stopped the message from popping up in the first place by dropping all columns with all values as None or NaN's, and applied this across:

Now I've tested this slightly to ensure that if a column is present in df1 but not df2, it will be added, but filled with NaNs for the values in the resultant_df for df2.

SylviaWhittle commented 1 month ago

Hey, I've read the PR and it looks great, just two things:

1 - Trying to replicate the issue in a python file out of curiosity, pandas seems to be allowing concatenation of data frames with columns / rows with np.nans in - what am I doing wrong here? 😅

image image

2 - Removing the columns that contain all np.nan values - could this lead to incomplete grainstats.csvs? Would this cause an issue from a user perspective? (ie trying to plot a column present in some grainstats.csvs but not in others) Should we have a final check before saving it to ensure all the columns are populated and if not, fill with np.nan?

MaxGamill-Sheffield commented 1 month ago

1 - Trying to replicate the issue in a python file out of curiosity, pandas seems to be allowing concatenation of data frames with columns / rows with np.nans in - what am I doing wrong here? 😅

Yep! I tried the same and it was driving me mad. I still have no idea of the difference 🤷‍♂️

2 - Removing the columns that contain all np.nan values - could this lead to incomplete grainstats.csvs? Would this cause an issue from a user perspective? (ie trying to plot a column present in some grainstats.csvs but not in others) Should we have a final check before saving it to ensure all the columns are populated and if not, fill with np.nan?

And also yes, if somehow the same column in all images returned None / NaN 's then that column would be missing. This would cause a problem when searching for that column to plot, but then again so would trying to plot values for a column of all NaN's. So I guess it's up to us wether we want to add it. Which side of the fence do you lie @SylviaWhittle?

SylviaWhittle commented 1 month ago

1 - Trying to replicate the issue in a python file out of curiosity, pandas seems to be allowing concatenation of data frames with columns / rows with np.nans in - what am I doing wrong here? 😅

Yep! I tried the same and it was driving me mad. I still have no idea of the difference 🤷‍♂️

2 - Removing the columns that contain all np.nan values - could this lead to incomplete grainstats.csvs? Would this cause an issue from a user perspective? (ie trying to plot a column present in some grainstats.csvs but not in others) Should we have a final check before saving it to ensure all the columns are populated and if not, fill with np.nan?

And also yes, if somehow the same column in all images returned None / NaN 's then that column would be missing. This would cause a problem when searching for that column to plot, but then again so would trying to plot values for a column of all NaN's. So I guess it's up to us wether we want to add it. Which side of the fence do you lie @SylviaWhittle?

I say that 1 - this is unlikely and 2 - this would cause an error anyways so from a user perspective then it isn't really solving anything so until someone complains and it becomes a legit issue, let's not try to preempt something that might not be an issue?

llwiggins commented 1 month ago

Thanks for your nice detective work on this @MaxGamill-Sheffield! 🕵️ This seems a sensible fix to me and enables the data set I mentioned in issue #969 to run through to completion and successfully produces the all_statistics.csv.

Happy for this to be merged!

MaxGamill-Sheffield commented 1 month ago

I've added to the usage docs in the the tracing outputs which were missing, and the warning about the columns so it is documented somewhere not in our brains :)