Closed MaxGamill-Sheffield closed 1 month ago
Hey, I've read the PR and it looks great, just two things:
1 - Trying to replicate the issue in a python file out of curiosity, pandas seems to be allowing concatenation of data frames with columns / rows with np.nans
in - what am I doing wrong here? 😅
2 - Removing the columns that contain all np.nan
values - could this lead to incomplete grainstats.csv
s? Would this cause an issue from a user perspective? (ie trying to plot a column present in some grainstats.csv
s but not in others) Should we have a final check before saving it to ensure all the columns are populated and if not, fill with np.nan
?
1 - Trying to replicate the issue in a python file out of curiosity, pandas seems to be allowing concatenation of data frames with columns / rows with
np.nans
in - what am I doing wrong here? 😅
Yep! I tried the same and it was driving me mad. I still have no idea of the difference 🤷♂️
2 - Removing the columns that contain all
np.nan
values - could this lead to incompletegrainstats.csv
s? Would this cause an issue from a user perspective? (ie trying to plot a column present in somegrainstats.csv
s but not in others) Should we have a final check before saving it to ensure all the columns are populated and if not, fill withnp.nan
?
And also yes, if somehow the same column in all images returned None / NaN 's then that column would be missing. This would cause a problem when searching for that column to plot, but then again so would trying to plot values for a column of all NaN's. So I guess it's up to us wether we want to add it. Which side of the fence do you lie @SylviaWhittle?
1 - Trying to replicate the issue in a python file out of curiosity, pandas seems to be allowing concatenation of data frames with columns / rows with
np.nans
in - what am I doing wrong here? 😅Yep! I tried the same and it was driving me mad. I still have no idea of the difference 🤷♂️
2 - Removing the columns that contain all
np.nan
values - could this lead to incompletegrainstats.csv
s? Would this cause an issue from a user perspective? (ie trying to plot a column present in somegrainstats.csv
s but not in others) Should we have a final check before saving it to ensure all the columns are populated and if not, fill withnp.nan
?And also yes, if somehow the same column in all images returned None / NaN 's then that column would be missing. This would cause a problem when searching for that column to plot, but then again so would trying to plot values for a column of all NaN's. So I guess it's up to us wether we want to add it. Which side of the fence do you lie @SylviaWhittle?
I say that 1 - this is unlikely and 2 - this would cause an error anyways so from a user perspective then it isn't really solving anything so until someone complains and it becomes a legit issue, let's not try to preempt something that might not be an issue?
Thanks for your nice detective work on this @MaxGamill-Sheffield! 🕵️ This seems a sensible fix to me and enables the data set I mentioned in issue #969 to run through to completion and successfully produces the all_statistics.csv.
Happy for this to be merged!
I've added to the usage docs in the the tracing outputs which were missing, and the warning about the columns so it is documented somewhere not in our brains :)
closes #969
TLDR: PR adds the following:
Think I got it. The cause according to the docs is that pandas doesn't like concatenating objects with whole columns (or rows) as NaN's or None's (we had a mix which is why it was hard to find).
Not only that, but for some reason the "warning" also halts processing.
The initial fix I tried was to stop the warning using the below code which seemed to work, so maybe a 🔥hotfix🔥 idea for something in the future:
Then I thought... WWND (what would @ns-rse do) 👼, and so I solved the root issue and stopped the message from popping up in the first place by dropping all columns with all values as None or NaN's, and applied this across:
Now I've tested this slightly to ensure that if a column is present in
df1
but notdf2
, it will be added, but filled with NaNs for the values in theresultant_df
fordf2
.