Closed RodolfoSalido closed 1 year ago
@RodolfoSalido I don't believe this fix is correct, actually. As you can see below, both versions of your formula will return 'NaN'. This is because 'NaN' is the value for the divisor and anything divided by 'not-a-number' can only be 'not-a-number':
Python 3.9.15 | packaged by conda-forge | (main, Nov 22 2022, 08:55:37) [Clang 14.0.6 ] on darwin Type "help", "copyright", "credits" or "license" for more information.
import pandas as pd data = [10,20,30,40,float('NaN'),60] df = pd.DataFrame(data, columns=['proportion']) prop = df['proportion'] max(prop)/prop 0 6.0 1 3.0 2 2.0 3 1.5 4 NaN 5 1.0 Name: proportion, dtype: float64 df['proportion'].max()/df['proportion'] 0 6.0 1 3.0 2 2.0 3 1.5 4 NaN 5 1.0 Name: proportion, dtype: float64
I did test the change and wrote a test for it to be sure. @RodolfoSalido can you think about what is the proper way to handle things if a row contains a 'NaN' value in the proportions column? Do we convert it to 0 or some other number? Do we omit the entire column? Do we do something else? I don't know what's the best way to approach that, actually.
@RodolfoSalido https://github.com/biocore/metagenomics_pooling_notebook/pull/118 implements this change as is. Before we approve and merge it, we should replace this code w/defined behavior for handling 'NaN.'
Thanks Charlie! Looks like the environment that the metagenomics_pooling_notebook is deployed in behaves differently than where you are testing the snippet you shared.
The max() function returns a NaN in the jupyter environment when there's a NaN in the input vector, which propagates to NaNs for all divisions cause of the NaN in the numerator. The pandas math works fine in contrast.
As to handling NaNs in the output (NaNs in LoadingFactor), the next two lines in metapool.py actually handle that. https://github.com/biocore/metagenomics_pooling_notebook/blob/6d4ca83ff5b625c759f997b45ec18854145f57c2/metapool/metapool.py#L1240-L1241
addressed in #123
https://github.com/biocore/metagenomics_pooling_notebook/blob/6d4ca83ff5b625c759f997b45ec18854145f57c2/metapool/metapool.py#L1238-L1239
The max() function returns nan when there's a nan in the input, breaking the math on these lines. Using pandas math instead solves the issue
plate_df['LoadingFactor'] = plate_df['proportion'].max()/plate_df['proportion']
Could we make this change ?