biocore / metagenomics_pooling_notebook

Jupyter notebooks to assist with sample processing
MIT License
8 stars 16 forks source link

nans break calculate_iseqnorm_pooling_volumes() #117

Closed RodolfoSalido closed 1 year ago

RodolfoSalido commented 1 year ago

https://github.com/biocore/metagenomics_pooling_notebook/blob/6d4ca83ff5b625c759f997b45ec18854145f57c2/metapool/metapool.py#L1238-L1239

The max() function returns nan when there's a nan in the input, breaking the math on these lines. Using pandas math instead solves the issue

plate_df['LoadingFactor'] = plate_df['proportion'].max()/plate_df['proportion']

Could we make this change ?

charles-cowart commented 1 year ago

@RodolfoSalido I don't believe this fix is correct, actually. As you can see below, both versions of your formula will return 'NaN'. This is because 'NaN' is the value for the divisor and anything divided by 'not-a-number' can only be 'not-a-number':

Python 3.9.15 | packaged by conda-forge | (main, Nov 22 2022, 08:55:37) [Clang 14.0.6 ] on darwin Type "help", "copyright", "credits" or "license" for more information.

import pandas as pd data = [10,20,30,40,float('NaN'),60] df = pd.DataFrame(data, columns=['proportion']) prop = df['proportion'] max(prop)/prop 0 6.0 1 3.0 2 2.0 3 1.5 4 NaN 5 1.0 Name: proportion, dtype: float64 df['proportion'].max()/df['proportion'] 0 6.0 1 3.0 2 2.0 3 1.5 4 NaN 5 1.0 Name: proportion, dtype: float64

I did test the change and wrote a test for it to be sure. @RodolfoSalido can you think about what is the proper way to handle things if a row contains a 'NaN' value in the proportions column? Do we convert it to 0 or some other number? Do we omit the entire column? Do we do something else? I don't know what's the best way to approach that, actually.

charles-cowart commented 1 year ago

@RodolfoSalido https://github.com/biocore/metagenomics_pooling_notebook/pull/118 implements this change as is. Before we approve and merge it, we should replace this code w/defined behavior for handling 'NaN.'

RodolfoSalido commented 1 year ago

Thanks Charlie! Looks like the environment that the metagenomics_pooling_notebook is deployed in behaves differently than where you are testing the snippet you shared.

The max() function returns a NaN in the jupyter environment when there's a NaN in the input vector, which propagates to NaNs for all divisions cause of the NaN in the numerator. The pandas math works fine in contrast. Screen Shot 2023-04-27 at 10 27 40 AM

As to handling NaNs in the output (NaNs in LoadingFactor), the next two lines in metapool.py actually handle that. https://github.com/biocore/metagenomics_pooling_notebook/blob/6d4ca83ff5b625c759f997b45ec18854145f57c2/metapool/metapool.py#L1240-L1241

RodolfoSalido commented 1 year ago

addressed in #123