biocore / biom-format

The Biological Observation Matrix (BIOM) Format Project
http://biom-format.org
Other
89 stars 95 forks source link

Edge case for biom.Table subsample with replacement #952

Closed mortonjt closed 3 months ago

mortonjt commented 4 months ago

This was originally posted on the qiime2 issue tracker https://github.com/qiime2/q2-feature-table/issues/245

I'm able to replicate it with some of my own data, and the results are a bit perplexing. When I pull out the original cython code and run it directly in ipython, it actually works -- suggesting to me that there is something off with the compilation of the cython code.

Below is the code to reproduce the error when I run table.subsample

import biom
table= biom.load_table('table.biom')
table.subsample(10, axis='sample', by_id=False, with_replacement=True)

And below is the code to directly compute the subsampling (borrowed from here), which doesn't have an error

import biom
table= biom.load_table('table.biom')
arr = table._get_sparse_data()

data, indptr = arr.data, arr.indptr
for i in range(indptr.shape[0] - 1):
        start, end = indptr[i], indptr[i+1]
         length = end - start
         counts_sum = data[start:end].sum()
         pvals = data[start:end] / counts_sum

         data[start:end] = rng.multinomial(n, pvals)

See attached zip file table.zip to reproduce this error. This was reproduced in biom versions 2.1.15, as well as 2.1.14 and 2.1.11

I'm also not seeing this issue when I'm using the dense subsampling in skbio

import biom
from skbio.stats import subsample_counts
table= biom.load_table('table.biom')
df = table.to_dataframe().T
df.apply(lambda x: subsample_counts(x.astype(np.int64), 1e6, replace=True))

EDIT : It looks like there is a type mis-alignment regarding the types. If I cast everything to ints, the issue appears to be resolved

import biom
table= biom.load_table('table.biom')
df = table.to_dataframe().apply(np.ceil).apply(np.int64).T
t = biom.Table(df.T.values, df.columns, df.index)
t.subsample(10, axis='sample', by_id=False, with_replacement=True)
wasade commented 4 months ago

Thanks, @mortonjt. But, the data are actually fractional so casting is lossy. Should we disallow subsampling on non-integer data? Or, should we instead round with cast rather than ceil?

In [9]: t = biom.load_table('table.biom')

In [10]: print(t.head())
# Constructed from biom file
#OTU ID 10A1-P1.mzML    10A2-P1.mzML    10C1-P1.mzML    10C2-P1.mzML    11A1-P1.mzML
4064    199.52254   174.41858   157.17462   216.63976   239.0588
173662  388.81177   379.29803   404.7725    380.4645    330.75812
3604    64.27305    37.126  13.994726   19.988409   1426.8013
4478    13.746785   5.8473034   6.119654    4.8788066   205.364
171437  68.62773    37.703896   47.83117    48.666946   34.73693

In [11]: t.matrix_data.data.dtype
Out[11]: dtype('float64')

In [12]:
mortonjt commented 4 months ago

Hi, I actually think an error is OK, it would help inform the users on how to circumvent this issue. I'm not crazy about rounding, since it would round entries between (0, 1) down to 0 -- which is terrible for low abundance features in some cases. I think this should be offloaded to the user.

wasade commented 4 months ago

Okay, sounds good -- thanks! So we for example, we could cast to int, test whether there are differences greater than tolerance to uncast data, and complain appropriately if so?

mortonjt commented 4 months ago

I'm OK with casting, provided that there is some type of warning (I'm not sure about testing for differences).

wasade commented 4 months ago

Thanks!

wasade commented 3 months ago

Hi @mortonjt, I cannot recreate this issue with the provided data and example, which is prohibiting creation of a regression test. Can you confirm this issue still exists in master?

$ git pull upstream master
From github.com:biocore/biom-format
 * branch            master     -> FETCH_HEAD
Already up to date.
$ md5 table.biom
MD5 (table.biom) = e340ba5071cbada0e918c322416ff3a3
$ python
Python 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:53:34) [Clang 16.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import biom
>>> table = biom.load_table('table.biom')
>>> table.subsample(10, with_replacement=True)
311 x 148 <class 'biom.table.Table'> with 923 nonzero entries (2% dense)
>>>
wasade commented 3 months ago

Nevermind, looks like there may be a platform sensitivity. I caught it on Linux, can run with it from here.

>>> t.subsample(10, with_replacement=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mcdonadt/biom-format/biom/table.py", line 2989, in subsample
    _subsample(data, n, with_replacement, rng)
  File "biom/_subsample.pyx", line 162, in biom._subsample._subsample
    n : int
  File "biom/_subsample.pyx", line 52, in biom._subsample._subsample_with_replacement

  File "numpy/random/_generator.pyx", line 4019, in numpy.random._generator.Generator.multinomial
ValueError: sum(pvals[:-1]) > 1.0
>>>