chrisquince / DESMAN

De novo Extraction of Strains from MetAgeNomes

Memory requirements, number of strains #41

Open mherold1 opened 4 years ago

mherold1 commented 4 years ago

Hello,

I would like to use DESMAN on a large set of samples with a relatively small number of pre-defined SCGs. I am therefore using only the mappings of those samples against this set of SCGs and have adapted the following steps from the complete example: variant determination and strain inference.

Since I don't know how many strains/haplotypes to expect, I used the number of samples as the maximum:

num_samples=41
for g in $(seq 1 $num_samples); do
    for r in 0 1 2 3 4; do
        desman ../Variants/outputsel_var.csv -e ../Variants/outputtran_df.csv -o ClusterEC_${g}_${r} -i 500 -g $g -s $r > ClusterEC_${g}_${r}.out
    done
done

My issue is that while this command works for smaller values of g, the memory requirement grows beyond what is available, and for g=12 desman fails with the following error:

desman ../Variants/outputsel_var.csv -e ../Variants/outputtran_df.csv -o ClusterEC_12_0 -i 500 -g 12 -s 0 > ClusterEC_12_0.out
...
[1]+  Bus error               desman ../Variants/outputsel_var.csv -e ../Variants/outputtran_df.csv -o ClusterEC_12_0 -i 500 -g 12 -s 0 > ClusterEC_12_0.out

More detailed error message:

Traceback (most recent call last):
  File "miniconda3/bin/desman", line 4, in <module>
    __import__('pkg_resources').run_script('desman==2.1.1', 'desman')
  File "miniconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "miniconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1462, in run_script
    exec(code, namespace, namespace)
  File "miniconda3/lib/python3.7/site-packages/desman-2.1.1-py3.7-linux-x86_64.egg/EGG-INFO/scripts/desman", line 246, in <module>
    main(sys.argv[1:])
  File "miniconda3/lib/python3.7/site-packages/desman-2.1.1-py3.7-linux-x86_64.egg/EGG-INFO/scripts/desman", line 138, in main
    haplo_SNP = hsnp.HaploSNP_Sampler(variant_Filter.snps_filter, genomes, prng, max_iter=no_iter)
  File "miniconda3/lib/python3.7/site-packages/desman-2.1.1-py3.7-linux-x86_64.egg/desman/HaploSNP_Sampler.py", line 96, in __init__
    temparray = du.cartesian(t1)
  File "miniconda3/lib/python3.7/site-packages/desman-2.1.1-py3.7-linux-x86_64.egg/desman/Desman_Utils.py", line 63, in cartesian
    out = np.zeros([n, len(arrays)], dtype=dtype)
MemoryError: Unable to allocate array with shape (1073741824, 15) and data type int64

For larger values of g (41 here), the error is slightly different:

...
  File "miniconda3/lib/python3.7/site-packages/desman-2.1.1-py3.7-linux-x86_64.egg/desman/Desman_Utils.py", line 68, in cartesian
    cartesian(arrays[1:], out=out[0:m,1:])
  File "miniconda3/lib/python3.7/site-packages/desman-2.1.1-py3.7-linux-x86_64.egg/desman/Desman_Utils.py", line 68, in cartesian
    cartesian(arrays[1:], out=out[0:m,1:])
  [Previous line repeated 7 more times]
  File "miniconda3/lib/python3.7/site-packages/desman-2.1.1-py3.7-linux-x86_64.egg/desman/Desman_Utils.py", line 66, in cartesian
    out[:,0] = np.repeat(arrays[0], m)
  File "<__array_function__ internals>", line 6, in repeat
  File "miniconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 481, in repeat
    return _wrapfunc(a, 'repeat', repeats, axis=axis)
  File "miniconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 61, in _wrapfunc
    return bound(*args, **kwds)
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
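
If I read the traceback correctly, cartesian() allocates an int64 array whose row count is the product of the candidate-base arrays, i.e. roughly 4**k rows by k columns; for the failing shape (1073741824, 15) that is about 129 GB, and it quadruples with every additional column. A minimal back-of-envelope sketch of that estimate (my own assumption about the scaling, not code taken from DESMAN):

# Rough size estimate for the array cartesian() tries to allocate (int64 = 8 bytes).
# Assumes the row count grows as 4**k, as the failing shape (4**15, 15) suggests.
def approx_cartesian_bytes(k, n_bases=4, itemsize=8):
    return (n_bases ** k) * k * itemsize

for k in (12, 15, 20):
    print(k, approx_cartesian_bytes(k) / 1e9, "GB")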

Are there any recommendations on how many samples a strain must occur in for it to be identified? Is there a way to calculate the memory requirements given a large number of expected strains? Would I need to "pre-partition" my samples beforehand to reduce complexity?