I would like to use DESMAN on a large set of samples with a relatively small number of pre-defined SCGs. I am therefore using only the mappings of the samples to this set of SCGs, and have adapted the following steps from the complete example: variant determination and strain inference.
As I don't know the number of strains/haplotypes to expect, I used the number of samples as the maximum:
```sh
num_samples=41
for g in $(seq 1 $num_samples); do
    for r in 0 1 2 3 4; do
        desman ../Variants/outputsel_var.csv -e ../Variants/outputtran_df.csv \
            -o ClusterEC_${g}_${r} -i 500 -g $g -s $r > ClusterEC_${g}_${r}.out
    done
done
```
My issue is that while this command works for smaller values of g, the memory requirement seems to grow beyond what is available, and for g=12 desman fails with the following error:
```
Traceback (most recent call last):
  File "miniconda3/bin/desman", line 4, in <module>
    __import__('pkg_resources').run_script('desman==2.1.1', 'desman')
  File "miniconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "miniconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1462, in run_script
    exec(code, namespace, namespace)
  File "miniconda3/lib/python3.7/site-packages/desman-2.1.1-py3.7-linux-x86_64.egg/EGG-INFO/scripts/desman", line 246, in <module>
    main(sys.argv[1:])
  File "miniconda3/lib/python3.7/site-packages/desman-2.1.1-py3.7-linux-x86_64.egg/EGG-INFO/scripts/desman", line 138, in main
    haplo_SNP = hsnp.HaploSNP_Sampler(variant_Filter.snps_filter, genomes, prng, max_iter=no_iter)
  File "miniconda3/lib/python3.7/site-packages/desman-2.1.1-py3.7-linux-x86_64.egg/desman/HaploSNP_Sampler.py", line 96, in __init__
    temparray = du.cartesian(t1)
  File "miniconda3/lib/python3.7/site-packages/desman-2.1.1-py3.7-linux-x86_64.egg/desman/Desman_Utils.py", line 63, in cartesian
    out = np.zeros([n, len(arrays)], dtype=dtype)
MemoryError: Unable to allocate array with shape (1073741824, 15) and data type int64
```
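For reference, the allocation that fails appears to be the full cartesian product of the nucleotide states over the haplotype positions: the reported shape (1073741824, 15) is 4^15 rows by 15 columns of int64. Assuming that interpretation (the exponent is inferred from the traceback, not from the DESMAN source), a quick sketch of the memory it would need:

```python
# Rough estimate of the array that du.cartesian() tries to allocate.
# Shape and dtype are taken from the traceback above; treating the row
# count as 4**n_positions is an assumption based on the reported shape.
def cartesian_bytes(n_positions, n_states=4, itemsize=8):
    """Bytes needed for an int64 array of shape (n_states**n_positions, n_positions)."""
    rows = n_states ** n_positions
    return rows * n_positions * itemsize

# The failing case from the traceback: 4**15 rows x 15 columns x 8 bytes
print(f"{cartesian_bytes(15) / 2**30:.0f} GiB")  # -> 120 GiB
```

If that reading is right, the requirement grows exponentially with the number of positions, which would explain why small g works and larger g exhausts memory.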
Also, for larger values of g (41 here) it fails with a slightly different error:
```
...
  File "miniconda3/lib/python3.7/site-packages/desman-2.1.1-py3.7-linux-x86_64.egg/desman/Desman_Utils.py", line 68, in cartesian
    cartesian(arrays[1:], out=out[0:m,1:])
  File "miniconda3/lib/python3.7/site-packages/desman-2.1.1-py3.7-linux-x86_64.egg/desman/Desman_Utils.py", line 68, in cartesian
    cartesian(arrays[1:], out=out[0:m,1:])
  [Previous line repeated 7 more times]
  File "miniconda3/lib/python3.7/site-packages/desman-2.1.1-py3.7-linux-x86_64.egg/desman/Desman_Utils.py", line 66, in cartesian
    out[:,0] = np.repeat(arrays[0], m)
  File "<__array_function__ internals>", line 6, in repeat
  File "miniconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 481, in repeat
    return _wrapfunc(a, 'repeat', repeats, axis=axis)
  File "miniconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 61, in _wrapfunc
    return bound(*args, **kwds)
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
```
Are there any recommendations on how many samples a strain must occur in to be identified? Is there a way to calculate memory requirements given a large number of expected strains? Would I need to "pre-partition" my samples beforehand to reduce complexity?