Open ArtPoon opened 5 years ago
Turning off bootstrap sampling makes a big difference:
Elzar:poplars artpoon$ time python3 riplike.py -nrep 0 ref_genomes/K03455.fasta test.out
K03455|HIVHXB2CG
real 0m6.893s
user 0m6.578s
sys 0m0.256s
Replacing random.randint with random.random seems to have made a big difference:
Elzar:poplars artpoon$ time python3 riplike.py ref_genomes/K03455.fasta test.out
K03455|HIVHXB2CG
real 0m35.956s
user 0m35.614s
sys 0m0.278s
riplike is very slow on Windows (possibly due to the MAFFT version). I think pdist and bootstrap could be made faster.
pdist time = 22 seconds
bootstrap time = 101 seconds
def bootstrap(s1, s2, reps=100):
    ...
    for rep in range(reps):
        result = []
        bootstrap = [random.randint(0, seqlen-1) for _ in range(seqlen)]
        b1 = ''.join([s1[i] for i in bootstrap])
        b2 = ''.join([s2[i] for i in bootstrap])
        yield b1, b2
The string joining in bootstrap seems slow and may not be necessary. pdist could be modified to use a list.
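One way to avoid the joins, as a rough sketch: have the bootstrap step yield only the resampled column indices and let the distance function read the original sequences at those positions. The names bootstrap_indices and pdist_at are hypothetical, and the distance here is a plain proportion of mismatches, which may not match what riplike's pdist actually does (for example, with gaps or ambiguous bases).

```python
import random


def bootstrap_indices(seqlen, reps=100):
    """Yield one list of resampled column indices per replicate.

    Hypothetical helper: no new strings are built, only index lists.
    """
    for _ in range(reps):
        yield [random.randint(0, seqlen - 1) for _ in range(seqlen)]


def pdist_at(s1, s2, indices):
    """Proportion of mismatches between s1 and s2 over the given columns.

    Simplified p-distance (assumption): every column counts, gaps included.
    """
    ndiff = sum(1 for i in indices if s1[i] != s2[i])
    return ndiff / len(indices)


if __name__ == "__main__":
    s1, s2 = "ACGTACGT", "ACGAACGA"
    for idx in bootstrap_indices(len(s1), reps=3):
        print(pdist_at(s1, s2, idx))
```

This keeps the resampling logic unchanged and removes two string allocations per replicate.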
Using NumPy arrays in pdist and bootstrap (see changes in commit 2d12ba5) seems to improve performance.
pdist time = 24 seconds
bootstrap time = 5 seconds
I think that the implementation of random.randint in Python is exceedingly slow; try using random.random in combination with round instead.
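A minimal sketch of that idea, with one deliberate change: it truncates with int() rather than using round(). With round(random.random() * (seqlen - 1)), the first and last index would each be drawn about half as often as the others, whereas int(random.random() * seqlen) is uniform over 0..seqlen-1. The function name sample_indices is hypothetical.

```python
import random


def sample_indices(seqlen):
    """Draw seqlen column indices from 0..seqlen-1 with replacement."""
    rand = random.random  # local alias avoids repeated attribute lookups
    # random.random() is in [0.0, 1.0), so the product is in [0, seqlen)
    # and truncation gives a uniform integer index.
    return [int(rand() * seqlen) for _ in range(seqlen)]


if __name__ == "__main__":
    print(sample_indices(10))
```

Whether this is actually faster than random.randint on a given machine is worth checking with timeit before committing to it.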
Also we could pass a vector of differences (binary state) instead and resample that, to avoid a lot of unnecessary calculation.
See #22
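The difference-vector idea could look roughly like this: compare the two sequences once, store a 0/1 value per column, and resample that vector for each replicate instead of rebuilding and re-comparing sequences. The names diff_vector and bootstrap_pdist are hypothetical, and the distance is assumed to be a plain proportion of mismatches, which may be simpler than riplike's pdist.

```python
import random


def diff_vector(s1, s2):
    """Return 1 for each column where s1 and s2 differ, else 0."""
    return [int(a != b) for a, b in zip(s1, s2)]


def bootstrap_pdist(diffs, reps=100):
    """Yield one bootstrap p-distance per replicate.

    Each replicate samples len(diffs) entries with replacement and
    takes the mean, so the sequences are compared only once up front.
    """
    n = len(diffs)
    for _ in range(reps):
        yield sum(random.choices(diffs, k=n)) / n


if __name__ == "__main__":
    diffs = diff_vector("ACGTACGT", "ACGAACGA")
    print(sum(diffs) / len(diffs))            # observed p-distance
    print(list(bootstrap_pdist(diffs, reps=5)))
```

random.choices samples with replacement, which is what a nonparametric bootstrap over alignment columns needs.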
On my Mac at home (admittedly a slow machine):
This same query takes about 7 seconds on the LANL server.
First I'm going to see if the bootstrap step can be made faster.