PoonLab / Poplars

Open-source implementations of popular tools from Los Alamos National Laboratory HIV Database
GNU Affero General Public License v3.0

riplike: script is too slow compared to LANL tool #12

Open ArtPoon opened 5 years ago

ArtPoon commented 5 years ago

On my Mac at home (admittedly a slow machine):

Elzar:poplars artpoon$ time python3 riplike.py ref_genomes/K03455.fasta test.out 
K03455|HIVHXB2CG

real    2m35.830s
user    2m31.287s
sys     0m0.750s

This same query takes about 7 seconds on the LANL server.

First I'm going to see if the bootstrap step can be made faster.

ArtPoon commented 5 years ago

Turning off bootstrap sampling makes a big difference:

Elzar:poplars artpoon$ time python3 riplike.py -nrep 0 ref_genomes/K03455.fasta test.out
K03455|HIVHXB2CG

real    0m6.893s
user    0m6.578s
sys     0m0.256s

ArtPoon commented 5 years ago

Replacing random.randint with random.random seems to have made a big difference:

Elzar:poplars artpoon$ time python3 riplike.py  ref_genomes/K03455.fasta test.out
K03455|HIVHXB2CG

real    0m35.956s
user    0m35.614s
sys     0m0.278s

kwade4 commented 5 years ago

riplike is very slow on Windows (possibly due to the MAFFT version). I think pdist and bootstrap could be made faster.

pdist time = 22 seconds; bootstrap time = 101 seconds.

Bootstrap
def bootstrap(s1, s2, reps=100):
...

    for rep in range(reps):
        # draw alignment columns with replacement
        columns = [random.randint(0, seqlen-1) for _ in range(seqlen)]
        # apply the same columns to both sequences so they stay aligned
        b1 = ''.join([s1[i] for i in columns])
        b2 = ''.join([s2[i] for i in columns])
        yield b1, b2

The string joining in bootstrap seems slow and may not be necessary. pdist could be modified to use a list.
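A rough sketch of what skipping the joins might look like, assuming pdist only needs to count mismatched columns. The names `bootstrap_indices` and `pdist_indexed` are hypothetical and are not functions in riplike.py; gap and ambiguity handling is left out.

```python
import random


def bootstrap_indices(seqlen, reps=100, rng=random):
    """Yield one list of resampled column indices per replicate."""
    for _ in range(reps):
        # random.choices samples with replacement in a single call
        yield rng.choices(range(seqlen), k=seqlen)


def pdist_indexed(s1, s2, indices):
    """Proportion of sampled columns where s1 and s2 differ.

    Hypothetical helper: reads characters by index, so no
    resampled strings have to be built first.
    """
    if not indices:
        return 0.0
    ndiff = sum(1 for i in indices if s1[i] != s2[i])
    return ndiff / len(indices)


# toy aligned sequences, for illustration only
s1 = "ACGTACGTAC"
s2 = "ACGTTCGTAA"
dists = [pdist_indexed(s1, s2, idx)
         for idx in bootstrap_indices(len(s1), reps=5)]
```

Whether this is actually faster than the joined-string version would need to be timed on a real alignment.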

NumPy

Using NumPy arrays in pdist and bootstrap (see changes in commit 2d12ba5) seems to improve performance.

pdist time = 24 seconds; bootstrap time = 5 seconds.
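For reference, a minimal sketch of how the resampling could be vectorized with NumPy. This is not the code from commit 2d12ba5; the function name `bootstrap_pdist_np` is hypothetical, and it assumes two ASCII sequences of equal length with no special treatment of gaps.

```python
import numpy as np


def bootstrap_pdist_np(s1, s2, reps=100, seed=None):
    """Return an array of `reps` bootstrap p-distances (hypothetical helper)."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    if len(s1) == 0:
        return np.zeros(reps)

    rng = np.random.default_rng(seed)
    # view each sequence as an array of byte values
    a1 = np.frombuffer(s1.encode("ascii"), dtype=np.uint8)
    a2 = np.frombuffer(s2.encode("ascii"), dtype=np.uint8)
    diff = a1 != a2  # True where the two sequences differ

    # one row of resampled column indices per replicate (high is exclusive)
    idx = rng.integers(0, diff.size, size=(reps, diff.size))
    # the mean of booleans along each row is the proportion of differences
    return diff[idx].mean(axis=1)


dists = bootstrap_pdist_np("ACGTACGTAC", "ACGTTCGTAA", reps=5, seed=1)
```

The index matrix holds reps × seqlen integers, so for very long alignments it may be better to loop over replicates instead of drawing them all at once.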

ArtPoon commented 5 years ago

I think the implementation of random.randint in Python is exceedingly slow; try using random.random in combination with round instead. We could also pass a vector of differences (binary state) and resample that instead, which would avoid a lot of unnecessary calculation.
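Both ideas can be sketched roughly as follows. The helper names (`rand_index`, `diff_vector`, `bootstrap_pdist`) are hypothetical and this is not the code that ended up in riplike.py. One deliberate deviation from the comment: the sketch truncates with int() rather than round(), since round(x * (n - 1)) would make the two end indices half as likely as the rest.

```python
import random


def rand_index(n, rng=random):
    """Draw an index in 0..n-1 from a single random.random() call."""
    # rng.random() is in [0, 1), so int() truncation gives each
    # index the same probability
    return int(rng.random() * n)


def diff_vector(s1, s2):
    """Binary state per column: True where the aligned sequences differ."""
    return [a != b for a, b in zip(s1, s2)]


def bootstrap_pdist(diffs, reps=100, rng=random):
    """Yield one p-distance per replicate by resampling the difference vector."""
    n = len(diffs)
    if n == 0:
        return
    for _ in range(reps):
        # the sequences are compared once up front; each replicate
        # only sums resampled booleans
        ndiff = sum(diffs[int(rng.random() * n)] for _ in range(n))
        yield ndiff / n


# toy aligned sequences, for illustration only
diffs = diff_vector("ACGTACGTAC", "ACGTTCGTAA")
dists = list(bootstrap_pdist(diffs, reps=5))
```

Because the comparison is done once, each replicate costs one index draw and one list lookup per column, with no string building or character comparison.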

ArtPoon commented 5 years ago

See #22