Optimize removeIndices

ArtPoon commented 4 years ago

These lines:

    # for each entry in dataList, remove the irrelevant columns
    while len(dataList) > 0:
        line = dataList.pop(0)

        finalLine = []

        for index in range(len(line)):
            if index in indiciesToKeep:
                finalLine.extend(line[index].vector)

        finalList.append(finalLine)

iterate over every position of each genome, which is unnecessary. It should be faster to iterate over indiciesToKeep only:

        for index in indiciesToKeep:
            if index < len(line):
                finalLine.extend(line[index].vector)
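
As a rough check that the two loops agree, here is a minimal self-contained sketch. The `Cell` class is a hypothetical stand-in for whatever object carries `.vector` in the real code, and the function names and sample data are made up for illustration. It assumes indiciesToKeep is sorted in ascending order, has no duplicates, and holds only non-negative values; if any of those does not hold, the second loop can produce columns in a different order or pick up unintended positions.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Cell:
    # Hypothetical stand-in for the objects stored in each line of dataList.
    vector: List[int]


def keep_scan_all(line: Sequence[Cell], indices_to_keep: Sequence[int]) -> List[int]:
    """Original approach: visit every position and test membership."""
    final_line: List[int] = []
    for index in range(len(line)):
        if index in indices_to_keep:
            final_line.extend(line[index].vector)
    return final_line


def keep_scan_kept(line: Sequence[Cell], indices_to_keep: Sequence[int]) -> List[int]:
    """Proposed approach: visit only the positions to keep."""
    final_line: List[int] = []
    for index in indices_to_keep:
        # Skip indices that fall beyond the end of this line.
        if index < len(line):
            final_line.extend(line[index].vector)
    return final_line


# Made-up data: six cells, each with a two-element vector.
line = [Cell([i, i + 100]) for i in range(6)]
# Sorted, unique, non-negative; 9 is deliberately out of range.
indices_to_keep = [1, 3, 4, 9]

assert keep_scan_all(line, indices_to_keep) == keep_scan_kept(line, indices_to_keep)
print(keep_scan_kept(line, indices_to_keep))
```

If indiciesToKeep is a list, the membership test in the first loop is itself a linear scan, so the original does work proportional to (genome length × number of kept indices) per genome, while the second loop does work proportional to the number of kept indices only.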

ArtPoon commented 4 years ago

Timing with 100 genomes sampled from the UK, using the original code:

(pangolin) art@orolo:~/work/sc2-clustering/data$ pangolin --outfile uk100.out uk100.fa
...
reading in data 07/27/2020, 11:50:22
removing unnecessary columns 07/27/2020, 11:50:26
loading model 07/27/2020, 11:56:07
generating predictions 07/27/2020, 11:56:08

With modified version:

(pangolin) art@orolo:~/work/sc2-clustering/data$ pangolin --outfile uk100-2.out uk100.fa 
...
reading in data 07/27/2020, 11:46:04
removing unnecessary columns 07/27/2020, 11:46:08
constructing data frame07/27/2020, 11:46:09
loading model 07/27/2020, 11:46:25
generating predictions 07/27/2020, 11:46:26

The outputs are identical (diff prints nothing):

(pangolin) art@orolo:~/work/sc2-clustering/data$ diff uk100.out uk100-2.out
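
For reference, the elapsed time between "removing unnecessary columns" and "loading model" can be worked out from the timestamps in the two logs above. This is only a small sketch using the standard library. The helper name `elapsed_seconds` is made up, the log has one-second resolution, the interval also covers data frame construction, and each figure comes from a single run.

```python
from datetime import datetime

# Timestamp layout used in the pangolin log lines quoted above.
FMT = "%m/%d/%Y, %H:%M:%S"


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two log timestamps (hypothetical helper)."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


# "removing unnecessary columns" -> "loading model", original code
original = elapsed_seconds("07/27/2020, 11:50:26", "07/27/2020, 11:56:07")
# same interval, modified code
modified = elapsed_seconds("07/27/2020, 11:46:08", "07/27/2020, 11:46:25")

print(f"original: {original:.0f} s, modified: {modified:.0f} s")
```

That is 341 seconds against 17 seconds for this interval, roughly a 20-fold reduction on this 100-genome sample.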

ArtPoon commented 4 years ago

Filing pull request

ArtPoon / pangolin

Optimize removeIndices #1