balabanmetin / apples

distance based phylogenetic placement
GNU General Public License v3.0

Just obtain a distance matrix #13

Closed bananabenana closed 1 year ago

bananabenana commented 1 year ago

Hi,

Great tool you have developed here. I was wondering if there is a way to essentially get just the Scoredist-like pairwise distance matrix from an MSA, rather than tree building?

Is this a possibility? Apologies if this is too far off the intended scope

Thanks

balabanmetin commented 1 year ago

Hi Ben,

Thanks for using apples-2. It is not possible to output the distance matrix using the available options; however, it would be useful to add this feature. Until I implement it, one possible workaround is to add a print statement here and run apples-2 with the options "-f 0 -b 999999" so that the distance between every query and reference is computed.

Alternatively, parse the MSA yourself using the code here and compute pairwise distances using Scoredist code here.
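As a rough illustration of this second suggestion, parsing an aligned FASTA file into a name-to-sequence dictionary could look like the sketch below. `read_fasta` and `p_distance` are hypothetical helpers written for this example, not functions from apples-2, and the simple mismatch fraction only stands in for the Scoredist computation.

```python
from io import StringIO


def read_fasta(handle):
    """Parse FASTA text from an open file-like object into {name: sequence}."""
    seqs = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""  # ID up to the first whitespace
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)  # sequences may span several lines
    return {k: "".join(v) for k, v in seqs.items()}


def p_distance(a, b):
    """Stand-in distance (NOT Scoredist): mismatch fraction over ungapped columns."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return float("nan")
    return sum(x != y for x, y in pairs) / len(pairs)


# Tiny in-memory alignment; a real run would pass open("alignment.fasta").
example = StringIO(">s1 desc\nMKV-L\nA\n>s2\nMRV-L\nG\n")
msa = read_fasta(example)
d = p_distance(msa["s1"], msa["s2"])
```

The file name `alignment.fasta` in the comment is a placeholder. Swapping `p_distance` for the actual Scoredist function is what would make the output comparable to apples-2.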

bananabenana commented 1 year ago

Hi @balabanmetin,

Thanks for your fast response.

> Thanks for using apples-2. It is not possible to output the distance matrix using the available options; however, it would be useful to add this feature. Until I implement it, one possible workaround is to add a print statement here and run apples-2 with the options "-f 0 -b 999999" so that the distance between every query and reference is computed.

This option requires an input tree, which I don't have, as I am trying to get distances for >80k MSAs.

> Alternatively, parse the MSA yourself using the code here and compute pairwise distances using Scoredist code here.

I see what you mean, however I am unable to actually pull this off - skill issue. Thanks for your guidance though!

I guess what I am really after is being able to input a sole MSA and get pairwise Scoredists. I think this is outside the core functionality of apples-2. Do you know of any other implementations of Scoredist that provide this sort of functionality?

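balabanmetin commented 1 year ago

import sys
from apples.fasta2dic import fasta2dic
from apples.distance import scoredist

# Read the MSA given as the first command-line argument into {name: sequence}.
refs = fasta2dic(sys.argv[1], True, False)

# Header row: tab-separated sequence names.
print('\t' + '\t'.join(refs.keys()))
# One row per sequence: its name, then its distance to every sequence.
for k, v in refs.items():
    print(k, end="")
    for k2, v2 in refs.items():
        print("\t" + str(scoredist(v, v2, 0)), end="")
    print("")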

Run this script from the main repo directory and give it the MSA file name; it should print the distance matrix.

However, this will be slow, so you will need to optimize the code to compute an 80K by 80K distance matrix in a reasonable time. I do not know of any other tool that can do this efficiently.
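One straightforward saving, assuming the distance is symmetric (an assumption worth checking against the actual Scoredist code), is to evaluate each unordered pair once and mirror the result, which roughly halves the number of distance calls. In the sketch below, `pairwise_matrix` and `format_matrix` are hypothetical helpers and `mismatch_fraction` is a stand-in for `scoredist`, not the real function.

```python
from itertools import combinations_with_replacement


def mismatch_fraction(a, b):
    """Stand-in for scoredist: fraction of differing aligned positions."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pairwise_matrix(seqs, dist):
    """Return (names, matrix), calling dist once per unordered pair."""
    names = list(seqs)
    index = {n: i for i, n in enumerate(names)}
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    # Includes (a, a) pairs, so the diagonal is computed rather than assumed.
    for a, b in combinations_with_replacement(names, 2):
        d = dist(seqs[a], seqs[b])
        i, j = index[a], index[b]
        matrix[i][j] = d
        matrix[j][i] = d  # mirror: assumes dist(a, b) == dist(b, a)
    return names, matrix


def format_matrix(names, matrix):
    """Tab-separated text: a header row of names, then one row per sequence."""
    lines = ["\t" + "\t".join(names)]
    for name, row in zip(names, matrix):
        lines.append(name + "\t" + "\t".join(str(x) for x in row))
    return "\n".join(lines)


seqs = {"s1": "MKVL", "s2": "MRVL", "s3": "MRVA"}
names, matrix = pairwise_matrix(seqs, mismatch_fraction)
table = format_matrix(names, matrix)
```

With apples-2 importable, one would presumably pass something like `lambda a, b: scoredist(a, b, 0)` as `dist`, mirroring the call in the script. The work still grows quadratically with the number of sequences, so this is a constant-factor gain only.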

bananabenana commented 1 year ago

Okay wow that does it. That is perfect. I really appreciate that additional script - will be citing the apples-2 paper! Results > efficiency at this point!