bacpop / PopPUNK

PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)
https://www.bacpop.org/poppunk
Apache License 2.0

Generating jaccard distances per kmer #284

Closed by JLC2141 9 months ago

JLC2141 commented 9 months ago

PopPUNK version: 2.6.0

I am attempting to re-create the poppunk_sketch jaccard distance table as shown in this previous issue: https://github.com/bacpop/PopPUNK/issues/167#issuecomment-843873526

However, I am unable to use poppunk_sketch in my current version of poppunk. My current workaround is as follows:

sketchlib sketch -l files.txt -o database -s 1000 -k 15,30,3 --cpus 40
sketchlib query jaccard database -o dists --cpus 40
poppunk_extract_distances.py --distances dists --output distances.tab

Here, the "Core" and "Accessory" columns in the output of poppunk_extract_distances.py appear to hold the Jaccard distances for the first two k-mer lengths of the k-mer sequence (-k) specified in the "sketchlib sketch" command.

Is there a simpler approach to output a table of jaccard distances per kmer?

JLC2141 commented 9 months ago

Here's some additional information: pp-sketchlib v2.1.1

Installations:

PopPUNK install
conda create --name poppunk
conda activate poppunk
python3 -m pip install poppunk

pp-sketchlib install
sudo apt install cmake gfortran libarmadillo-dev libeigen3-dev libopenblas-dev
pip3 install pp-sketchlib

johnlees commented 9 months ago

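Have you tried just omitting the output option (-o) from the query step, so that the distances are printed to standard output and can be redirected to a file: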

sketchlib sketch -l files.txt -o database -s 1000 -k 15,30,3 --cpus 40
sketchlib query jaccard database --cpus 40 > distances.tab
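
If the per-k-mer table then needs further processing, reshaping it to one row per sample pair and k-mer length may be convenient. The sketch below is only an illustration: it assumes distances.tab is tab-separated with a header row, two sample-name columns, and then one column per k-mer length, which is an assumption about the layout and should be checked against the real file. The file example_distances.tab, its values, and the helper to_long are hypothetical.

```shell
# Hypothetical stand-in for distances.tab; the real layout and values may differ.
printf 'Query\tReference\t15\t18\n' > example_distances.tab
printf 'sampleA\tsampleB\t0.91\t0.85\n' >> example_distances.tab

# Reshape a wide table (one column per k-mer length) into long format
# (one row per sample pair and k-mer length).
to_long() {
  awk -F'\t' -v OFS='\t' '
    # Header row: remember the k-mer length named by each distance column.
    NR == 1 { for (i = 3; i <= NF; i++) k[i] = $i; next }
    # Data rows: emit query, reference, k-mer length, distance.
    { for (i = 3; i <= NF; i++) print $1, $2, k[i], $i }
  ' "$1"
}

to_long example_distances.tab
```

With a real file, the call would be to_long distances.tab, provided the layout matches the assumption above.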

JLC2141 commented 9 months ago

Thank you