Closed andreaniml closed 1 year ago
See the docstring for this tool rather than PopPUNK. If you've got version 2.0.0 you can run sketchlib
to see the options.
I know that with sourmash I can query a genome and it outputs its closest matches and also limit the db with a picklist, can I do it similarly with poppunk? Can I subset this h5 file? Is it compatible with current poppunk versions?
You can use --subset
to do this. You can also use query sparse
mode with --kNN
to get a list of top matches. The default is with core/accessory distances, but you can also use Jaccard distances if you want the same as mash/sourmash.
Hi
I still need a little help, I'm going to try to explain my reasoning:
I've selected ~800 genomes from the 661k genomes database, which I'll call group A.
I'm interested in gathering close genomes to those that are in group A, more specifically I narrowed my search range down to 70k genomes, I'll call it group B
From my understanding, if I use --subset with a file with all the samples (group A + group B) and, maybe --kNN 10, I'm going to find for each sample the top 10 matches (which I will have to sort out the matches of group A I'm after).
However, I'm curious, is there a way to extract group B creating a new (and smaller) h5 file? I'm guessing it would be quicker if I could just grab the top 10 closest to group A. Am I thinking it right that the way I'm currently doing it I would be seeing the top 10 of any of the group B? I'm also worried that the top 10 matches of many of the group A samples would be other group A samples, and I'm trying to gather a context for group A.
Looking at main.py I was thinking on doing:
import pandas as pd
import numpy as np
import h5py
import pp_sketchlib
Myselection= pd.read_csv('Samples.csv') # DF with one column with the sketches names
BigDB = h5py.File(Bigdb_prefix + ".h5", 'r')
SmallerDB = BigDB [BigDB ['sketches'].isin(Myselection['samples'])]
h5py.File(SmallerDB)
However I think that these h5 files do not behave as dataframes, so my code would not do what I want it to do, did I go too far and there is a simpler way?
Many thanks and sorry for the long question!
PS: I am not expecting you to write the code for me if it's necessary for what I want to do, but as I've never worked with these kind of files I need a little push in the right direction
Hello!
I've done some digging in the documentation of h5py and I got to this, I've tested the code with a mock dataset and it seems to work, could you confirm it?
# My idea is to get one h5 file for group B and one file for each member of group A in order to create a bash script with sketchlib and query each sample individualy
import numpy as np
import h5py
import pp_sketchlib
import pandas as pd
GroupB = h5py.File('GroupB.h5', 'w')
GroupB.create_group('sketches')
BigDB=h5py.File('661kdb.h5','r')
MySamples=pd.read_csv(MySamples.csv) # Only the GroupB sample names
GroupB_list=list(MySamples['sample_id'])
with GroupB as f_dest:
for item in GroupB_list:
if item in BigDB['sketches'].keys():
BigDB.copy(BigDB['sketches/'+item],f_dest["/sketches"])
GroupB .close()
### Now the individual files
GroupA= pd.read_csv('GroupA.csv)
GroupA_list=list(GroupA['sample_id'])
for item in GroupA_list:
hf_tmp = h5py.File(item + '.h5', 'w')
hf_tmp.create_group('sketches')
if item in BigDB['sketches'].keys():
BigDB.copy(BigDB['sketches/'+item],hf_tmp["/sketches"])
This code would create the .h5 files that I need to do a bash script like
for fiile in S*.h5
do
sketchlib query dist GroupB.h5 ${file} > Query_${file}.csv
done
(I have not tested the bash script yet)
So then I would have all the queries, I could sort and get the first x matches I want.
I have one more question: the original database has a "random" partition, do I need to copy it to the subseted ones? Should I create them aggain? I'm not sure if it affects the distance calculation (I think it doesn't, but I wanted to check)
Yes sorry, we don't currently have an option to edit the databases – perhaps we could add this. The 'supported' way is through PopPUNK with --run-qc
which lets you remove samples from a list, but this probably won't really work for a sketchlib-centric application.
But it looks like you have the code right. Useful references might be:
It might be more efficient to batch your queries rather than use the bash loop (i.e. all the queries in one file rather than lots of files)
I have one more question: the original database has a "random" partition, do I need to copy it to the subseted ones? Should I create them aggain? I'm not sure if it affects the distance calculation (I think it doesn't, but I wanted to check)
Best thing to do is remove it, then add it back in with sketchlib add random <h5 file>
– you would do this for the reference database. But if you're using longer k-mer lengths (>17 for typical bacteria) then it doesn't have much effect and you can remove it.
Hi, the database was created with the default settings:
Running: BigDB['sketches']['SAMN10241008'].attrs['kmers'] I get: array([13, 16, 19, 22, 25, 28], dtype=int32)
I have kmers of 13 and 16 alongside larger kmers, better to use 'add random' then?
Yeah, I would suggest doing so. If you're having trouble, another option would be to exclude 13-mers by providing --kseq 16,28,3
Hi!
I'm trying to subset the 661k genomes database (https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001421). I know it has a h5 sketchlib file (661_ppsketch_v1.5.h5), however, I don't know what exactly I can do with it.
I know that with sourmash I can query a genome and it outputs its closest matches and also limit the db with a picklist, can I do it similarly with poppunk? Can I subset this h5 file? Is it compatible with current poppunk versions?
Cheers!