bacpop / pp-sketchlib

Library of sketching functions used by PopPUNK
Apache License 2.0
30 stars 8 forks source link

Dealing with h5 files #81

Closed andreaniml closed 1 year ago

andreaniml commented 2 years ago

Hi!

I'm trying to subset the 661k genomes database (https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001421). I know it has a h5 sketchlib file (661_ppsketch_v1.5.h5), however, I don't know what exactly I can do with it.

I know that with sourmash I can query a genome and it outputs its closest matches and also limit the db with a picklist, can I do it similarly with poppunk? Can I subset this h5 file? Is it compatible with current poppunk versions?

Cheers!

johnlees commented 2 years ago

See the docstring for this tool rather than PopPUNK. If you've got version 2.0.0 you can run sketchlib to see the options.

I know that with sourmash I can query a genome and it outputs its closest matches and also limit the db with a picklist, can I do it similarly with poppunk? Can I subset this h5 file? Is it compatible with current poppunk versions?

You can use --subset to do this. You can also use query sparse mode with --kNN to get a list of top matches. The default is with core/accessory distances, but you can also use Jaccard distances if you want the same as mash/sourmash.

andreaniml commented 2 years ago

Hi

I still need a little help, I'm going to try to explain my reasoning:

I've selected ~800 genomes from the 661k genomes database, which I'll call group A.

I'm interested in gathering close genomes to those that are in group A, more specifically I narrowed my search range down to 70k genomes, I'll call it group B

From my understanding, if I use --subset with a file with all the samples (group A + group B) and, maybe --kNN 10, I'm going to find for each sample the top 10 matches (which I will have to sort out the matches of group A I'm after).

However, I'm curious, is there a way to extract group B creating a new (and smaller) h5 file? I'm guessing it would be quicker if I could just grab the top 10 closest to group A. Am I thinking it right that the way I'm currently doing it I would be seeing the top 10 of any of the group B? I'm also worried that the top 10 matches of many of the group A samples would be other group A samples, and I'm trying to gather a context for group A.

Looking at main.py I was thinking on doing:

import pandas as pd
import numpy as np
import h5py
import pp_sketchlib

Myselection= pd.read_csv('Samples.csv') # DF with one column with the sketches names

BigDB = h5py.File(Bigdb_prefix + ".h5", 'r')
SmallerDB = BigDB [BigDB ['sketches'].isin(Myselection['samples'])]

h5py.File(SmallerDB)

However I think that these h5 files do not behave as dataframes, so my code would not do what I want it to do, did I go too far and there is a simpler way?

Many thanks and sorry for the long question!

PS: I am not expecting you to write the code for me if it's necessary for what I want to do, but as I've never worked with these kind of files I need a little push in the right direction

andreaniml commented 2 years ago

Hello!

I've done some digging in the documentation of h5py and I got to this, I've tested the code with a mock dataset and it seems to work, could you confirm it?

# My idea is to get one h5 file for group B and one file for each member of group A in order to create a bash script with sketchlib and query each sample individualy

import numpy as np
import h5py
import pp_sketchlib
import pandas as pd
GroupB = h5py.File('GroupB.h5', 'w')
GroupB.create_group('sketches')
BigDB=h5py.File('661kdb.h5','r')
MySamples=pd.read_csv(MySamples.csv) # Only the GroupB sample names
GroupB_list=list(MySamples['sample_id'])

with GroupB as f_dest:
    for item in GroupB_list:
        if item in BigDB['sketches'].keys():
            BigDB.copy(BigDB['sketches/'+item],f_dest["/sketches"])

GroupB .close()

### Now the individual files

GroupA= pd.read_csv('GroupA.csv)
GroupA_list=list(GroupA['sample_id'])

for item in GroupA_list:
    hf_tmp = h5py.File(item + '.h5', 'w')
    hf_tmp.create_group('sketches')
    if item in BigDB['sketches'].keys():
        BigDB.copy(BigDB['sketches/'+item],hf_tmp["/sketches"])

This code would create the .h5 files that I need to do a bash script like

for fiile in S*.h5
do
     sketchlib query dist GroupB.h5 ${file} > Query_${file}.csv
done

(I have not tested the bash script yet)

So then I would have all the queries, I could sort and get the first x matches I want.

I have one more question: the original database has a "random" partition, do I need to copy it to the subseted ones? Should I create them aggain? I'm not sure if it affects the distance calculation (I think it doesn't, but I wanted to check)

johnlees commented 2 years ago

Yes sorry, we don't currently have an option to edit the databases – perhaps we could add this. The 'supported' way is through PopPUNK with --run-qc which lets you remove samples from a list, but this probably won't really work for a sketchlib-centric application.

But it looks like you have the code right. Useful references might be:

It might be more efficient to batch your queries rather than use the bash loop (i.e. all the queries in one file rather than lots of files)

I have one more question: the original database has a "random" partition, do I need to copy it to the subseted ones? Should I create them aggain? I'm not sure if it affects the distance calculation (I think it doesn't, but I wanted to check)

Best thing to do is remove it, then add it back in with sketchlib add random <h5 file> – you would do this for the reference database. But if you're using longer k-mer lengths (>17 for typical bacteria) then it doesn't have much effect and you can remove it.

andreaniml commented 2 years ago

Hi, the database was created with the default settings:

Running: BigDB['sketches']['SAMN10241008'].attrs['kmers'] I get: array([13, 16, 19, 22, 25, 28], dtype=int32)

I have kmers of 13 and 16 alongside larger kmers, better to use 'add random' then?

johnlees commented 2 years ago

Yeah, I would suggest doing so. If you're having trouble, another option would be to exclude 13-mers by providing --kseq 16,28,3