Closed CormacKinsella closed 1 month ago
giraffe always needs three indexes for mapping: gbz/dist/min.
The old way was to construct these indexes for your graph with --giraffe clip, which would produce graph.gbz, graph.dist, graph.min.
The new way is to specify --haplo. This produces graph.gbz (same as above) and graph.hapl. When you pass these two files to giraffe along with the kff from the reads, let's say HG002.kff, giraffe will make 3 new indexes on the fly and use them for mapping: graph.HG002.gbz, graph.HG002.dist, graph.HG002.min. These indexes are sample-specific and only relevant to the reads in question. You can usually just delete them after mapping.
We claim that the overhead of creating sample-specific indexes for each run of giraffe in this way is made up for by the time saved mapping to them.
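As a rough illustration of the naming pattern described above, here is a small shell sketch that predicts the sample-specific file names so a pipeline can check for them or clean them up. The function name is hypothetical, and the rule that the sample label is the kff file name minus its .kff suffix is only inferred from the HG002 example in this thread, not from documented giraffe behavior.

```shell
#!/bin/sh
# Hypothetical helper: list the sample-specific index names giraffe is
# described as creating on the fly (graph.HG002.gbz/.dist/.min).
# Assumption: the sample label is the kff base name without ".kff".
sample_index_names() {
    gbz="$1"                          # e.g. graph.gbz
    kff="$2"                          # e.g. HG002.kff
    base="${gbz%.gbz}"                # strip the .gbz suffix
    sample="$(basename "$kff" .kff)"  # strip directory and .kff suffix
    for ext in gbz dist min; do
        echo "${base}.${sample}.${ext}"
    done
}

sample_index_names graph.gbz HG002.kff
```

After mapping, the same list could be passed to rm -f if the sample-specific indexes are no longer needed.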
Ah that makes sense, thanks for the answer!
Hi Glenn, Thanks again for your answer, I got this working well now.
I do have a follow-up question for you on the issue of mapping very short reads (sub-40 bp).
In vg issue 3998 it was raised that mapping extremely short reads may require constructing a minimizer index with reduced k, using vg minimizer.
Is it currently supported to do this as part of the haplotype subsampling option within vg giraffe? Since specifying a .hapl file builds the subsampled graph as well as the min & dist indexes, it seems the option is buried to an extent.
Is there a way for users to get some control over the k values in the giraffe pipeline, for example by running the individual steps in building those sub-graph files?
Thanks and best regards, Cormac
This is a great question. You probably need to do it manually: first make the subsampled gbz with
vg haplotypes [options] -i graph.hapl -k kmers.kff -g output.gbz graph.gbz
Then make the appropriate dist/min indexes as described in the issue. While more of a hassle, this shouldn't cost you any performance.
I'm not sure though, and strongly recommend reposting this issue in the vg GitHub repository, where you'll be able to get a definitive answer.
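For reference, the manual route suggested above might look roughly like the following dry-run sketch, which only prints the commands rather than running them. The vg haplotypes line is taken from this thread; the vg index -j call and the vg minimizer flags (-k, -w, -d, -o) are assumptions about the vg command line and may differ between versions, so check vg index --help and vg minimizer --help first. The function name, the output prefix, and the k/w values are placeholders, not recommendations.

```shell
#!/bin/sh
# Dry-run sketch: prints the three manual steps instead of executing them.
# Assumptions (not confirmed in this thread): "vg index -j" builds the
# distance index, and "vg minimizer" accepts -k/-w/-d/-o as used below.
print_manual_index_commands() {
    graph_gbz="$1"    # full graph, e.g. graph.gbz
    hapl="$2"         # haplotype information, e.g. graph.hapl
    kff="$3"          # k-mer counts from the reads, e.g. HG002.kff
    out_prefix="$4"   # hypothetical prefix for the subsampled files
    k="$5"            # minimizer k-mer length (reduced for short reads)
    w="$6"            # minimizer window length
    # Step 1: subsample haplotypes into a sample-specific gbz.
    echo "vg haplotypes -i $hapl -k $kff -g $out_prefix.gbz $graph_gbz"
    # Step 2: distance index for the subsampled graph.
    echo "vg index -j $out_prefix.dist $out_prefix.gbz"
    # Step 3: minimizer index with the chosen k and w.
    echo "vg minimizer -k $k -w $w -d $out_prefix.dist -o $out_prefix.min $out_prefix.gbz"
}

# k=15 and w=8 are arbitrary placeholders; see the vg issue for suitable values.
print_manual_index_commands graph.gbz graph.hapl HG002.kff graph.HG002 15 8
```

Printing the commands first makes it easy to review them, and to adjust flags for the installed vg version, before piping the output to sh.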
Thanks for your reply - will do!
Appreciate your help here, closing with a link to the new issue in vg
I have a question/issue on the docs for Minigraph-Cactus, on the haplotype sampling section:
I followed this, and when I ran vg giraffe specifying the .gbz and .hapl files, it automatically constructed .dist and .min indexes also. Are these used/required? The current docs suggest that .gbz and .hapl alone are used. For pipelines this might be an important distinction so that redundant indexes aren't created by multiple processes. To generate the four files prior to calling vg giraffe I have since been using --giraffe clip --haplo.
Thanks for any feedback and the great software!