ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Haplotype sampling docs - required indexes for vg giraffe #1484

Closed CormacKinsella closed 1 month ago

CormacKinsella commented 2 months ago

I have a question/issue on the docs for Minigraph-Cactus, on the haplotype sampling section:

In order to use haplotype sampling, run cactus-pangenome / cactus-graphmap-join with the --haplo option (and do not use --giraffe). This will create the giraffe indexes for the special .hapl haplotype index which (with the .gbz) is all you need to run vg giraffe using the current best practices.

I followed this, and when I ran vg giraffe specifying the .gbz and .hapl files, it automatically constructed .dist and .min indexes also. Are these used/required? The current docs suggest that .gbz and .hapl alone are used. For pipelines this might be an important distinction so that redundant indexes aren't created by multiple processes. To generate the four files prior to calling vg giraffe I have since been using --giraffe clip --haplo.

Thanks for any feedback and the great software!

glennhickey commented 2 months ago

giraffe always needs three indexes for mapping: gbz/dist/min

the old way was to construct these indexes for your graph with --giraffe clip which would produce graph.gbz, graph.dist, graph.min.

the new way is to specify --haplo. this produces graph.gbz (same as above) and graph.hapl. when you pass these two files to giraffe along with the kff from the reads, let's say HG002.kff, giraffe will make 3 new indexes on the fly and use them for mapping: graph.HG002.gbz, graph.HG002.dist, graph.HG002.min. These indexes are sample-specific and only relevant to the reads in question. You can usually just delete them after mapping.

we claim that the overhead of creating sample specific indexes for each run of giraffe in this way is made up for by the time saved mapping to them.

https://doi.org/10.1038/s41592-024-02407-2
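
The "new way" described above can be sketched as a short shell script. This is only an illustration under stated assumptions: the file names (graph.gbz, graph.hapl, HG002.kff, HG002.fq.gz, HG002.gam) are placeholders, the --haplotype-name and --kff-name options should be checked against `vg giraffe --help` for your vg version, and the names of the on-the-fly indexes simply follow the graph.HG002.{gbz,dist,min} pattern given in the comment. By default the script only prints the commands (DRY_RUN=1).

```shell
#!/bin/sh
# Sketch only: file names are placeholders, and option names should be
# verified with `vg giraffe --help` before use.
set -eu

DRY_RUN=${DRY_RUN:-1}   # 1 = print commands instead of running them

# Print the sample-specific index names, following the
# graph.HG002.{gbz,dist,min} pattern described in the comment above.
# $1 = graph prefix (e.g. graph), $2 = sample name (e.g. HG002)
sample_indexes() {
    printf '%s\n' "$1.$2.gbz" "$1.$2.dist" "$1.$2.min"
}

# Map reads with haplotype sampling: giraffe gets the .gbz, the .hapl
# and the sample's .kff, and builds the sample-specific indexes itself.
# $1 = graph prefix, $2 = sample name, $3 = reads (FASTQ)
map_reads() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "vg giraffe -Z $1.gbz --haplotype-name $1.hapl --kff-name $2.kff -f $3 > $2.gam"
    else
        vg giraffe -Z "$1.gbz" --haplotype-name "$1.hapl" --kff-name "$2.kff" -f "$3" > "$2.gam"
    fi
}

# Delete the on-the-fly indexes once mapping is done.
# $1 = graph prefix, $2 = sample name
cleanup_sample_indexes() {
    sample_indexes "$1" "$2" | while IFS= read -r f; do
        if [ "$DRY_RUN" = 1 ]; then
            echo "rm -f $f"
        else
            rm -f -- "$f"
        fi
    done
}

map_reads graph HG002 HG002.fq.gz
cleanup_sample_indexes graph HG002
```

In a pipeline, keeping the cleanup step next to the mapping step avoids accumulating one set of indexes per sample. Set DRY_RUN=0 only after confirming the printed commands match your installation.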

CormacKinsella commented 2 months ago

Ah that makes sense, thanks for the answer!

CormacKinsella commented 1 month ago

Hi Glenn, Thanks again for your answer, I got this working well now.

I do have a follow-up question for you on the issue of mapping very short reads (sub 40bp).

In vg issue 3998 it was raised that mapping extremely short reads may require constructing a minimizer index with reduced k, using vg minimizer.

Is it currently supported to do this as part of the haplotype subsampling option within vg giraffe? Since specifying a .hapl file builds the subsampled graph as well as the min and dist indexes, the option seems to be buried to an extent. Is there a way for users to get some control over the k values in the giraffe pipeline, for example by running the individual steps in building those sub-graph files?

thanks and best regards, Cormac

glennhickey commented 1 month ago

This is a great question. You probably need to do it manually: make the subsampled gbz with

    vg haplotypes [options] -i graph.hapl -k kmers.kff -g output.gbz graph.gbz

Then make the appropriate dist/min indexes as described in the issue. While more of a hassle, this shouldn't cost you any performance.

I'm not sure though, and strongly recommend reposting this issue in the vg github where you'll be able to get a definitive answer.
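
The manual route suggested above might look roughly like the following. Treat it as an unverified sketch: the `vg haplotypes` line is the one from the comment with `[options]` left out, while the `vg index -j` and `vg minimizer` steps are an assumption about how the dist/min indexes would be built, and the K and W values are arbitrary placeholders rather than recommendations. Newer vg releases may need additional options, so check `vg minimizer --help` and the vg issue tracker first. By default the script only prints the commands (DRY_RUN=1).

```shell
#!/bin/sh
# Sketch only: K and W are hypothetical values for very short reads;
# nothing here has been validated against a specific vg release.
set -eu

DRY_RUN=${DRY_RUN:-1}   # 1 = print commands instead of running them
K=${K:-15}              # hypothetical reduced minimizer k-mer length
W=${W:-8}               # hypothetical minimizer window length

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = graph prefix (graph.gbz and graph.hapl must exist)
# $2 = k-mer counts for the reads (.kff)
# $3 = output prefix for the sampled graph and its indexes
build_short_read_indexes() {
    # 1. Subsampled graph, as in the command quoted above.
    run vg haplotypes -i "$1.hapl" -k "$2" -g "$3.gbz" "$1.gbz"
    # 2. Distance index for the subsampled graph (assumed step).
    run vg index -j "$3.dist" "$3.gbz"
    # 3. Minimizer index with reduced k (assumed step).
    run vg minimizer -k "$K" -w "$W" -d "$3.dist" -o "$3.min" "$3.gbz"
}

build_short_read_indexes graph kmers.kff output
```

The three resulting files (output.gbz, output.dist, output.min) would then be passed to vg giraffe explicitly, instead of the .hapl and .kff files, so that giraffe does not rebuild the indexes with its default k.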

CormacKinsella commented 1 month ago

Thanks for your reply - will do!

CormacKinsella commented 1 month ago

Appreciate your help here, closing with a link to the new issue in vg