GATB / bcalm

compacted de Bruijn graph construction in low memory
MIT License
99 stars 20 forks source link

Can I submit a subset of kmers from the reads? #30

Open richarddurbin opened 6 years ago

richarddurbin commented 6 years ago

Hello Rayan et al,

I want the unitigs induced by a subset of the kmers from a read set. Is there any way to create a file of kmers that you can read into bcalm2, and then compact into unitigs?

In fact I want to pick out kmers that I think are single copy in a diploid, based on depth. So a simple way to do that would be for you to allow me to give a maximum as well as a minimum copy number for each kmer. I would then set those by looking at a histogram. But ultimately I would like the freedom to correct the kmer counts by GC content (of their parent read pairs), so it would be more general to allow me to pass you a set of kmers.

I guess I could make a fasta file out of them, with one kmer per sequence...

Thanks, Richard

richarddurbin commented 6 years ago

By the way, thanks for the algorithm and tool. It is very elegant.

rchikhi commented 6 years ago

Hi Richard, and thanks for the kind words!

For now, you can specify a minimum/maximum kmer abundance threshold using the options -abundance-min A -abundance-max B, noting that it will keep only kmers of abundance A<=x<=B.

Otherwise, the input formats of Bcalm are fasta/fastq reads, or a custom HDF5 file that contains counted k-mers. That format isn't that easy to create from scratch, but not impossible for us to maybe write a custom program for that. In that format, Bcalm expects that kmers are partitioned according to their minimizer.

Giving as input a fasta file with one kmer per sequence would be an acceptable solution, bearing in mind that abundance would be lost (unless the kmer is repeated as many times as its abundance, which isn't very elegant).

What's the volume of kmers that you wish to give to Bcalm: 10's of millions, billions? Also, in what format do you have the counted k-mers right now?