ChaissonLab / danbing-tk

Toolkit for VNTR genotyping and repeat-pan genome graph construction
BSD 3-Clause "New" or "Revised" License
23 stars 3 forks source link

danbing-tk align empty *.kmer output #23

Closed cappadoc closed 1 year ago

cappadoc commented 1 year ago

I conducted danbing-tk align with my short read-based bam file, which had been aligned against the GRCh37 reference using bwa mem. The command, in the working directory containing the presence of the RPGG pan files, completed without any errors, but the generated kmers output file (NA12878.tr.kmers) was empty and the file size of aln.gz file was only 1.3 MB. Is there any solution for this trouble?

My command and output on the console are the follows:

samtools fasta -@ 2 -n NA12878.rmdup.bam | danbing-tk -gc 80 -ae -kf 4 1 -cth 45 -o NA12878 -k 21 -qs pan -fai /dev/stdin -p 6 | gzip > NA12878.aln.gz

use baitDB: 0 extract fasta: 0 interleaved: 1 sim mode: 0 trim mode: 0 augmentation mode: 0 graph threading mode: 1 output alignment: 1 output successfully aligned reads only: 1 k: 21

of subsampled kmers in pre-filtering: 4

minimal # of matches in pre-filtering: 1 Cthreshold: 45 Rthreshold: 0.5 threading Cthreshold: 80 Running both step1 (kmer-based filtering) and step2 (threading) fastx: /dev/stdin query: pan.(tr/ntr).kmers

total number of loci in pan.tr.kmers: 80518 deserializing kmerDBi.umap deserializing kmerDBi.vv deserializing graph.umap reading kmers from pan.tr.kmers deserialized graph/index and read tr.kmers in 127 sec.

unique kmers in kmerDBi: 147568763

creating data for each process... threads created Buffered reading 300000 300000 0 Buffered reading 300000 600000 0 Buffered reading 300000 900000 0 Buffered reading 300000 1200000 0 Buffered reading 300000 1500000 0 Buffered reading 300000 1800000 0

Thank you for your help.

joyeuxnoel8 commented 1 year ago

Hi! Do you have an example of what the output of samtools fasta -@ 2 -n NA12878.rmdup.bam looks like?

cappadoc commented 1 year ago

It looks like the following:

>A00217:76:HFLT3DSXX:4:2608:14073:14810 AGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGATATCGGAAGAGCTTCGTGTAGGGAAAGAGTGTTAATTAGTGTGTAGTTATCGGATTTCGAAATATAAATAAAAAAATTAAATAAAAGAGAAAAAGTAATTATA >A00296:43:HCLHLDSXX:3:1140:23457:10520 AGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTAAGGGTTAGGGTTAGGGTTATCGGAAGAGAACACGTAAGAACTCCAGTAAGTTGCCTGGATCGGGTAAGCGGTCTGGTGCGGGTGAGGGGTGGGGGT >A00217:77:HFJWFDSXX:3:1215:18774:33411 GGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGATAGGGATATCGGAAGAGAAAACTAAAGAAATCCAGTAACTTTACTAGATATCGTAATGCGTCTTGTGCTTGTAATGGGTGGGGGGGGGGGGGG >A00217:77:HFJWFDSXX:3:1358:27046:24471 AGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGATATCGGAAGAGAAAACGACAGAAATCCAGTCAAATTACTAGATCGCGTAAGACGACTTCTGCTAGAAAAGGGTGGGGGGGGGGGGGGGGGG >A00217:77:HFJWFDSXX:4:2669:5828:29371 AGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTATCGGAAGAGAAATCTTAAGAGAAACTGTCAGGGTACTAGATAGCGTAAGGCTTCTTGTGAGTGAT >A00217:77:HFJWFDSXX:3:2201:16233:34992 GGGTTAGGGTTAGGGGTAGGGTTAGGGTTAGGGTTATCGGAAGTGATCACGTCAGAAGTACAGTAACTTTGCTTGTTCGGGTTAGGGGTCTTGTAGGGTAAGGGTAGGGGGTGGGGTCGGGGTAGGGTGGGGGGTGGGGGTGGGGGTGGG . . The bam has 675776582 reads in total. I got a successful result when using the sample bam file (HG00514.filtered.reads.bam) contained in the danbing-tk package.

joyeuxnoel8 commented 1 year ago

The reads look fine to me at first glance. Just checking -- is this sample paired-end sequencing? The read names of a read pair have to be the same for danbing-tk to map paired-end reads.

cappadoc commented 1 year ago

Yes, this data is paired-end reads with the same name for each pair, and the bam is sorted by coordinate. For example, reads of a pair are output as follows:

>A00296:43:HCLHLDSXX:3:2254:22779:3521 CTTGCTTACTGTATAGTGGTGGCACGCCGCCTGCTGGCAGCTAGGGACATTGCAGGGTCCTCTTGCTCAAGGTGTAGTGGCAGCACGCCCACCTGCTGGCAGCTGGGGACACTGCCGGGCCCTCTTGCTCCAACAGTACTGGCGGATTAT . . >A00296:43:HCLHLDSXX:3:2254:22779:3521 CTCCAGGGCTTCACCTGTTAGACTGGCAAAAATCCAAAAGTAAACACTTTGTGGAGAAACAGGCATTCCTAGACATTGCTGGTGGGATACAGAACAGTACAATTCTGATGGTAATCAGTTCACAAATTAAACATATTTATTTTTTACTTT

joyeuxnoel8 commented 1 year ago

Hi, sorry for the huge delay on following up. Have you figured out the reason why the mapping is not working? Let me know if I can help with debugging. Feel free to drop a test bam file here. -Tony

cappadoc commented 1 year ago

I reinstalled the latest version and run the job. The NA12878.tr.kmers and NA12878.aln.gz files seemed to be normally created. Thank you for your help.