akdess / BAFExtract

5 stars 5 forks source link

Baf for only one (the last in the list) chromosome is made. #13

Open chatterjee89 opened 3 years ago

chatterjee89 commented 3 years ago

Hi,

Thanks for making this wonderful tool! I have run into an issue:

After running the following command... samtools view ./WT_96_2.bam | BAFExtract -generate_compressed_pileup_per_SAM stdin ./CaSpER/dm6.list ./CaSpER/baf/wt96_2 50 0; BAFExtract -get_SNVs_per_pileup ./CaSpER/dm6.list ./CaSpER/baf/wt96_2 ./CaSpER/dm6_fasta_pileup/ 20 4 0.1 ./CaSpER/baf/wt96_2/WT_96_2.baf

...I get the following std output:

Generating pileup from SAM file with min mapp qual 50, min base qual 0 Dumping the pileups per SAM file stdin Allocating pileup memory. Done. Maximum # of alleles at each position: 65536 Phred+33 encoding @ 14. read (1): CACTCAATTATATACTTTATATGGTCGGAAAAGCTTCCTTCTGCCTGTAA D@@@@EHHEEHEHHEHHGGIH1G1D1FHHHGHGC@<C@HHHHECHHCE@H Finished reading.00. read
27379256 0 signal, 4037446 4-bit, 636538 8-bit, 26073 12-bit, 18 14-bit, 0 16-bit positions. Skipping loading check. 24265304 0 signal, 3313416 4-bit, 517137 8-bit, 14343 12-bit, 27 14-bit, 0 16-bit positions. Skipping loading check. 21308450 0 signal, 3421544 4-bit, 532190 8-bit, 24291 12-bit, 461 14-bit, 0 16-bit positions. Skipping loading check. 19142111 0 signal, 3866743 4-bit, 516616 8-bit, 16734 12-bit, 67 14-bit, 0 16-bit positions. Skipping loading check. 20211830 0 signal, 2818271 4-bit, 463435 8-bit, 20107 12-bit, 69 14-bit, 0 16-bit positions. Skipping loading check. 3663413 0 signal, 3939 4-bit, 0 8-bit, 0 12-bit, 0 14-bit, 0 16-bit positions. Skipping loading check. 1060875 0 signal, 256237 4-bit, 28894 8-bit, 2125 12-bit, 0 14-bit, 0 16-bit positions. Skipping loading check. Calling SNVs with minimum 20 total and 4 alternate read coverage and 0.100 min alternate frequency. Procesing 3R Loading pileup of length 32079331 Loaded 32079331 long pileup. Procesing 3L Loading pileup of length 28110227 Loaded 28110227 long pileup. Procesing 2R Loading pileup of length 25286936 Loaded 25286936 long pileup. Procesing X Loading pileup of length 23542271 Loaded 23542271 long pileup. Procesing 2L Loading pileup of length 23513712 Loaded 23513712 long pileup. Procesing Y Loading pileup of length 3667352 Loaded 3667352 long pileup. Procesing 4 Loading pileup of length 1348131 Loaded 1348131 long pileup.

Question 1: what does the 20 4 0.1 parameter for BAFExtract -get_SNVs_per_pileup mean?

It seems to have worked and I can see the .baf file along with the allele_counts.bin files for each chromosome in the output folder. This is the output:

$WT_96_2$WT_96_2.baf
      chr position alt ref coverage       baf        dev
21231   4    66150  15   8       23 0.6521739 0.15217391
21232   4   116697 142   0      142 1.0000000 0.50000000
21233   4   116733  87  53      140 0.6214286 0.12142857
21234   4   117428  55  24       79 0.6962025 0.19620253
21235   4   117647  27   0       27 1.0000000 0.50000000
21236   4   138445  16  13       29 0.5517241 0.05172414
21237   4   177684  42  14       56 0.7500000 0.25000000
21238   4   308618  26  12       38 0.6842105 0.18421053
21239   4   310183 111  86      197 0.5634518 0.06345178
21240   4   310405  23   2       25 0.9200000 0.42000000
21241   4   313589  21  18       39 0.5384615 0.03846154
21242   4   331005  19   3       22 0.8636364 0.36363636
21243   4   331333  18   8       26 0.6923077 0.19230769
21244   4   349108  16   9       25 0.6400000 0.14000000
21245   4   372148  20   9       29 0.6896552 0.18965517
21246   4   372423  42  15       57 0.7368421 0.23684211
21247   4   372620  13   8       21 0.6190476 0.11904762
21248   4   408746  56   0       56 1.0000000 0.50000000
21249   4   546394  14   8       22 0.6363636 0.13636364
21250   4   611349  22   0       22 1.0000000 0.50000000
21251   4   691913  33   0       33 1.0000000 0.50000000
21252   4   918729  58   0       58 1.0000000 0.50000000
21253   4   951893  23   0       23 1.0000000 0.50000000
21254   4   952037  33   2       35 0.9428571 0.44285714
21255   4   952409  31   0       31 1.0000000 0.50000000
21256   4   962628  23   0       23 1.0000000 0.50000000
21257   4  1016473  26   0       26 1.0000000 0.50000000
21258   4  1016485  24   0       24 1.0000000 0.50000000
21259   4  1028923   5  41       46 0.1086957 0.39130435
21260   4  1029043  26   0       26 1.0000000 0.50000000
21261   4  1115591  94   0       94 1.0000000 0.50000000
21262   4  1118484  33   0       33 1.0000000 0.50000000

HOWEVER, given that dm6 (fruit-flies) has 2L, 2R, 3L, 3R, 4, X, Y chromosomes, and that the generated baf file only ends up containing the values for chromosome 4, I don't quite understand what happened there. So, Question 2: How can I fix this?

This is for every sample run for the overall experiment. Any help would be greatly appreciated!

Regards, Deep

chatterjee89 commented 3 years ago

@akdess I think I may have figured out both the problems (first one is explained within the source code for main.cpp). But I need a solution for the second one.

I think BAFExtract is only pulling/writing for chromosome 4 because it is the only integer in the lot - fly chromosomes are written as 2L, 2R, 3L, 3R, X, Y and 4 and so there's an issue with detection or with writing somewhere. Am I wrong to assume so? Could you please help me fix this issue?

Regards, Deep