FelixKrueger / SNPsplit

Allele-specific alignment sorting
http://felixkrueger.github.io/SNPsplit/
GNU General Public License v3.0
52 stars 20 forks source link

BWA is not supported [as BWA does not allow Ns in the index] #19

Closed RoseString closed 6 years ago

RoseString commented 6 years ago

Hi @FelixKrueger ,

Thank you for the great tool! It is a great help to my research.

I have a question. Is it possible to add BWA support? Our lab is interested in doing population genetics with split reads. Basically, we will follow the GATK pipeline which recommends using BWA for read mapping. I have tried running bwa (with soft-clipping option off) followed by SNPsplit, but it failed. So I guess BWA output is not yet supported.

Also I'm wondering in a future release if we'll be able to run the tool with soft-clipping on? That would increase the mapping efficiency.

Thank you very much!

FelixKrueger commented 6 years ago

Hi @RoseString,

Thanks for your comments. I have personally not used BWA before (we seem to be one of the few places in the UK using Bowtie2...), but I imagine that BWA output should work with SNPsplit as well as long as it contains an MD:Z: string. Similarly, STAR requires a few special options to output this string, as is mentioned here:

SNPsplit requires the MD:Z: field of the BAM alignment to work out mismatches involving
masked N positions. Since STAR doesn’t report the MD:Z: field by default it needs to be
instructed to do so, e.g.:    
    --outSAMattributes NH HI NM MD

If you are familiar with BWA and could look into whether there are any options to output the MD:Z: string, else I can probably also take a look at some point.

I am afraid supporting soft-clipping would be a major addition to the code, and I can't see this happening any time soon.

Cheers, Felix

RoseString commented 6 years ago

Hi @FelixKrueger ,

Sorry for the delayed reply. I got really busy last few days.. Thanks for your willingness to help! Below is some info for a test run.

The bwa mem command I used was : bwa mem -L 1000,1000 ref/ref_N-masked.fa reads1.fq reads2.fq > out.sam (-L 1000,1000 was to turn off the soft-clipping, as recommended in the SNPsplit documentation)

Here are some example lines from this run. It seems MD:Z: field is automatically turned on, but other attributes like NH and HI are missing? I don't know if that's causing the issues. Thanks again!

HWI-ST330:350:H0WYEADXX:2:2211:4093:68465 147 NW_005082037.1 3820 60 148M = 3558 -410 AATCACAATGTTATAAATAAATAAAGTAATTAATAAATTAAGGGATTAAAAATGAAATAATAAAATAAATCATACCCATATGGAGGAAATAAATAAAACTGAATAAATTAATAAATAAAAATAAAGTAAGGGAAAATAAAAAAGAAAA DCCCC:A>4:>CEDEDEEEDC>EDCA>DCCDDFDEEDCDDCC@>C>=CAEEC@EFFFEFFCFHEEAHHE=CE@E@FAF@GCGBHEIJJIHJIGBJIIGBBCIGBJIGBGIIHJJJGJJJJJJJJJJJJIJJJJJJJIHHHHHFFFFFC NM:i:8 MD:Z:5T28T3A4A3T10C2G3G82 AS:i:108 XS:i:62 HWI-ST330:350:H0WYEADXX:2:1209:5246:70463 99 NW_005085055.1 1580 41 1M4I143M = 1957 524 TCCAAAAAATCACCAGCAATGTGTTCTATAATTAGCTGACTAAATACATAACTATAGCAAAGCAAGCAAATCAATTGTCTCTCCTTCTCTTCTCTCTTAGTTACTTTCCACCATGAGTTAAAGCAGAGTGTGCAAACAATACATATCT BFFFFFHHHHHJJJIJJJJJJJJIHGJJIJJJJJJJJIJIIJJJJJIJIJJJIJIIJJJJJJIJJJJJJHHFHEEFFFBEDCCEEEEDDDDCDDDDDDDDDDDDDDEDDEDDDDDCDDEDDCCDCDCDDCDDDDCCBDDDDDCEEDDD NM:i:8 MD:Z:9T38A8A85C0 AS:i:127 XS:i:125 XA:Z:NW_005082193.1,+94222,2M1D146M,6;NW_005081657.1,+235965,2M1D146M,9;NW_005082263.1,+32990,83M3I62M,10; HWI-ST330:350:H0WYEADXX:2:1209:5246:70463 147 NW_005085055.1 1957 40 57M1I90M = 1580 -524 CAAGGGAATTGCGGTCCGCAGCCACTCCAGGATGCTTAGCTCCTTCTTAATAAATCAGGGGGTAATATAAATCATGTTCCAACAATATTTTAGCTACTGTTTGAACTGTAGCCCAAGCTGCTGGAAATGCTTTAACCCACTGAGTTAA DDDDDCAB>0>BB><.BDBB@ADDDDCC?DDDDCDDBBB?BDDDCB@CDECDCDDDCCC>FFFFEFD@A=HEEGIHGGIHIJIJIJJIIGIGD?HIIHIED@GJJIGGIIGJIHGHGIIGJJGJIIFJGJIGHJHBDHHGFFFFF@ NM:i:4 MD:Z:18T20T20T86 AS:i:125 XS:i:125 XA:Z:NW_005084917.1,+2079,86M1I61M,4;NW_005082193.1,-94601,57M1I75M1I14M,7; HWI-ST330:350:H0WYEADXX:2:2105:2980:28194 99 NW_005081615.1 2320121 60 148M = 2320555 510 TGGATAGTTTTCACCAATTTTCCCCCCATTATAGAACTCTATATGCTCATGAAAAAAAACAAACATTACTTGAAAGAGAGATTTCCCTATTTTGCCCAGCTGAGAGCTCAGAAATCTTGCATGGGAGTGCAAAACACTTCTCGTGGAT CFFFFFHHHHHJJJJJJJJJJIJJJJJGHIIIJJJJJJJJGIJJJJJJJJIJIJJJJHFFDFEDCEEDDDDDDDDDDDDD?CCDDEDDDDDEECCDDDBACCCAA@ACDDDDDD+:>CCCDDDDDDD?BDDDDDDBDDDDDCBDDDBC NM:i:5 MD:Z:56T57G14T0G5T11 AS:i:123 XS:i:20 HWI-ST330:350:H0WYEADXX:2:2105:2980:28194 147 NW_005081615.1 2320555 60 76M = 2320121 -510 GGTAAGAGCAATGAGTCAAACACAAAAAAAGTGGGCAAACACAAAAACAGTGTGGTTTTTTTTCTCTTAACTTTGT >+(DDDDDDDDDDDDDDDCCDDDDFHJJJJIIHJHFIIHHFIJJJIGJJJJJHGJJJJJJJJJJHHHHHHFFFFFC NM:i:2 MD:Z:2C2C70 AS:i:70 XS:i:0

FelixKrueger commented 6 years ago

Thanks for that. I'll be going skiing soon so I'm not exactly sure when I will get to this. But I won't forget about it! Cheers, Felix

FelixKrueger commented 6 years ago

Hi @RoseString ,

I just had a look at this issue again. First, I indexed a CAST/B6 genome withbwa index, and then I aligned some CAST/B6 data to that genome with a command line very similar to yours:

bwa mem -t 8 -L 1000,1000 /bi/scratch/Genomes/Mouse/Castaneus_based_on_GRCm38.v5/CAST_EiJ_N-masked/BWA_B6_CAST.fasta file_L001_R1_val_1.fq.gz file_L001_R3_val_2.fq.gz. I was puzzled by two facts:

1) After filtering out non-primary and secondary alignments, I was very surprised to see that the first 2 alignments I was soft-clipped to more than 70% of the read length, despite supplying already adapter/quality trimmed reads and setting the penalty for clipping to -L 1000,1000:

NB501547:136:HLT3GBGX5:1:11101:15723:1049       163     4       110215658       40      19M47S  =       110215707       123     ATATCACCTCACTCAGCAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTGNGNANACATCTTG      AAAAAAEEEEEEEEEEEEE################################EE#E#E#AEEEEAEE   NM:i:0  MD:Z:19 AS:i:19 XS:i:0
NB501547:136:HLT3GBGX5:1:11101:17935:1050       99      2       137454998       60      75M     =       137455141       163     CCTCANGTCCATCATCTAGATCATAAGGGTTTCCTCAGCCATTGAAACTCTCCAGTACTNTTTNTNNTNGAAATT     AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE#EEE#E##A#AEEEEE  NM:i:7  MD:Z:5T5G47C3C1A0A1T6   AS:i:58 XS:i:0
NB501547:136:HLT3GBGX5:1:11101:17935:1050       147     2       137455141       46      55S20M  =       137454998       -163    ACTNNNGANATCATGCCNCAGNTTNANNNNNNNNNNNNNNNNNNNNNNNNNNNNNTGTCTGGCTTCCTGTGGTCA     EEA###AE#EEEEEEEE#EEE#EE#E#############################EEEEEEEEEEEEEEEAAAAA  NM:i:0  MD:Z:20 AS:i:20 XS:i:0

2) When proceeding with SNPsplit on some 1M mapped reads (command was SNPsplit --snp all_SNPs_CAST_EiJ_GRCm38.txt.gz --no_sort --paired), I found that it ran through just fine (not sure what you meant when you said it failed). It produced the following SNP coverage report:

===================
SNP annotation file:    all_SNPs_CAST_EiJ_GRCm38.txt.gz
SNPs stored in total:   20668547
N-containing reads:     0
non-N:                  995273
total:                  999687

Which I found very odd, as it seems to suggest that no reads covered any of the N-masked positions at all. I then tried the same thing with Bowtie2, and this is the result:

SNP coverage report
===================
SNP annotation file:    all_SNPs_CAST_EiJ_GRCm38.txt.gz
SNPs stored in total:   20668547
N-containing reads:     331645
non-N:                  645851
total:                  999976

...

Allele-specific paired-end sorting report
=========================================
Read pairs/singletons processed in total:               493533
        thereof were read pairs:                        484171
        thereof were singletons:                        9362
Reads were unassignable (not overlapping SNPs):         253607 (51.39%)
        thereof were read pairs:        247347
        thereof were singletons:        6260
Reads were specific for genome 1:                       124313 (25.19%)
        thereof were read pairs:        123590
        thereof were singletons:        723
Reads were specific for genome 2:                       114384 (23.18%)
        thereof were read pairs:        112067
        thereof were singletons:        2317
Reads contained conflicting SNP information:            1229 (0.25%)
        thereof were read pairs:        1167
        thereof were singletons:        62

So with Bowtie 2 I get ~50% allele-specific reads, but with BWA I get 0% for the exact same data! I then did some quick Googling how BWA handles Ns during the indexing procedure, and I found evidence that it randomly replaces N with C, A, T or G, please don't ask me why this would be a good idea...

So I am afraid the answer to your original question is probably: No, BWA does not support alignments to genome with N-masked bases, and hence BWA alignments are inherently not compatible with allele-specific sorting in SNPsplit.

RoseString commented 6 years ago

Thank you so much @FelixKrueger for looking into this in such details! It also sounds odd to me that BWA replaces Ns with random bases during indexing.

I really appreciate the time you put into this!