bcgsc / HLAminer

⛏ HLA predictions from NGS shotgun data
Other
51 stars 15 forks source link

BWA mem vs bwa aln #2

Closed srx3001 closed 6 years ago

srx3001 commented 6 years ago

Hello I have ran the scripts as noted in the test folder and got identical results with bwa aln. Then I tried to use bwa mem, instead of bwa aln. I used: bwa mem rd1.fq rd2.fq > aln_2.sam

then ran the rest of the script the same as for bwa aln. I have noticed the results vary a lot, for HLA-II, the summary is identical, but I am getting a lot less HLA-I alleles with bwa mem.

Ill paste the results (from the legend part) here:

Alignments to HLA

legend +:Coding variant seen by another contig previously RBH:Reciprocal Best Hit (if applicable)

Contig/Read pair,HLA [% HSP/% CONTIG/READ PAIR] A26:33_145_318_1:0:0_1:0:0_70size200|cov1.00,A26:33, [99.00 %H/99.00 %C]
A26:33_130_343_0:0:0_1:0:0_91size200|cov1.00,A26:33,+ [99.50 %H/99.50 %C]
A33:24_139_308_0:0:0_0:0:0_14size200|cov1.00,A68:20, [100.00 %H/100.00 %C]
B55:29_308_467_1:0:0_1:0:0_1fsize200|cov1.00,B55:29, [99.00 %H/99.00 %C]
A26:33_145_316_0:0:0_0:0:0_7asize200|cov1.00,A26:33,+ [100.00 %H/100.00 %C]


Gene,HLA Supertype/Group,Protein coding allele,Score,Validated by N Contigs,x=contig seen,Chance (%),Expect (Eval) value,Confidence Score (-10 log10(P)) Score = sum(Sz cov seq.id. 2 [RBH]) P HLA = (P tig1 is HLA P HLA is tig 1) (P tig2 is HLA P HLA is tig 2) .. (P tig(n) is HLA P HLA is tig n) where P tig x is HLA = sum(P HLA for tig x)

HLA-A A,A26,A26:33,597,3,,100.0000,1.22e-11,109.1 A,A68,A68:20,200,1,,99.9770,2.30e-04,36.4

HLA-B B,B55,B55:29,198,1,,99.9808,1.92e-04,37.2

HLA-C

HLA-E

HLA-F

HLA-G

HLA-DPA1

HLA-DPB1

HLA-DQA1

HLA-DQB1

HLA-DRA

HLA-DRB1

HLA-DRB2

HLA-DRB3

HLA-DRB4

HLA-DRB5

HLA-DRB6

HLA-DRB7

HLA-DRB8

HLA-DRB9


SUMMARY MOST LIKELY HLA-I ALLELES (Confidence (-10 log10(Eval)) >= 30 Score >= 500) Gene,Prediction rank,Group,Allele,Score,Expect (Eval) value,Confidence (-10 log10(Eval))

HLA-A Prediction #1 - A26 A,1,A26,A*26:33,597,1.22e-11,109.1

HLA-B

HLA-C

HLA-E

HLA-F

HLA-G


SUMMARY MOST LIKELY HLA-II ALLELES (Confidence (-10 log10(Eval)) >= 30 Score >= 500) Gene,Prediction rank,Group,Allele,Score,Expect (Eval) value,Confidence (-10 log10(Eval))

HLA-DPA1

HLA-DPB1

HLA-DQA1

HLA-DQB1

HLA-DRA

HLA-DRB1

HLA-DRB2

HLA-DRB3

HLA-DRB4

HLA-DRB5

HLA-DRB6

HLA-DRB7

HLA-DRB8

HLA-DRB9

Why is the difference this big when using bwa mem? Should we always stick to bwa aln when using this tool?

warrenlr commented 6 years ago

Thank you for your interest in HLAminer. You may have to lower the score (and confidence) threshold for displaying HLA alleles/results. Personally, I have never been fond of the direct read alignment method, which is why I implemented the targeted assembly methodology. Minds you that when I did so, Illumina reads were very short (50 to 75 bp) and the direct alignment approach made less sense than it does now (now longer reads are able to capture phased variants).

I am not sure why bwa aln and mem are yielding different results out of the box, but in principle, you should be able to tune bwa mem (and minimap2 for that matter) to obtain similar results as bwa aln. Keep in mind that the latter two tools (mem and minimap) were originally designed to align longer sequences -- Tuning these tools to achieve a similar performance to bwa aln is really a question for Heng Li, however. fyi - We have never fully tested these tools with the HLAminer pipeline, only bwa aln.