CSB5 / lofreq

LoFreq Star: Sensitive variant calling from sequencing data
http://csb5.github.io/lofreq/

Ignore non ACGT positions in reference #106

Closed andreas-wilm closed 3 years ago

andreas-wilm commented 3 years ago

Calling variants at reference positions that are not A, C, G or T doesn't make sense and, furthermore, produces non-ASCII output.

See the bug reported by Kostiantyn Dreval and Ryan Morin in LoFreq 2 somatic.

rdmorin commented 3 years ago

Our workaround is to simply filter these lines out at the end. The issue lies in the reference genome itself having non-ATCGN characters in a few places.
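A post-hoc filter of this kind could look roughly like the sketch below. It is only an illustration, not the script we actually run: the function name `keep_vcf_line` is hypothetical, and it assumes tab-separated VCF lines with REF in the fourth column and uppercase bases.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not part of LoFreq): returns 1 if a VCF line should be
 * kept, 0 if its REF column contains anything other than A, C, G, T or N.
 * Header lines (starting with '#') are always kept. */
int keep_vcf_line(const char *line)
{
    assert(line != NULL);
    if (line[0] == '#')
        return 1;

    /* REF is the 4th tab-separated column: skip CHROM, POS and ID. */
    const char *p = line;
    for (int col = 0; col < 3; col++) {
        p = strchr(p, '\t');
        if (p == NULL)
            return 0;           /* malformed: fewer than 4 columns */
        p++;
    }

    /* An empty REF field is treated as malformed. */
    if (*p == '\0' || *p == '\t' || *p == '\n')
        return 0;

    /* Every character of REF must be one of the allowed bases. */
    for (; *p != '\0' && *p != '\t' && *p != '\n'; p++) {
        if (strchr("ACGTN", *p) == NULL)
            return 0;
    }
    return 1;
}
```

Calling this on each line read from the output VCF and writing only the lines for which it returns 1 drops the affected records while leaving the header untouched.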

rdmorin commented 3 years ago

It looks like this issue can be handled by modifying plp.c to set any reference position that is not A, C, T or G to N, just after this line:

ref_base = (ref && pos < ref_len)? ref[pos] : 'N';

rdmorin commented 3 years ago

Adding this immediately below that line fixes the issue. Is there any way this could be applied as a patch?

if (!(ref_base == 'A' || ref_base == 'C' || ref_base == 'T' || ref_base == 'G' || ref_base == 'N')) {
    ref_base = 'N';
}

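For anyone who wants to try the logic outside of plp.c, here is a minimal stand-alone sketch that combines the existing lookup with the proposed check. The function name `get_ref_base` and its signature are assumptions made for illustration; only the variable names `ref`, `ref_len`, `pos` and `ref_base` come from the line quoted above.

```c
#include <stddef.h>

/* Hypothetical stand-alone version of the lookup plus the proposed check.
 * Not LoFreq code: it only mirrors the variable names of the quoted line. */
char get_ref_base(const char *ref, size_t ref_len, size_t pos)
{
    /* Same fallback as the original line: no reference or out of range -> 'N'. */
    char ref_base = (ref && pos < ref_len) ? ref[pos] : 'N';

    /* Proposed addition: anything that is not A, C, G, T or N becomes 'N'. */
    switch (ref_base) {
    case 'A': case 'C': case 'G': case 'T': case 'N':
        return ref_base;
    default:
        return 'N';
    }
}
```

Note that, like the patch as written, this also maps lowercase letters to N; whether the reference is uppercased before this point in plp.c is not something I have checked here.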
andreas-wilm commented 3 years ago

Thank you for the PR.