The preprocessing filter (`igdiscover filter`) currently keeps assignments with 90% V coverage or more. @mateuszatki increased this setting and was able to avoid some artifacts. The problem is that V matches that are too short can be counted towards the wrong gene/allele when counting exact occurrences: for a read to count as an exact occurrence, it is sufficient that the covered part of the V is identical to the novel V, so any differences in the non-covered part are ignored.
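To illustrate the problem, here is a minimal sketch. It is not IgDiscover's actual code: the function names and the toy alleles are made up, and it assumes the uncovered part is at the 3' end of the V, so only a prefix is compared.

```python
def v_coverage(observed_v: str, candidate_v: str) -> float:
    """Percentage of the candidate V covered by the observed V (hypothetical helper)."""
    return 100.0 * len(observed_v) / len(candidate_v)


def is_exact_occurrence(observed_v: str, candidate_v: str) -> bool:
    """True if the covered part of the candidate is identical to the observed V.

    Assumption: the observed V is truncated at the 3' end, so the covered
    part is a prefix of the candidate. The non-covered part is ignored.
    """
    return observed_v == candidate_v[:len(observed_v)]


# Two made-up alleles that differ only in their last nucleotide
allele_a = "ACGTACGTAC" * 10       # 100 nt, ends in C
allele_b = allele_a[:-1] + "T"     # same, but ends in T

# A read whose V match covers only the first 95 nt of allele_a
read_v = allele_a[:95]

print(v_coverage(read_v, allele_a))           # passes a 90% filter
print(is_exact_occurrence(read_v, allele_a))  # counted for allele_a
print(is_exact_occurrence(read_v, allele_b))  # also counted for allele_b
```

With a stricter coverage filter, the truncated read would be discarded before counting, so it could not be credited to the wrong allele.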
I have run a small test to see how many rows remain in dataset ERR1760498 at various filter settings. This is the result:
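| Minimum V coverage (%) | Rows remaining |
|-----------------------:|---------------:|
| 90 | 513505 |
| 94 | 513464 |
| 96 | 513412 |
| 97 | 513256 |
| 98 | 507612 |
| 99 | 468679 |
| 100 | 183050 |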
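So going to 97% is not a problem at all in this dataset (only 249 of 513505 rows are lost relative to the current 90% setting, about 0.05%), and even 98% is fine (about 1.1% lost). Above that, the loss grows quickly: roughly 9% at 99% and about 64% at 100%.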
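For reference, the relative loss at each setting can be worked out from the counts reported above. This is just arithmetic on those numbers, with the current 90% setting as the baseline. The variable and function names are made up for this sketch.

```python
# Rows remaining in ERR1760498 at each minimum V coverage (from the test above)
rows_remaining = {
    90: 513505,
    94: 513464,
    96: 513412,
    97: 513256,
    98: 507612,
    99: 468679,
    100: 183050,
}

# Baseline: the current filter setting of 90%
baseline = rows_remaining[90]


def percent_lost(threshold: int) -> float:
    """Percentage of baseline rows removed when raising the filter to `threshold`."""
    return 100.0 * (baseline - rows_remaining[threshold]) / baseline


for threshold in sorted(rows_remaining):
    lost = baseline - rows_remaining[threshold]
    print(f"{threshold:>3}%  lost {lost:>6} rows  ({percent_lost(threshold):.2f}%)")
```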
As a separate issue, we should consider force-extending all V alignments up to the last 3' nucleotide of the reference sequence.