blachlylab / fade

Fragmentase Artifact Detection and Elimination
MIT License

FP variants in fade region are not totally filtered #32

Open xiucz opened 2 years ago

xiucz commented 2 years ago

Hi, after diving into your wonderful tool, I have one more question. After running my BAM file (reacalibrate.bam, produced following GATK best practices) through fade annotate and fade out (without -c), most FP variants no longer show up in the BAM (fade.bam, the first IGV panel).

The second BAM, mt2.bam, comes from the mutect2 --bamout option; let's ignore it.

However, some FP variants still cannot be filtered by fade. I know this is not a bug in fade.

We know that fade can filter/trim reads that meet fade's internal threshold. But what if a variant is supported both by reads that meet the threshold (read A) and by reads that do not (read B)? Should read B be removed as well?

image

Best, xiucz

charlesgregory commented 2 years ago

Thank you for your kind words about fade!

Currently the only reads fade can remove are those it identifies as "artifact". It determines this by realigning the read to the local sequence of the original alignment. If you queryname-sort your BAM/SAM file before using fade out, fade will remove any read pair in which either mate is identified as "artifact". To assess variants as a way of determining artifact reads, fade would have to use a variant caller or perform some rudimentary variant calling as part of its analysis.
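A minimal sketch of the pair-level rule described above, only to show why queryname sorting matters. It is not fade's actual code: the `Read` type and its `is_artifact` flag are hypothetical stand-ins for an annotated alignment record.

```python
from itertools import groupby
from typing import Iterable, Iterator, NamedTuple


class Read(NamedTuple):
    qname: str         # read (template) name
    is_artifact: bool  # hypothetical flag standing in for fade's annotation


def drop_artifact_pairs(reads: Iterable[Read]) -> Iterator[Read]:
    """Yield only reads whose template has no record flagged as artifact.

    Assumes queryname-sorted input, so all records of a template are adjacent.
    """
    for _, group in groupby(reads, key=lambda r: r.qname):
        mates = list(group)
        if not any(r.is_artifact for r in mates):
            yield from mates


# Toy input: pairA has one artifact mate, pairB has none.
reads = [
    Read("pairA", False), Read("pairA", True),
    Read("pairB", False), Read("pairB", False),
]
kept = list(drop_artifact_pairs(reads))
```

With coordinate-sorted input the two mates of a pair are generally not adjacent, so a grouping like this would see them separately and could keep the clean mate of an artifact pair.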

This would be quite out of fade's scope and would likely be a large undertaking.

Currently you could use fade's extract function to extract the artifact reads in BAM format. You could then use that BAM file to create a BED file of the regions in which you wish to ignore variants.
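As a rough sketch of the BED step, the following merges read intervals into BED lines. It assumes the alignment coordinates of the extracted artifact reads have already been read from the BAM (for example with pysam or samtools); the function name `merge_to_bed`, the `pad` parameter, and the sample coordinates are made up for illustration.

```python
from typing import Iterable, List, Tuple

# (chrom, 0-based start, exclusive end), the same convention BED uses
Interval = Tuple[str, int, int]


def merge_to_bed(intervals: Iterable[Interval], pad: int = 0) -> List[str]:
    """Merge overlapping or touching intervals and format them as BED lines.

    `pad` optionally widens every interval on both sides before merging.
    """
    merged: List[list] = []
    for chrom, start, end in sorted(intervals):
        start, end = max(0, start - pad), end + pad
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous region on the same contig: extend it.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [f"{c}\t{s}\t{e}" for c, s, e in merged]


# Hypothetical artifact-read alignments
artifact_reads = [("chr1", 100, 250), ("chr1", 200, 350), ("chr2", 50, 200)]
bed_lines = merge_to_bed(artifact_reads)
```

The resulting lines can be written to a file and passed to whatever variant filtering step you use. `bedtools bamtobed` followed by `bedtools merge` is a command-line alternative for the same conversion.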

Hopefully that helps answer your question.

xiucz commented 2 years ago

@charlesgregory

Thank you, using fade's extract is a good idea. So I have started looking for the breakpoint where the reverse complementation happened.

image

The picture is taken from the article.

  1. The first length refers to 47 bp (soft clip) + 8 bp (inverted repeat), which should be trimmed with the -c option.
  2. Breakpoint A refers to the reverse-complement point.
  3. The second length refers to 8 bp (inverted repeat) + 47 bp (possibly natural sequence). But it seems fade uses this strategy to trim bases.

I want to know which strategy should be used: the first length or the second length?

Best, xiucz