fade missing many reads that appear to be fragmentation artifacts

rhalperin commented 2 years ago

I ran fade on bam that we believe has high level of fragmentation artifacts. I am seeing that 8% of the reads are soft-clipped, while only 0.01% of the reads are classified by fade as artifacts. When I blat some of the soft clipped reads that remain after running fade out, I see that they show the characteristic alignment of the fragmentation artifact, with part of the read aligning to the forward strand and part to the reverse. When I look for these reads in the fade stats-clip output, I see that they have "artifact_status: false". Here is an IGV screenshot illustrating the issue: and here is the input bam subset to the region shown: STD347-81.reg.bam.zip:

rhalperin commented 2 years ago

I wanted to add that I've noticed that reads with longer stretches of soft clipping tend to be less likely to be removed. Taking a look at the code, I noticed this statement in analysis.d

    if ((res.cigar.ops.length == 0) | (res.cigar.ops.length > 10))
        return "";

Are you discarding alignments with length greater than 10, or am I not understanding the context of this statement correctly?

EDIT I realized this is probably checking the length of the cigar string itself and not the alignment, so it is probably not the problem (that would have been too easy).

charlesgregory commented 2 years ago

So this is a known issue with fade, though I would rather say it is a design flaw or really just part of the problem of finding enzymatic fragmentation artifacts.

I ran fade on bam that we believe has high level of fragmentation artifacts.

These look like characteristic artifacts, though when looking at your screenshot I don't see the soft-clipped sequences' reverse complement in the reference sequence (unless I missed something). This likely means that the true alignment position of the artifact is outside of fade's search window.

fade's process

fade looks at the local reference sequence to find reverse complement matches to the read as whole. If it finds a large enough match (> 5 bases) that is anchored on the soft-clipped side, the read is marked as artifact. Though in order to be remotely performant, the alignments performed by fade are done on a window of about 300 bases around the original read's alignment position. So if the true alignment location of the artifact is too far away, fade will miss it. You can use the flag -w to increase the window size. That may improve your results with fade.

Though in reality, fade needs some algorithmic improvement to be more effective, but it would require a major rewrite and employment of new techniques to find and record common artifact locations. I have considered developing potentially a fade2 with such improvements, but have yet to find the time or motivation. Though this issue has been coming to the forefront lately so it may be time to consider it.

I realized this is probably checking the length of the cigar string itself and not the alignment, so it is probably not the problem (that would have been too easy).

I doubt this is the issue as well, though to be honest I can't remember why I set this particular limitation. My guess would be fade used to only realign the soft-clipped portion instead of the whole read, and this was harder to extract on some more complex alignments. It can also make fade out -c process more difficult as I have to be careful to properly clip the alignment, which is more difficult with complex alignments. This limitation can likely be removed but it would only really effect reads that had a larger number of observed modifications like INDELS.

fade tries to be rather conservative to avoid removing real INDELs, as our primary data source when developing fade was human cancer samples. So this issue could also be a result of fade's alignment requirements being too stringent. I could expose the alignment parameters so that those are configurable as well.

Unfortunately, the best solution right now would be to adjust the -w fIag. I can also release a new version with the alignment parameters exposed if you would like to adjust those. If you find parameters that work better for fade that work better please let me know, the default parameters may need to be updated. Or if you want to venture into improving fade's algorithm, I would welcome a PR.

rhalperin commented 2 years ago

At the bottom of the screenshot, there are blat results of a few soft clipped reads that are not classified as artifacts by fade. The screenshot is only showing a 200bp window, and part of each read aligns to the forward strand, and part aligns to the reverse - isn't that what fade is trying to detect?

charlesgregory commented 2 years ago

Ah I completely missed that part of your post. Yes those should be caught. Would it be possible for you to provide me with this section of your bam file so that I may do some debugging/digging?

rhalperin commented 2 years ago

yes, the bam is attached to the original post

charlesgregory commented 2 years ago

Jeez, I missed that too. Thanks for your patience and for the detailed information, I will look into it and see what I find.

rhalperin commented 2 years ago

no problem, thanks for your help!

rhalperin commented 2 years ago

@charlesgregory do you have any updates on this? I'm wondering if it looks like this is fixable, or if we should be looking at other solutions. Thanks!

jblachly commented 2 years ago

with the data you provided it looks very addressable, and we are looking into parameter tuning

It's interesting how well FADE performs on our data, and the poor recognition/fix rate you are seeing, which suggests that needed parameters differ depending on different regions of the genome. Hopefully, a single set of more liberal search heuristics can be identified.

rhalperin commented 2 years ago

Great, thanks for the update!

charlesgregory commented 2 years ago

So this is likely due to fade not being great at detecting whether a SAM/BAM file is name sorted or not and due to how some of its algorithms work.

When fade annotate artifact reads, it does so on primary reads. The aligner (BWA) actually in a way detects some artifact reads by marking them with secondary and supplementary alignments. You see them in your screen shot as artifact reads. When I opened your bam file in IGV (after running fade) with it the only reads that appeared artifact were supplementary and secondary reads, all primary artifact reads had been removed. If it can detect name-sorting, fade will remove all reads with the same name, if any of them indicate having an artifact. If it can detect name-sorting, fade will only remove the primary reads that are marked as artifact. It prints a warning message when it can't detect name-sorting. When I ran your data with the lastest fade version after name-sorting it, fade didn't detect it as name-sorted. fade's current method of determining bam sorted-ness is not great, I have improved this and can push up the fix soon. After running with my fix, these reads no longer appeared in the final bam.

As for the fade stats-clip output, it looks at all reads that are soft-clipped so the supplementary "artifact" reads would appear in the output and show artifact status as false as fade doesn't analyze them currently.

The decision for fade to annotate primary reads only was made quite a while ago and I am not sure if it is the best approach anymore. I was worried about unintentionally removing supplementary reads that may be used to find INDELs. If I allow fade to operate on all reads, primary and secondary reads that are "artifacts" should be removed as well regardless of sorted-ness of the bam file. Then the only advantage of name-sorting before running fade out would be to ensure the read mate is removed. Though I still worry that solution may remove real evidence of mutation. I will investigate this but it may take me a bit longer to look into it. I will have to reanalyze some of our data from the paper.

rhalperin commented 2 years ago

I am seeing reads that are primary alignments that show the characteristic artifact pattern that are not being annotated as artifacts by fade as artifacts. However, I realized that I am not able to replicate the issue that I am seeing running fade in the docker image on my mac. I only see the issue running fade on a cloud workstation running with Ubuntu 18.04.3 LTS. On my mac I see 75 reads annotated as artifacts in the example bam, while on the cloud workstation I only see 14 reads annotated as artifacts. I see identical on the cloud workstation whether I run fade 0.5.5 installed via conda or fade 0.5.7 from the precompiled binary. Here is the bam I get from running fade annotate on the example bam on the cloud workstation: STD347-81.reg.bam.fade.anno.bam.zip I will test whether I see the same issue running on a different cloud workstation platform and post an update.

charlesgregory commented 2 years ago

That does sound strange. The mac and linux binaries from a given version should produce the same output, but I don't own a mac so I can't confirm that explicitly. I will add some integration tests to see if this is the case.

rhalperin commented 2 years ago

I'm not using the mac binary, I'm running the docker image on my mac. I also just tested the linux binary on a different system (cloud workstation running Ubuntu 18.04.6 LTS), and got the same results as on my mac. Something odd is happening on the first cloud workstation. I see the a much lower then expected artifact rate on any bams I have tested on that workstation, and I get the same artifact rate if I run the same bam twice on that workstation. If you are interested in troubleshooting this issue further, I can see if I can get you access. Otherwise, I am just going to run it elsewhere.

charlesgregory commented 2 years ago

Let me see if I can recreate what you're seeing. I can test fade on a Ubuntu 18.04.6 cloud instance with your provided test bam.

charlesgregory commented 2 years ago

I wasn't able to see what you are describing in my testing. Though I was able to update fade's annotate step to include supplementary and secondary reads which I think should address the original issue. I will have a new version out soon.

charlesgregory commented 2 years ago

I have a new version out v0.6.0 could you see if that addresses the extra artifacts you are seeing?

rhalperin commented 2 years ago

No, v0.6.0 has the same issue with missing artifact reads on a specific cloud workstation.

blachlylab / fade

fade missing many reads that appear to be fragmentation artifacts #27

fade's process