arq5x / lumpy-sv

lumpy: a general probabilistic framework for structural variant discovery
MIT License

Filter VCF Prior to Running SVTyper #253

Closed andrewSharo closed 6 years ago

andrewSharo commented 6 years ago

Hi Ryan, I'm getting a very large number of SVs in my vcf (~10^6) after running lumpyexpress on a single sample. SVTyper runs very slowly, so I'm looking to filter my vcf prior to running SVTyper. Do you recommend filtering by SU? Filtering for SU > 20 gives ~20,000 SVs which is more manageable. Do you think I may be losing a lot of real calls by doing this? Should I focus instead on calls that have both split read and paired-end read support?

As a side note, do you think running lumpy jointly on multiple samples will help reduce the number of SVs per sample? I have about 100 samples, but have been running them individually. Best, Andrew
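For reference, a minimal sketch of the kind of SU filter being discussed, assuming LUMPY's standard `SU` ("supporting evidence") INFO tag; the threshold and the example records are illustrative only:

```python
import re

def filter_by_su(vcf_lines, min_su=20):
    """Yield header lines plus records whose INFO SU value exceeds min_su."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        info = line.rstrip("\n").split("\t")[7]  # INFO is column 8 of a VCF
        m = re.search(r"(?:^|;)SU=(\d+)", info)
        if m and int(m.group(1)) > min_su:
            yield line

records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t1\tN\t<DEL>\t.\t.\tSVTYPE=DEL;SU=25;PE=15;SR=10",
    "chr1\t2000\t2\tN\t<DUP>\t.\t.\tSVTYPE=DUP;SU=5;PE=5;SR=0",
]
kept = list(filter_by_su(records))  # header plus the SU=25 record only
```

In practice the same filter can be done in one pass with `bcftools view -i 'INFO/SU>20'`, but the sketch makes the logic explicit.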

ryanlayer commented 6 years ago

Human?


andrewSharo commented 6 years ago

Yes, human. I should have mentioned in the first post.

ryanlayer commented 6 years ago

Hg38?


andrewSharo commented 6 years ago

Yes, hg38.

ryanlayer commented 6 years ago

hg38 has a lot of contigs.

Try using this exclude file:

http://layerlabweb.s3.amazonaws.com/lumpy/hg38_lcr_rand.bed.gz


ryanlayer commented 6 years ago

I just updated this file, so you may need to redownload.


andrewSharo commented 6 years ago

Thanks, just downloaded. Will run and report back tomorrow.

andrewSharo commented 6 years ago

Hi Ryan, Just ran with exclude file. I still ended up with 626,907 SVs, which is a lot less than 1M but still too much to all give to SVTyper. Let me know if you have any other ideas to decrease the number of SVs found by Lumpy. As I said in my first post, I'm considering filtering by SU or SU and SR since SR is potentially more reliable.

ryanlayer commented 6 years ago

How many are BNDs?


andrewSharo commented 6 years ago

Here's the approximate breakdown:

196,000 duplications
240,000 inversions
33,000 deletions
158,000 break ends
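A breakdown like this can be computed from the `SVTYPE` INFO tag in the LUMPY VCF; a minimal sketch (the example records are illustrative):

```python
import re
from collections import Counter

def svtype_counts(vcf_lines):
    """Tally records by the SVTYPE INFO tag (DEL, DUP, INV, BND, ...)."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip headers
        m = re.search(r"(?:^|;)SVTYPE=(\w+)", line.split("\t")[7])
        if m:
            counts[m.group(1)] += 1
    return counts

example = [
    "##fileformat=VCFv4.2",
    "chr1\t100\t1\tN\t<DEL>\t.\t.\tSVTYPE=DEL;SU=12",
    "chr1\t200\t2\tN\t<DUP>\t.\t.\tSVTYPE=DUP;SU=8",
    "chr2\t300\t3\tN\tN[chr3:500[\t.\t.\tSVTYPE=BND;SU=6",
]
counts = svtype_counts(example)  # one DEL, one DUP, one BND
```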

ryanlayer commented 6 years ago

What is your read depth?


andrewSharo commented 6 years ago

Sorry for the slow response. Depending on how it's calculated, read depth is 48 or 51.

ryanlayer commented 6 years ago

At that depth, I would require at least 7 reads. 10 is probably even better.


andrewSharo commented 6 years ago

Great to know, and thanks for your help. So SU > 10. Should extra weight be given to calls with split read support?

ryanlayer commented 6 years ago

First off, many good calls will not have split-read support, and false positives can have both types of support. But calls backed by multiple types of evidence are more convincing. The most convincing to me are those that also show a coverage change.

We are also starting to visualize most of our calls with samplot and SV-plaudit.

https://github.com/ryanlayer/samplot

https://github.com/jbelyeu/SV-plaudit
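A call-level check for multiple evidence types could look like the following sketch, assuming LUMPY's standard `PE` (paired-end) and `SR` (split-read) INFO tags; the thresholds are illustrative:

```python
import re

def has_both_evidence(info, min_pe=1, min_sr=1):
    """True if a record's INFO field shows both paired-end (PE)
    and split-read (SR) support at or above the given thresholds."""
    def tag(name):
        m = re.search(r"(?:^|;)%s=(\d+)" % name, info)
        return int(m.group(1)) if m else 0
    return tag("PE") >= min_pe and tag("SR") >= min_sr

print(has_both_evidence("SVTYPE=DEL;SU=25;PE=15;SR=10"))  # True
print(has_both_evidence("SVTYPE=DUP;SU=5;PE=5;SR=0"))     # False
```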


andrewSharo commented 6 years ago

Can I get coverage changes directly from the lumpy output vcf or would I need to look at the bams directly with the tools you recommend? I see a BD INFO tag ("amount of bed evidence supporting the variant across all samples") but it seems none of the vcf entries actually have this tag.

ryanlayer commented 6 years ago

SVTyper can report the depth, but you will need to run it first.

The BD tag is for when you include a BED file of read-depth calls from something like CNVnator.


andrewSharo commented 6 years ago

Great, thank you.

SunWinner01 commented 1 week ago

Hello, I am currently filtering the result files obtained from LUMPY and SVTyper. My raw sequencing data is 10x coverage. I first filtered by SV length (50 bp to 1 Mb) and SV type (INV, DEL, DUP); for one example sample, this left 21,659 records. Next, I plan to filter by supporting reads. What is an appropriate SU threshold to set? I am also planning to filter on QUAL at the end. Do you have any suggestions for setting these values? Looking forward to your reply, and thank you. @ryanlayer