genomic-medicine-sweden / nallo

An analysis pipeline for long-reads from both PacBio and Oxford Nanopore Technologies (ONT), written in Nextflow.
https://genomic-medicine-sweden.github.io/nallo/
MIT License

Use SVDB merge for merging samples to case #424

Closed jemten closed 1 week ago

jemten commented 2 weeks ago

Hola! Merging per-sample SV calls into a case should ideally be handled by a tool that can cope with the imprecise locations of SVs. bcftools merge will only merge exact matches. One option is SVDB merge; others are Jasmine or SURVIVOR. It might be worth checking with @J35P312.

https://github.com/genomic-medicine-sweden/nallo/blob/1263d726c4d88c0728635119e948a04ba9c2bc48/subworkflows/local/call_svs/main.nf#L86
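To illustrate the point above, here is a minimal Python sketch of exact-match merging versus merging with a positional tolerance. It is a toy model only: the call tuples, the sample names, `merge_exact`, `merge_tolerant` and the `max_dist` value are all hypothetical and do not reproduce how bcftools or SVDB work internally.

```python
from collections import defaultdict

# Hypothetical simplified SV calls: (sample, chrom, pos, svtype, alt)
calls = [
    ("child", "chr1", 10_000, "DEL", "<DEL>"),
    ("mother", "chr1", 10_004, "DEL", "<DEL>"),  # same deletion, breakpoint shifted by 4 bp
]


def merge_exact(calls):
    """Group calls only when chrom, pos and alt are identical."""
    groups = defaultdict(list)
    for sample, chrom, pos, svtype, alt in calls:
        groups[(chrom, pos, alt)].append(sample)
    return list(groups.values())


def merge_tolerant(calls, max_dist=10):
    """Group calls of the same type whose positions lie within max_dist bp."""
    groups = []  # each entry: [chrom, svtype, anchor_pos, samples]
    for sample, chrom, pos, svtype, alt in sorted(calls, key=lambda c: (c[1], c[2])):
        for group in groups:
            if group[0] == chrom and group[1] == svtype and abs(pos - group[2]) <= max_dist:
                group[3].append(sample)
                break
        else:
            # No existing group is close enough, so start a new one.
            groups.append([chrom, svtype, pos, [sample]])
    return [group[3] for group in groups]


print(merge_exact(calls))     # the two samples end up in separate records
print(merge_tolerant(calls))  # the two samples share one record
```

With exact matching the child and mother calls stay on separate records, which is the situation that would break inheritance patterns in a family case.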

fellen31 commented 1 week ago

Hm, my question would be: what if you have a sample with a call that has good, exact breakpoints, and then you merge it with 50 other samples and the result becomes less exact?

My idea was that the annotation with SVDB query would be imprecise (and so annotate SVs that are the same, but not exact matches, with the same annotations), but I understand that this would lead to the same SV being reported twice in a "family" / CG case.

J35P312 commented 1 week ago

In general, the "preciseness" of SVs varies across the genome, even within high-quality data. There are biological reasons complicating the positioning of SVs as well, such as microhomology.

BCFtools is nice for small SVs: they behave like indels, so they can be merged based on the ALT sequence. For large SVs you need to take the start, end and SVTYPE into account. BCFtools does not look at the END tag, so it will treat the SV as a single point. In that case you are better off setting bnd_distance to 1 in SVDB.

But in truth, it's probably better to apply some custom approach for the population genomics projects. I would recommend merging the Sniffles2 files directly using Sniffles2, for instance.
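A rough Python sketch of why ignoring END matters. The `SV` class, `same_by_point` and `same_by_interval` are hypothetical helpers written for illustration, not BCFtools or SVDB code; `max_dist=1` only stands in for the strict breakpoint distance suggested above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    chrom: str
    start: int
    end: int
    svtype: str


def same_by_point(a, b):
    """Compare on chrom and start only, ignoring END (single-point view)."""
    return (a.chrom, a.start) == (b.chrom, b.start)


def same_by_interval(a, b, max_dist=1):
    """Require same chrom and type, and both breakpoints within max_dist bp."""
    return (
        a.chrom == b.chrom
        and a.svtype == b.svtype
        and abs(a.start - b.start) <= max_dist
        and abs(a.end - b.end) <= max_dist
    )


# Two deletions that start at the same base but differ greatly in length.
small = SV("chr2", 50_000, 51_000, "DEL")
large = SV("chr2", 50_000, 90_000, "DEL")

print(same_by_point(small, large))     # the single-point view cannot tell them apart
print(same_by_interval(small, large))  # comparing both ends separates them
```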

"but I understand that this would lead to the same SV being reported twice in a "family" / CG case."

Not only that! It's important to merge the SVs to get the correct inheritance patterns.

fellen31 commented 1 week ago

> In general, the "preciseness" of SVs varies across the genome, even within high-quality data. There are biological reasons complicating the positioning of SVs as well, such as microhomology.
>
> BCFtools is nice for small SVs: they behave like indels, so they can be merged based on the ALT sequence. For large SVs you need to take the start, end and SVTYPE into account. BCFtools does not look at the END tag, so it will treat the SV as a single point. In that case you are better off setting bnd_distance to 1 in SVDB.
>
> But in truth, it's probably better to apply some custom approach for the population genomics projects. I would recommend merging the Sniffles2 files directly using Sniffles2, for instance.

Thanks Jesper. If we do want to use SVDB rather than Sniffles2 for merging calls, do you think the default overlap of 0.6 and BND distance of 10,000 are reasonable both for creating a small dataset of, say, 100-1000 samples and for a CG case?

We should also merge calls within-sample from HiFiCNV with calls from Severus/Sniffles, same question there :)
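For readers unfamiliar with the two thresholds in question, here is a hedged Python sketch of how an overlap fraction and a breakpoint distance could jointly decide whether two calls are merged. `should_merge` is a hypothetical function; the overlap definition used (shared bases divided by the total span) is an assumption and may differ from what SVDB actually computes.

```python
def should_merge(a, b, overlap=0.6, bnd_distance=10_000):
    """a and b are (start, end) intervals on the same chromosome with the same SV type."""
    (a_start, a_end), (b_start, b_end) = a, b

    # Criterion 1: both breakpoints must lie within bnd_distance of each other.
    if abs(a_start - b_start) > bnd_distance or abs(a_end - b_end) > bnd_distance:
        return False

    # Criterion 2: shared bases divided by the total span covered by the two calls.
    shared = max(0, min(a_end, b_end) - max(a_start, b_start))
    total = max(a_end, b_end) - min(a_start, b_start)
    if total == 0:
        return True  # identical zero-length calls
    return shared / total >= overlap


# Two 100 kb deletions shifted by 8 kb: close breakpoints and about 85% overlap.
print(should_merge((1_000_000, 1_100_000), (1_008_000, 1_108_000)))

# Two 5 kb deletions shifted by 3 kb: breakpoints are close, but only 25% overlap.
print(should_merge((100_000, 105_000), (103_000, 108_000)))
```

Under these assumptions, the distance threshold dominates for large events, while the overlap threshold is what keeps small, shifted events apart.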

> Not only that! It's important to merge the SVs to get the correct inheritance patterns.

Yes, definitely!

adameur commented 1 week ago

In my opinion, what should be considered the same SV is a philosophical question, and most likely we'll never find a tool that works perfectly. One option could be to look at what is being done in big projects around the world, so that we use an approach that facilitates international collaboration. For example, if we're using CoLoRSdb for filtering, maybe it would make sense to use a similar approach to theirs. But I don't know, maybe there are good reasons to choose some other option. In any case, I think it's a really interesting and important question. Maybe graph genomes can improve this at some point, but that feels quite far in the future.

fellen31 commented 1 week ago

Seems like the most appropriate action is to separate building and exporting a VCF for larger population calling and in-house databases (#372) from exporting a merged case/project VCF (this issue).