cortes-ciriano-lab / savana

Somatic structural variant caller for long-read data
Apache License 2.0
46 stars 2 forks source link

No deletions, inversions, translocations called by the caller #22

Closed yashcrux closed 1 year ago

yashcrux commented 1 year ago

Hello

I have recently used this tool to process pairs of samples, each pair has a germline and a tumor sample. I am doing a somatic calling for SVs. The sample data is produced by pacbio and is CLR

SAVANA provided the vcf files (with the default option) for each pair without any errors but I can see only the INSERTIONS being called. Other SNV's cannot be seen.

Is this expected?

helrick commented 1 year ago

Hi there, thanks for the bug report. This is not expected behaviour.

Are there only insertions in the sv_breakpoints.vcf file as well? This VCF contains all the breakpoints without any filters applied, so should be able to inform whether the problem is happening at the filtering step or earlier.

yashcrux commented 1 year ago

I just searched all the sv_breakpoints.vcf files. and no other SNV's can be found

helrick commented 1 year ago

Interesting. I haven't seen this behaviour before. Which mapper was used to align your BAM files? Do you have an estimate of the sequencing depth of the samples?

Additionally, would you be able to share the command that you used to run SAVANA?

Thanks!

yashcrux commented 1 year ago

I am using minimap2 with the command minimap2 -t 8 -ax map-pb {input.reference} {input.fastq} > {output}

The sequencing depth is around 20 - 30 fold for both tumour and normal samples. Maybe some cases its lower but I am not sure

As for the SAVANA command : savana --tumour {input.Tumour_bam} --normal {input.Normal_bam} --outdir {output} --ref {input.ref}

helrick commented 1 year ago

Are you able to share an example subsetted BAM file?

Additionally would you be able to run the following on one of your sv_breakpoints.vcf files to see how many insertions are being reported?

grep -v "^#" {sv_breakpoints.vcf} | awk '{print $8}' | awk -F';' '{print $1}' | sort | uniq -c

yashcrux commented 1 year ago

Hi

The BAM file is actually not possible to share because of its size

83G Jun 14 13:15 ../../../../BAMS/m54274Ue_220414_145454.final.bam 76G Jun 13 22:23 ../../../../BAMS/m54274Ue_221001_133159.final.bam

Meanwhile the output from the command for the same pair as previous

107 GT 226151 SVTYPE=BND 791861 SVTYPE=INS

helrick commented 1 year ago

From that output it looks like there are quite a few non-insertion variants. BND and INS are the only two types of variants in the SAVANA output VCF by design. Since it's not possible to definitively determine an SVTYPE without copy number information, we've elected to report non-insertion variants as type BND (see a more detailed explanation about this from the GRIDSS SV caller here: https://github.com/PapenfussLab/gridss#why-are-all-calls-bnd).

Instead, we report the breakend orientation using brackets in the ALT field as described in section 5.4 of VCFv4.2. We also report this in a "BP_NOTATION" field which can be roughly converted to different nomenclatures as follows:

Nomenclature Deletion-like Duplication-like Head-to-head Inversion Tail-to-tail Inversion
BP_NOTATION +- -+ ++ --
Brackets (VCF) N[chr:pos[ / ]chr:pos]N ]chr:pos]N / N[chr:pos[ N]chr:pos] / N]chr:pos] [chr:pos[N / [chr:pos[N
5' to 3' 3to5 5to3 3to3 5to5
yashcrux commented 1 year ago

Sorry for late reply but thanks for the detailed explanation of annotations that you have included in the tool.

helrick commented 1 year ago

No worries, thank you for raising this issue. I've added the above information to the main README to prevent future confusion.