DecodeGenetics / graphtyper

Population-scale genotyping using pangenome graphs
http://dx.doi.org/10.1038/ng.3964
MIT License

Genotype_sv Aggregate model has less Output SVs than Input SVs #152

Open ghost opened 4 months ago

ghost commented 4 months ago

Good afternoon,

I'm writing this to report an issue I've been having while trying to do a test run for graphtyper's genotype_sv command.

I ran two SV callers, Manta and Smoove, on 50 samples and then merged their results with Jasmine_sv (which, like svimmer, retains the original caller's output information for each variant).

After this, I ended up with a VCF file containing approximately 130,000 structural variants.

I then ran the following command on graphtyper:

graphtyper genotype_sv Homo_sapiens_assembly38_HLA2.fasta \
  /path/to/jasmine_merged.vcf \
  --output=/path/to/50_samples_test \
  --region_file=/path/to/file/containing/contigs_of_interest.txt \
  --sams=/path/to/reheadered_bams.txt \
  --verbose

After this, I took the resulting 6468 VCF files and merged them using bcftools concat to create a final VCF. However, the output contained only 161,000 records, which seems low given that most variants have multiple records (due to the SVMODEL INFO field). After filtering for only the records with the AGGREGATED model, I was left with only 66,000 structural variants.
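For anyone reproducing these numbers, the per-model record counts can be tallied with a short script. This is only a sketch: it assumes the merged VCF is tab-separated with the SVMODEL key in the INFO column, as described above, and the function names and the command-line path argument are hypothetical.

```python
import gzip
import sys
from collections import Counter


def open_vcf(path):
    """Open a plain or gzip-compressed VCF for text reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "r")


def count_by_svmodel(lines):
    """Count VCF records per SVMODEL INFO value.

    Records without an SVMODEL key are counted under "<none>".
    """
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags
        info = dict(
            item.split("=", 1) if "=" in item else (item, "")
            for item in fields[7].split(";")
        )
        counts[info.get("SVMODEL", "<none>")] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python count_svmodel.py /path/to/merged.vcf.gz
    with open_vcf(sys.argv[1]) as handle:
        for model, n in count_by_svmodel(handle).most_common():
            print(f"{model}\t{n}")
```

Running the same tally on the input VCF (where every record should land under "<none>") gives the baseline count to compare against.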

My question is: does graphtyper filter or merge variants that it considers to be the same, even when they might not be? Why did the number of structural variants drop by almost 50% (from approximately 130,000 to 66,000)? Could it be that some variants in the original VCF file overlapped and graphtyper simply removed them?
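One way to gauge how many input records could collapse into each other is to count records in the input VCF that share the same contig, position, and SVTYPE. The sketch below does only that. Whether graphtyper actually merges such records is not established here, and the choice of (CHROM, POS, SVTYPE) as the comparison key is an assumption made for illustration, as are the function names.

```python
from collections import Counter


def site_key(line):
    """Return (CHROM, POS, SVTYPE) for a VCF record line, or None if malformed.

    Assumption: records are compared on this key only; SV length and
    end coordinate are ignored.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return None
    svtype = "<none>"
    for item in fields[7].split(";"):
        if item.startswith("SVTYPE="):
            svtype = item.split("=", 1)[1]
            break
    return (fields[0], int(fields[1]), svtype)


def shared_sites(lines):
    """Map each key that occurs more than once to its record count."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        key = site_key(line)
        if key is not None:
            counts[key] += 1
    return {key: n for key, n in counts.items() if n > 1}


def redundant_records(shared):
    """Number of records beyond the first at each shared site."""
    return sum(n - 1 for n in shared.values())
```

If `redundant_records` comes out far below the roughly 64,000 missing variants, identical-site duplicates alone would not explain the drop.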

hannespetur commented 1 month ago

Are there no log messages saying that some SVs were skipped?

If not, you could try adding the flag --force_no_filter_zero_qual and see if they appear then.

Best, Hannes