EichlerLab / pav

Phased assembly variant caller

large Deletion >20 Mb #31

Closed bioinfogit closed 1 year ago

bioinfogit commented 1 year ago

I am getting very large deletions (>20 Mb) with PAV 2.2.3 (##fileformat=VCFv4.2, ##source=PAV 2.2.3), which I think is the latest Singularity image. This is the case for multiple human samples with very good assembly metrics and phasing. These are the largest calls for two samples. Considering both haplotypes have good contiguity, I am wondering if there is an issue with the SV calling.

sample1:
chr8 DEL -31.5771 Mb
chr7 DEL -47.4833 Mb
chr11 DEL -50.008 Mb
chr1 DEL -95.0136 Mb

sample2:
chr9 DEL -20.1368 Mb
chr9 DEL -20.8878 Mb
chr5 DEL -22.838 Mb
chr2 DEL -88.2308 Mb

paudano commented 1 year ago

There are a couple of potential causes for these. PAV calls SVs using two approaches, (a) by looking directly at the alignment (CIGAR string), or (b) by traversing loci where the alignment is split into two records. Indels and smaller SVs are called from the CIGAR string (a), but larger SVs cause the aligner to fragment contigs into two records, for example, one alignment stops at the left end of a deletion, then another alignment starts at the right end.

To make a deletion call this large, the aligner must be fragmenting, and PAV requires that the contig pick up on the right side where it left off on the left side (i.e. each side of the deletion is bordered by the same contig name and contiguous contig coordinates), and that's how PAV makes (b) calls. Minimap2 fragments large SVs, but LRA is often able to capture very large SVs in the CIGAR string (a), although I doubt deletions this large would be in a single alignment, and so I am pretty sure it's calling from fragmented alignments by (b).

I see two possibilities: 1) They are true deletions, although this seems very unlikely. I can't imagine many cells surviving a deletion this large even if heterozygous. They could also be somatic and aneuploid depending on where the sample came from (e.g. a tumor).

2) They are misassemblies. This is the most likely cause; the only deletions I have seen in the megabase range in the HGSVC work were misassemblies where a misjoin across segmental duplications caused a contig to map aberrantly and look just like a large deletion.

If it is a misassembly (2), then PAV drops all variants inside the deletion, on the assumption that contigs mapping within a deleted locus are aberrant alignments (which does happen).

To investigate this, you can look at the alignment BED (results/SAMPLE/align/aligned_tig_HAP.bed.gz, where SAMPLE and HAP match the sample/hap from the variant call) and see if there are many alignment records under the deletion. If you prefer to work with CRAMs, ask PAV to generate that file but with .cram instead of .bed.gz. It can also generate a UCSC track in BigBed format (tracks/SAMPLE/align/post-cut.bb, track color: pink=hap1, purple=hap2).
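For example, a rough check might look like the sketch below (the sample, haplotype, chromosome, and coordinates are placeholders, and it assumes the first three columns of the alignment BED are the reference chrom/start/end):

```bash
# Count post-trim alignment records overlapping the reported deletion locus.
# Substitute the region from the DEL call; many records from the same
# haplotype landing inside the "deleted" interval would point to a
# misassembly rather than a true deletion.
zcat results/sample1/align/aligned_tig_h1.bed.gz \
  | awk -v c=chr1 -v s=50000000 -v e=145000000 '$1 == c && $2 < e && $3 > s' \
  | wc -l
```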

If you determine it is a misassembly, using the PAV alignments or other tools (e.g. I think Flagger should be able to do this), then you can create a BED file of misassemblies in contig coordinates and give it to PAV as a misassembly filter. On the variant call, you'll see a TIG_REGION annotation for the deletion (the easiest place to get it is directly from the pre-merged BED, results/SAMPLE/bed/pre_merge/HAP/svindel_del.bed.gz), and you can make a contig BED record from that locus, which will be 1 bp long. Note that TIG_REGION is in 1-based coordinates, so subtract one from the start position. I would also pad it by at least 5 bp on each side, and maybe more, to eliminate false variants around the contig misjoin.
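As a concrete sketch of that conversion (the TIG_REGION value, contig name, output path, sample, and haplotype below are made up; it also assumes the contig name contains no ":" or "-"):

```bash
# TIG_REGION from the DEL call, e.g. "contig_123:1045678-1045678" (1-based).
# Convert to 0-based BED, pad 5 bp on each side, and write a filter file
# whose name matches the {sample}_{hap} filter pattern given to PAV.
echo "contig_123:1045678-1045678" \
  | awk -F'[:-]' -v OFS='\t' -v pad=5 '{
        s = $2 - 1 - pad; if (s < 0) s = 0;
        print $1, s, $3 + pad
      }' \
  | gzip > /path/to/tig_filter/sample1_h1.bed.gz
```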

With that filter, variants touching that region of the contig will be ignored, causing PAV to drop the false deletion and bring back all the variants underneath it. The configuration option is "tig_filter_pattern", which must contain sample and haplotype wildcard patterns (e.g. /path/to/tig_filter/{sample}_{hap}.bed.gz). Then force-rerun PAV from rule call_integrate_sources (add -R call_integrate_sources pav_SAMPLE.vcf.gz to the PAV run command).
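Putting it together, a sketch of the rerun after adding `"tig_filter_pattern": "/path/to/tig_filter/{sample}_{hap}.bed.gz"` to config.json might look like this (the Snakefile location, thread count, and sample name are assumptions; use whatever run command you normally launch PAV with):

```bash
# Force-rerun from call_integrate_sources so the contig filter is applied
# and the VCF is regenerated (Snakefile path assumes the Singularity image).
snakemake -s /opt/pav/Snakefile -j 16 \
    pav_sample1.vcf.gz \
    -R call_integrate_sources
```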

I think I need to develop some heuristics in PAV to detect and ignore large deletions like these, although they are pretty rare and I haven't seen multi-Mbp deletions with modern aligners. If you are able to share the alignments and variant calls from these, it might help me build this into PAV.

bioinfogit commented 1 year ago

Dear Peter,

Thanks for your thorough response. I also noticed that running against CHM13 with default options resulted in fewer large SVs. However, I still obtained a maximum deletion of ~21 Mb for the same sample, compared to a ~100 Mb deletion on hg38. Results >10 Mb for each reference:

hg38_noalt:
chr16 -12.7807 Mb .|1
chr14 -14.6153 Mb 0|1
chr1 -18.0844 Mb .|1
chr9 -18.9375 Mb 0|1
chr9 -19.6084 Mb 1|1
chr1 -28.6123 Mb .|1
chr1 -91.5013 Mb .|1
chr5 -109.399 Mb .|1

CHM13:
chr9 -10.6493 Mb 1|0
chr1 -14.4388 Mb .|1
chr1 -15.1913 Mb .|1
chr4 -17.679 Mb 0|1
chr9 -19.1693 Mb .|1
chr9 -21.9487 Mb .|1

I also tried LRA but got the following error. I used the following config file on the latest image (2.2.3; apparently there is no image for 2.2.4.1 or 2.2.4):

    {
        "reference": "hg38_primary.fa",
        "aligner": "lra",
        "inv_threads": 1,
        "inv_threads_lg": 1
    }

    Finished job 29.
    8 of 288 steps (3%) done
    Sorting 629464705 minimizers
    done
    Sorting 527676116 minimizers with multiplicity smaller than 30
    There are 131225680 minimizers left
    Waiting at most 20 seconds for missing files.
    MissingOutputException in rule data_align_ref_lra_index in line 73 of /opt/pav/rules/data.snakefile:
    Job Missing files after 20 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
    data/ref/ref.fa.gz.mmi
    completed successfully, but some output files are missing. 16
    Removing output files of failed job data_align_ref_lra_index since they might be corrupted: data/ref/ref.fa.gz.gli
    Exiting because a job execution failed. Look above for error message

paudano commented 1 year ago

You might have an old version of LRA that wrote the index file as ".mmi", but new versions write it as ".gli" (I think since LRA version 1.2). PAV runs LRA to generate the index, and it expects to find data/ref/ref.fa.gz.gli. If yours is generating data/ref/ref.fa.gz.mmi, then upgrade LRA or symlink the "mmi" file to data/ref/ref.fa.gz.gli (I believe the format is the same, but LRA would ignore it anyway).
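If you go the symlink route, the workaround could look something like this (a sketch, run from the PAV working directory, using the paths from the error above):

```bash
# Expose the old-style LRA index under the file name PAV expects.
ln -s ref.fa.gz.mmi data/ref/ref.fa.gz.gli
```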

paudano commented 1 year ago

As for the large deletions, I still think they are a misassembly and the best way to deal with it is to use the contig filter (comment above) to remove those calls. If you are able to share the SV calls and the alignments (results/SAMPLE/align/aligned_tig_HAP.bed.gz), I can use it to help improve this in a future version and figure out the right approach to ignore large false DELs.

bioinfogit commented 1 year ago

Thanks a lot, I will investigate this and report back here.