adamewing / tldr

Identify and annotate TE-mediated insertions in long-read sequence data
MIT License

no PASS calls in human test data #32

Open yuliamostovoy opened 1 year ago

yuliamostovoy commented 1 year ago

Hi,

I'm having some trouble testing TLDR on human HiFi data. I'm using the provided teref.human.fa file and testing the program on a 6 Mb region that I know from other sources has some solid Alu and SVA insertions in this sample (I can manually verify them in IGV using the BAM file that I'm inputting to TLDR). The sample was sequenced to 30x. The BAM was aligned with pbmm2, which in theory (?) should be equivalent to minimap2 and has soft-clipping. My run looks like this:

tldr -b test_chr1_24000000-30000000.bam -e ~/local/tldr/ref/teref.human.fa -r ~/work/ref/hg38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa

The output is attached: test_chr1_24000000-30000000.table.txt

Thanks!

yuliamostovoy commented 1 year ago

After writing that, I tried running bamToFastq on my BAM file and re-aligned with minimap2, and now TLDR is working as expected (it found most of the Alu insertions, although not the SVA - in any case, the output seems reasonable now). There must be something about the BAM files produced by pbmm2 that TLDR isn't expecting? We have a bunch of samples that were mapped with pbmm2, so I wonder whether this is fixable without realigning all those reads.
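One way to narrow down how the two alignments differ is to compare how each aligner represents the same reads in the CIGAR column. The sketch below is only a diagnostic aid, not part of TLDR and not a description of how TLDR works. It assumes the CIGAR strings have already been extracted as plain text (for example, column 6 of SAM records), and the function names, the 50 bp threshold, and the example CIGAR strings are all hypothetical. Nothing in this thread establishes that pbmm2 and minimap2 output actually differ in clipping or insertion representation.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def summarize_cigar(cigar, min_len=50):
    """Count long insertions (I), soft clips (S) and hard clips (H) in one CIGAR.

    min_len is an arbitrary illustrative threshold, not a TLDR parameter.
    """
    summary = Counter()
    for length, op in CIGAR_RE.findall(cigar):
        if op in "ISH" and int(length) >= min_len:
            summary[op] += 1
    return summary


def summarize_reads(cigars, min_len=50):
    """Aggregate summarize_cigar over many reads, skipping unaligned ('*') entries."""
    total = Counter()
    for cigar in cigars:
        if cigar != "*":
            total += summarize_cigar(cigar, min_len)
    return total


if __name__ == "__main__":
    # Made-up CIGAR strings for illustration only.
    example = ["5000M300I4000M", "2000S6000M", "10S100M", "*"]
    print(dict(summarize_reads(example)))
```

Running the same summary on the reads overlapping a known insertion in each BAM would show whether the insertion appears as an in-read `I` operation, as clipped sequence, or not at all in one of the alignments.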

adamewing commented 1 year ago

Ah, that's interesting. I haven't tried .bam files from pbmm2 yet - do you know of a public dataset that uses this? If not, are you able to share a chunk of a .bam file around one of the aforementioned Alu insertions? (i.e. if it's public cell line data or something that can be shared and not patient data)

Regarding the SVA that's still being missed - if you have a look in IGV, are there reads that completely span the insertion?

yuliamostovoy commented 1 year ago

Yes, no problem, I'm using a 1000 Genomes sample for testing. I'm attaching the pbmm2 BAM file from a region +/- 20 kb around an Alu (which gets detected from the same reads aligned with minimap2). Thanks for your help!

The SVA is fully spanned by multiple reads, and TLDR detects a 4 bp 'NA' insertion there but not the full ~2800 bp SVA. I'm including that region realigned with minimap2 in case you want to take a look.

Attachment: bamfiles.tar.gz