adamewing / tldr

Identify and annotate TE-mediated insertions in long-read sequence data
MIT License

wrote 0 records to *table.txt #14

Open WeijiaSu opened 3 years ago

WeijiaSu commented 3 years ago

Hi, I am trying to run TLDR on a human nanopore dataset. I got thousands of clusters for each chromosome, for example: _2021-05-03 14:03:21,455 loaded 5734 clusters from blood_gDNA.fastq-TLDR/chr13.pickle_

but no records were written to the result table: _finished blood_gDNA.fastq-TLDR/chr13.pickle. wrote 0 records to blood_gDNA.fastq-TLDR.table.txt_

This happened for every chromosome. Do you have any idea why?
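To see at a glance whether every chromosome behaves the same way, the two log messages quoted above can be paired up per pickle file. The following is a minimal sketch, not part of tldr: the regular expressions are modelled only on the two lines quoted in this report, so other tldr versions may word their log messages differently, and `summarize_log` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Patterns modelled on the two log lines quoted in this issue (assumption:
# other tldr versions may phrase these messages differently).
LOADED = re.compile(r"loaded (\d+) clusters from (\S+)")
WROTE = re.compile(r"finished (\S+?)\. wrote (\d+) records to (\S+)")


def summarize_log(lines):
    """Return {pickle_path: {"clusters": int|None, "records": int|None}}."""
    summary = defaultdict(lambda: {"clusters": None, "records": None})
    for line in lines:
        m = LOADED.search(line)
        if m:
            summary[m.group(2)]["clusters"] = int(m.group(1))
            continue
        m = WROTE.search(line)
        if m:
            summary[m.group(1)]["records"] = int(m.group(2))
    return dict(summary)


if __name__ == "__main__":
    # In-memory example using the two messages from this report.
    example = [
        "2021-05-03 14:03:21,455 loaded 5734 clusters from "
        "blood_gDNA.fastq-TLDR/chr13.pickle",
        "finished blood_gDNA.fastq-TLDR/chr13.pickle. wrote 0 records to "
        "blood_gDNA.fastq-TLDR.table.txt",
    ]
    for path, counts in summarize_log(example).items():
        print(path, counts["clusters"], counts["records"])
```

Passing the lines of a saved log file to `summarize_log` gives one entry per chromosome pickle, which makes it easy to confirm that clusters are loaded everywhere while zero records are written.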

Thanks for your help.

Weijia

clemgoub commented 1 year ago

I'm having the same issue!

I was testing with simulated insertions on human chr 22 (~10 simulated Alu/L1/SVA insertions, 10 simulated non-TE insertions). I can recover these SVs with sniffles2. TLDR finds clusters, but writes 0 records. When I apply --detail_output I see 20 fasta files (+20 bams), but the fasta files contain only headers, no sequences.
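A quick way to confirm the "headers only" observation across all of the detail output is to count sequence characters per FASTA record. This is a minimal sketch using only the Python standard library; the `*.fa` glob pattern and any directory name passed in are assumptions, since the exact file names written by --detail_output are not shown here.

```python
from pathlib import Path


def fasta_sequence_lengths(path):
    """Return {header: sequence_length} for one FASTA file."""
    lengths = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:]
                lengths[header] = 0
            elif header is not None:
                lengths[header] += len(line)
    return lengths


def find_empty_records(directory, pattern="*.fa"):
    """Map each FASTA file under `directory` to its headers with no sequence.

    `pattern` is an assumption; adjust it to the extension actually used.
    """
    empty = {}
    for fasta in sorted(Path(directory).rglob(pattern)):
        missing = [h for h, n in fasta_sequence_lengths(fasta).items() if n == 0]
        if missing:
            empty[str(fasta)] = missing
    return empty
```

Calling `find_empty_records` on the detail output directory returns only the files that contain at least one header without sequence, so an empty result would mean the FASTA files are fine.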

I'm looking forward to your feedback!

Thanks,

Clément

WeijiaSu commented 1 year ago

I checked a few reads from the clusters, and I think they are real insertions. I'm not sure why they were not included in the final output.

adamewing commented 1 year ago

Hi, sorry to hear it's not returning anything. Are either of you able to send a .bam file with known insertions that aren't being picked up?

clemgoub commented 1 year ago

Hi Adam! No problem!

I think in my case it comes from the read source. I haven't looked in detail yet (I'll compile the bams, vcf and all the details first thing tomorrow), but it seems to work with a bam made from hifi reads and reports nothing with a bam made from ONT reads (both read sets are simulated from the same variants). The reads were mapped with minimap2 using either the ont or the hifi preset. More details tomorrow! Thanks for your help!

Clément

clemgoub commented 1 year ago

Dear Adam,

Here are the data with some explanation. In the example I'm sending you, I simulated a chromosome 22 (based on the hg38 ref) with 14 TE insertions (10 Alu, 3 L1 and 1 SVA), 6 TE deletions (4 Alu, 1 L1, 1 SVA), 10 random insertions and 10 random deletions. These simulated variants can be found in the file sim12.vcf.

data: https://drive.google.com/file/d/1yihbwah1xj-_hC_M28a4HAyRH6rJ7JQK/view?usp=sharing

From the simulated genome sim12.simseq.genome.fa.gz, I simulated 10X ONT reads with pbsim3 (using the ONT error model) or 10X hifi reads with pbsim3 (using the PB Sequel II error model and 10 passes per read) + ccs (to make the hifi consensus). Each read set was mapped to the reference using minimap2, and then I ran tldr.

For ONT:

minimap2 -ax map-ont hg38.p14.chr22.fa sim12_0001.fastq.gz | samtools sort -m4G -@4 -o sim.bam  -
tldr -b sim.bam -e ~/bin/tldr/ref/teref.ont.human.fa -r hg38.p14.chr22.fa -p 2

For hifi:

minimap2 -ax map-hifi hg38.p14.chr22.fa sim12_0001.hifi.fastq.gz | samtools sort -m4G -@4 -o sim.bam  -
tldr -b sim.bam -e ~/bin/tldr/ref/teref.ont.human.fa -r hg38.p14.chr22.fa -p 2

Note that I used teref.ont.human.fa in both cases; would you recommend using teref.human.fa instead for hifi? In any case, tldr reported variants for the hifi data, but not for the ONT data.

Here is the detail of each file:

sim_tldr
├── hg38.p14.chr22.fa <-- ref genome
├── sim12.simseq.genome.fa.gz <-- simulated genome
├── sim12.vcf <-- vcf for the simulated genome (expected)
├── sim_hifi 
│   ├── sim.bam <-- hifi reads alignments
│   └── sim.table.txt <-- tldr output
└── sim_ont
    ├── sim.bam <-- ont reads alignments
    └── sim.table.txt <-- tldr output

While tldr returns 19 candidates (9 PASS) for the hifi data, nothing is reported for the ONT alignments. Using sniffles2 (sniffles --minsvlen 100 --reference hg38.p14.chr22.fa --input sim.bam --snf sim.snf --vcf sim.vcf) on each bam, I can recover most of the SVs (TE and random SVs), so I assume that my ONT bam is valid. Finally, I know that tldr doesn't report DELs, so I understand they would not show up anyway.
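For anyone repeating this comparison, the check of expected versus reported insertions can be scripted. The sketch below is only an illustration under several assumptions: the simulated VCF marks insertions either with `SVTYPE=INS` in INFO or with an ALT allele longer than REF, the tldr table is tab-delimited with a header row containing columns named `Chrom`, `Start` and `Filter`, and a 100 bp matching window is an arbitrary choice. The function names are hypothetical.

```python
import csv


def read_vcf_insertions(path):
    """Yield (chrom, pos) for VCF records that look like insertions."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
            info = fields[7] if len(fields) > 7 else ""
            symbolic = alt.startswith("<")
            # Assumption: insertions carry SVTYPE=INS or an ALT longer than REF.
            if "SVTYPE=INS" in info or alt == "<INS>" or (
                not symbolic and len(alt) > len(ref)
            ):
                yield chrom, pos


def read_table(path, chrom_col="Chrom", start_col="Start", filter_col="Filter"):
    """Return [(chrom, start, filter)] from a tab-delimited table with a header.

    The column names are assumptions; pass different ones if the table differs.
    """
    with open(path, newline="") as handle:
        return [
            (row[chrom_col], int(row[start_col]), row[filter_col])
            for row in csv.DictReader(handle, delimiter="\t")
        ]


def match_calls(expected, calls, window=100):
    """Split expected insertions into (found, missed) by distance to any call."""
    found, missed = [], []
    for chrom, pos in expected:
        hit = any(c == chrom and abs(s - pos) <= window for c, s, _ in calls)
        (found if hit else missed).append((chrom, pos))
    return found, missed
```

With the files listed above, `match_calls(list(read_vcf_insertions("sim12.vcf")), read_table("sim_hifi/sim.table.txt"))` would show which simulated insertions have a nearby tldr call, and the same call on `sim_ont/sim.table.txt` should report every insertion as missed if the table really is empty.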

Thanks for your help!

Clément

CWYuan08 commented 8 months ago

Dear @clemgoub,

Thank you for your post! I'm having the same issue and was wondering whether you have found a way to fix it.

best regards, CW

clemgoub commented 8 months ago

Hi @CWYuan08,

Unfortunately no. My guess is that TLDR didn't like my simulated reads. I ended up only testing TLDR on real data.

Cheers,

Clément