adamewing / tldr

Identify and annotate TE-mediated insertions in long-read sequence data
MIT License

Finding non fully embedded reads #24

Closed mmisak closed 2 years ago

mmisak commented 2 years ago

Hello,

I was wondering whether it is possible to run TLDR so that it also finds reads that do not fully embed the insertion, i.e. reads that support the insertion on one side only before the read ends.

Regarding the "--embed_minreads" parameter, the README says "Minimum number of reads completely embedding the insertion (default = 1, requires at least 1)". To my surprise, the program still started running when I set this parameter to 0. (However, it has already been running for 5 days on one sample.)

Would this actually give me reads that only support the insertion on one side, given that the README apparently discourages this setting?
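To make the distinction concrete, here is a minimal sketch of how the two kinds of supporting reads tend to look in an alignment: an embedded read carries the whole insertion as a long `I` operation flanked by aligned sequence, while a one-sided read usually ends in a long soft clip at the breakpoint. This is only an illustration and not how tldr itself works; the function names and the `min_len` / `min_flank` thresholds are hypothetical, and long soft clips can also come from chimeric reads or poor-quality read ends.

```python
import re

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a SAM CIGAR string into a list of (length, op) tuples."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def classify_read(cigar, min_len=50, min_flank=100):
    """Roughly classify how a read might support an insertion.

    min_len and min_flank are illustrative thresholds, not tldr defaults.
    Hard clips (H) are not treated specially in this sketch.
    """
    ops = parse_cigar(cigar)

    # Embedded: a long insertion with enough aligned bases on both sides.
    for i, (length, op) in enumerate(ops):
        if op == "I" and length >= min_len:
            left = sum(n for n, o in ops[:i] if o in "M=X")
            right = sum(n for n, o in ops[i + 1:] if o in "M=X")
            if left >= min_flank and right >= min_flank:
                return "embedded"

    # One-sided: the read runs into unaligned sequence at either end.
    if ops:
        first_len, first_op = ops[0]
        last_len, last_op = ops[-1]
        if (first_op == "S" and first_len >= min_len) or (
            last_op == "S" and last_len >= min_len
        ):
            return "one_sided"

    return "no_support"


print(classify_read("500M300I500M"))  # embedded
print(classify_read("1200M400S"))     # one_sided
print(classify_read("2000M"))         # no_support
```

In practice the CIGAR strings would come from the .bam (for example via pysam), and the clipped sequence would still need to be checked against a TE reference before counting it as support.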

adamewing commented 2 years ago

Hi, embedded reads are a hard requirement given the way tldr works, so the best workaround would probably be to create 'synthetic' reads with embedded TE insertions through assembly, align the contigs, and feed them to tldr along with the original .bam. Given that assembly tends to 'collapse' one or the other allele, it would be best if this were a haplotype-specific assembly. I don't have suggestions for what tools to use offhand (possibly pepper/deepvariant --> whatshap --> flye or something), but it's something I'd like to investigate in the near future. I'd also point you at PALMER: https://github.com/WeichenZhou/PALMER which I think handles insertions without the requirement for at least one embedded read.

I haven't tested --embed_minreads 0, but I'm surprised it didn't crash straight away. I will disallow that as well (if it does finish, I suspect the output won't be usable).

mmisak commented 2 years ago

Thanks for the detailed reply and thanks for also suggesting PALMER, I'll give it a shot!