LengthIns=0, "StartTE/EndTE" are "NA"

adamewing / tldr

Identify and annotate TE-mediated insertions in long-read sequence data

MIT License

LengthIns=0, "StartTE/EndTE" are "NA" #7

Closed WeijiaSu closed 4 years ago

WeijiaSu commented 4 years ago

Hi @adamewing, I am trying to use TLDR with Drosophila ONT sequences. I tested it with only one particular TE (LTR) as the reference TE, and I got the result .txt file with no errors reported. However, the "LengthIns" column is 0 for all (~300) detected insertions, and "StartTE/EndTE" are "NA". I have a few known insertion sites that should be around 6000-7000 bp, and they are also reported with LengthIns=0. Thanks for your help. Weijia

adamewing commented 4 years ago

Are these filtered insertions (i.e. something other than "PASS" in the Filter column)?

WeijiaSu commented 4 years ago

Thanks for your reply. The Filter column shows "LeftFlankSize,RightFlankSize,UnmapCoverNA,NoTEAlignment,NonRemappable,ShortIns". I used a command line similar to the one in run_test.sh:

tldr -b test.sort.bam -e MyTestTE.fa -r MyGenome.fasta --color_consensus -p 16 --max_te_len 100000 --detail_output

Thanks for your help. Weijia

adamewing commented 4 years ago

Thanks. That will be why there is no TE length: those filters indicate there was no good TE alignment for that insertion. Do all insertions have this set of filters, or do some "PASS"? A few questions:

How was the .bam file aligned? I've mostly tested with minimap2 .bams.
What does the sequence ID from MyTestTE.fa look like?
Do you see filtered insertion calls at the expected TE insertion positions?
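
One way to answer the "do all insertions have this set of filters" question is to tally the Filter column of the results table. The following is a minimal sketch, not part of tldr: it assumes the output is tab-delimited with a header row containing a "Filter" column, and the other column names and all rows in the example are made up for illustration.

```python
import csv
import io
from collections import Counter


def tally_filters(handle, filter_column="Filter"):
    """Count how often each Filter value occurs in a tab-delimited table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[filter_column] for row in reader)


# Tiny made-up table for illustration only; this is not real tldr output,
# and "Chrom"/"Start" are hypothetical column names.
example = io.StringIO(
    "Chrom\tStart\tFilter\n"
    "chr2L\t100\tPASS\n"
    "chr2L\t200\tNoTEAlignment,ShortIns\n"
    "chr3R\t300\tNoTEAlignment,ShortIns\n"
)

counts = tally_filters(example)
for value, n in counts.most_common():
    print(n, value)
```

To run it on a real table, pass an open file handle instead of the in-memory example. If every row reports the same non-PASS filter string, the problem is more likely in the inputs than in individual insertions.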

WeijiaSu commented 4 years ago

Thank you. All insertions have the same filter; no entry is labeled "PASS". To answer your questions:

I used a minimap2 .bam.
The sequence ID is >HMS-Beagle; (only 1 TE in there)
I did see a few output entries that overlap with the expected insertions. I also manually checked the raw reads for one reported insertion using the *.detail.out files: all 36 supporting reads aligned to the TE, and some aligned over more than 6000 bp.

adamewing commented 4 years ago

I think I managed to replicate this. It's a documentation bug: I haven't explained what's expected of the TE reference file passed to -e. The .fasta header should include a superfamily (e.g. L1) and a subfamily (e.g. L1Ta) separated by a ":" (see ref/teref.human.fa for examples). So try changing ">HMS-Beagle" to ">TE:HMS-Beagle" and let me know what happens.
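
The header convention described above can be checked before running the pipeline. This is a minimal sketch, not tldr's own validation code: it assumes the sequence ID is the first whitespace-delimited token after ">" and that exactly two non-empty fields separated by ":" are required. The helper names check_te_headers and add_superfamily are hypothetical.

```python
import io


def check_te_headers(handle):
    """Return FASTA header lines whose ID is not 'superfamily:subfamily'."""
    bad = []
    for line in handle:
        if not line.startswith(">"):
            continue  # sequence line, nothing to check
        fields = line[1:].split()
        seq_id = fields[0] if fields else ""
        parts = seq_id.split(":")
        # Assumption: exactly two non-empty fields are expected.
        if len(parts) != 2 or not all(parts):
            bad.append(line.rstrip("\n"))
    return bad


def add_superfamily(header, superfamily="TE"):
    """Prefix a bare header such as '>HMS-Beagle' with a superfamily label."""
    return ">" + superfamily + ":" + header[1:]


# In-memory example: one bare header and one correctly formatted header.
example = io.StringIO(">HMS-Beagle\nACGT\n>L1:L1Ta\nACGT\n")

bad = check_te_headers(example)
print(bad)
print([add_superfamily(h) for h in bad])
```

Here "TE" is just the placeholder superfamily suggested in the comment above; a more specific superfamily name could be used instead.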

adamewing commented 4 years ago

Added a check for the formatting of the .fasta file passed to -e/--elts in a38312f. It should now refuse to use a TE reference file without the right formatting and explain what the format should look like.

WeijiaSu commented 4 years ago

Hi, I re-ran the pipeline as you suggested, and it looks like it worked properly. Thank you for your help! Weijia

adamewing commented 4 years ago

Great, will close this then.