adamewing / tldr

Identify and annotate TE-mediated insertions in long-read sequence data
MIT License

Insertion consensus letter-case #35

Closed (maxfieldk closed 5 months ago)

maxfieldk commented 10 months ago

Hi Adam, thank you for this excellent piece of software. My question is about the letter-case convention used when writing the insert consensus sequence. I was expecting a single contiguous lower-case region for each insert. However, in many inserts I see the following pattern (the base counts are made up, but this is directionally what I've observed):

Upper case 100 bases
Lower case 100 bases
Upper case 100 bases
Lower case 600 bases
Upper case 100 bases

If I use --color_consensus, the first lower-case stretch is colored black, not yellow.

I was initially using the letter case to identify the TE coordinates, but I've since realized this is not reliable. How should I go about extracting the coordinates of the TE insert for the purpose of building a GTF annotation? I've attached an example photo of the consensus.

Screenshot 2023-11-15 162212

Thanks again!

adamewing commented 10 months ago

Lower case indicates sequence that did not align to the reference genome, so in general the insertion should be lower case. tldr allows multiple unaligned segments per insertion, but they'll only be coloured like that if they match something in the TE references. If this is human data, my feeling is that the example you posted is not a real insertion; it should have something in the filter column?
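
For anyone wanting to see where the unaligned segments fall in a consensus, here is a minimal sketch that lists every lower-case run with its coordinates. It does not use any tldr API: `lowercase_runs` and `min_len` are hypothetical names made up for illustration, and the toy sequence just mirrors the made-up base counts from the question. Letter case alone cannot say which run matches a TE reference, so this only locates candidate segments.

```python
import re


def lowercase_runs(consensus, min_len=1):
    """Return 0-based, half-open (start, end) spans of lower-case runs.

    Hypothetical helper, not part of tldr. Per the explanation above,
    lower case marks sequence that did not align to the reference genome.
    """
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[a-z]+", consensus)
        if m.end() - m.start() >= min_len
    ]


# Toy consensus with the layout described in the question:
# 100 upper, 100 lower, 100 upper, 600 lower, 100 upper.
toy = "A" * 100 + "c" * 100 + "G" * 100 + "t" * 600 + "A" * 100

runs = lowercase_runs(toy)
print(runs)  # [(100, 200), (300, 900)]

# The longest unaligned run is only a candidate for the TE-derived segment.
longest = max(runs, key=lambda span: span[1] - span[0])
print(longest)  # (300, 900)
```

Picking the longest run is an assumption, not something tldr guarantees, and it is worth checking the filter column before trusting any of these spans.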