Setting the right quality thresholds

ealler commented 2 years ago

Hi! Thank you for a nice program! I have a question regarding the TE outputs. In my study of an Arabidopsis line, I find 138 TE movements. I think some of the results are not "real" movements but instead reflect alignments to other TEs with high homology. When looking at the corresponding BAM file (aligned with BWA-MEM) in a genome browser, I cannot confirm many of the movements. I have set the MAPQ threshold to 30. Do you suggest some kind of filtering on the output, such as filtering on allele frequency, or filtering on read length prior to running your program? Thanks again! Best, Emma Aller (Copenhagen University, Denmark)

shunhuahan commented 2 years ago

Hi @ealler,

Thanks for the question! I'm assuming that the 138 TE movements you are referring to are non-reference insertions. Could you clarify how you are confirming the TE movements, and maybe show one or two IGV screenshots of examples you have trouble confirming?

Here are some suggestions on how to run the program properly and interpret the results.

We have had good success applying ngs_te_mapper2 to 100 bp paired-end and single-end reads with >=30X coverage from Drosophila. I would recommend running ngs_te_mapper2 as a component method in McClintock, together with trimgalore to QC the fastq files and trim the adaptors before ngs_te_mapper2 runs (see details at https://github.com/bergmanlab/mcclintock).
What allele frequency distribution do you get for the non-reference TEs predicted by ngs_te_mapper2, and does it fit your assumptions? I would suggest filtering on allele frequency after running ngs_te_mapper2, based on what the data show and what your biological assumptions are.
ngs_te_mapper2 has one parameter that defines the maximum TSD size (--tsd_max). You could adjust its value based on the typical TSD size range for Arabidopsis.
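
As a rough illustration of the allele frequency suggestion, here is a minimal Python sketch that bins the frequencies of predicted insertions and then keeps only calls above a cutoff. The table layout, the column name allele_frequency, the example records, and the 0.25 cutoff are all assumptions made for illustration; they are not the actual ngs_te_mapper2 output format, so the parsing would need to be adapted to the real files.

```python
import csv
import io

# Hypothetical tab-separated table of predicted insertions. The column
# names and values are illustrative assumptions, not real tool output.
EXAMPLE = (
    "chrom\tstart\tend\tfamily\tallele_frequency\n"
    "Chr1\t1000\t1005\tTE_A\t0.95\n"
    "Chr1\t5000\t5004\tTE_B\t0.08\n"
    "Chr2\t300\t305\tTE_A\t0.50\n"
)


def read_calls(handle):
    """Read tab-separated records into a list of dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


def frequency_histogram(calls, n_bins=5):
    """Count calls in equal-width allele frequency bins over [0, 1]."""
    counts = [0] * n_bins
    for call in calls:
        af = float(call["allele_frequency"])
        # A frequency of exactly 1.0 goes into the last bin.
        index = min(int(af * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


def filter_by_frequency(calls, min_af):
    """Keep calls whose allele frequency is at least min_af."""
    return [c for c in calls if float(c["allele_frequency"]) >= min_af]


calls = read_calls(io.StringIO(EXAMPLE))
print(frequency_histogram(calls))
for call in filter_by_frequency(calls, min_af=0.25):
    print(call["chrom"], call["start"], call["family"], call["allele_frequency"])
```

Looking at the histogram first helps in choosing a cutoff that matches the biology: for example, a homozygous inbred line would be expected to show most true insertions at high frequency, while a pooled or heterozygous sample would not.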

Hope this helps a bit and let me know if you have follow up questions :)

Best, Shunhua

ealler commented 2 years ago

Hi Shunhua! Thank you so much for your fast and elaborate reply. You're right, the 138 TEs are non-ref. I am a bit confused.. :) I thought the pipeline aligned against a TE ref file first and then the genome, which would make all identified TEs "reference TEs".. Can you comment on the difference between nonref and ref TEs, with the very nice introduction figure you have in mind? And thank you for your suggestion on using McClintock, I will give that a try. Best, Emma

cbergman commented 2 years ago

Hi Emma

You are correct that ngs_te_mapper2 aligns reads to a TE library, then maps TE+flank reads to the reference genome. In that sense, all TEs identified by ngs_te_mapper2 are relative to the reference genome. However, there is a crucial difference between what is referred to as "non-reference" vs "reference" TE insertions in the variant calling community. Non-reference TE insertions are TEs that are not present in the reference genome assembly, but are present in the sample being analyzed. In other words, they are an insertion variant relative to the reference genome. In contrast, reference TE insertions are TEs that are present in the reference genome assembly and are either present or absent in the sample being analyzed. In other words, they are not an insertion variant relative to the reference genome, but rather a sequence that is present in the reference genome and may or may not be present in the sample. Figure 1A from the TEMP paper shows a clear example of a non-reference TE insertion (not present in reference, present in sample), while Figure 1B shows an example of a reference TE insertion that is not present in the sample (present in reference, not present in sample). Unfortunately, I am not aware of a good example of a figure showing a reference TE insertion that is present in the sample that I can point you to. Hopefully this explanation will get you moving in the right direction toward understanding these terms.
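
The definitions above can be summarized as a small decision table. The Python sketch below is only an illustration of the terminology; the function name classify_te and the label strings are hypothetical and are not part of ngs_te_mapper2.

```python
def classify_te(in_reference: bool, in_sample: bool) -> str:
    """Name a TE locus by its presence in the reference assembly and in the sample."""
    if in_sample and not in_reference:
        # An insertion variant relative to the reference genome.
        return "non-reference insertion"
    if in_reference and in_sample:
        return "reference insertion, present in sample"
    if in_reference and not in_sample:
        return "reference insertion, absent from sample"
    # Neither the reference nor the sample carries a TE here.
    return "no TE at this locus"


for in_reference in (False, True):
    for in_sample in (False, True):
        print(in_reference, in_sample, classify_te(in_reference, in_sample))
```

Only the first case is an insertion variant relative to the reference; the two "reference insertion" cases differ only in whether the sample still carries the TE.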

Thanks, Casey

ealler commented 2 years ago

Hi Casey. Again thanks for a fast and elaborate response. I understand, thanks for making that clear. Best, Emma

shunhuahan commented 1 year ago

Hi @ealler, I get the impression that no further follow-up is needed. I'm closing this thread now; please feel free to reopen it if you need more help with this issue, or open a new issue. Thanks!

Best, Shunhua

bergmanlab / ngs_te_mapper2
