TE library construction and Input file quality control

abcyulongwang commented 6 months ago

Dear cgroza

Thank you for developing GraffiTE. It will obviously be cited a lot in the future, and it is very helpful and inspiring to me at the moment! I currently have Nanopore data for nearly 100 samples and 30 high-quality genome assemblies of this species. The genetic diversity between individuals is very high, so my initial idea is to use EDTA + RepeatModeler + Repbase to predict TEs in these thirty genomes and then build a final TE_library after removing redundancy. Would you recommend this strategy? I ask because it may determine the accuracy of the subsequent genotyping with ONT data. In addition, ONT data are known to contain quite a few sequencing errors. Will this affect the accuracy of TE detection later on? Do I need to use second-generation (short-read) data to correct the ONT data, and would that have a big impact on the results?

Sincerely yulong

clemgoub commented 6 months ago

Dear Yulong,

Thank you for your kind words. We are happy to see that you find GraffiTE useful!

Regarding library building, I think there are two important points in your question. The first is the expected genetic diversity between samples, and the second is which combination of tools is most relevant for your use case. On the first point, I think your strategy is sound: run a de novo search on each of the 30 genomes, then cluster across the resulting libraries to remove redundancy. This is important because some low-copy elements may be present in only a single strain. In addition, GraffiTE expects by default up to 5% divergence between a sample and the reference genome (this is a minimap2 parameter), but it can be increased to 10% or 20% using --asm_divergence asm10 or --asm_divergence asm20. Note that we did not test the pipeline with these parameters.

Regarding the tools, from my experience, I would indeed recommend using RepeatModeler2 (with the LTR module) or REPET. EDTA includes the core algorithm of RepeatModeler and, in the hands of many users (and depending on the model organism), may run into classification issues. In any case, all these tools do a good job at building raw libraries, but manual curation is strongly recommended. Also, I found that combining these tools offers only marginal gains. Recently, two downstream tools for RM2, REPET or EDTA have been made available: MCHelper and TEtrimmer. These will help with clustering/diversity reduction, but also with refining consensus sequences and suggesting classifications. Of course, using Repbase consensus sequences relevant to your model is also encouraged. As you noted, redundancy reduction will eventually be key for a solid library.
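As a rough illustration of the very first step of redundancy reduction, here is a minimal Python sketch that merges consensus sequences from several per-genome libraries and drops exact duplicates, including reverse-complement duplicates. It is only an assumption about how one might pre-filter: the function names (`read_fasta`, `merge_exact`) are made up for this example, and it does not replace similarity-based clustering or the curation tools mentioned above, since near-identical consensus sequences are left untouched.

```python
import io

# Translation table for complementing DNA (N is kept as N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def read_fasta(handle):
    """Yield (name, sequence) tuples from an open FASTA text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_exact(records):
    """Keep the first copy of each sequence, ignoring case and strand.

    Two records count as duplicates only if they are identical or exact
    reverse complements of each other (hypothetical pre-filter, not a
    substitute for similarity-based clustering).
    """
    seen = set()
    kept = []
    for name, seq in records:
        seq = seq.upper()
        # Strand-independent key: the smaller of the two orientations.
        key = min(seq, reverse_complement(seq))
        if key in seen:
            continue
        seen.add(key)
        kept.append((name, seq))
    return kept


if __name__ == "__main__":
    # Two toy "libraries"; real input would be the per-genome FASTA files.
    lib_a = io.StringIO(">TE1_genomeA\nACGTT\n>TE2_genomeA\nGGGG\n")
    lib_b = io.StringIO(">TE1_genomeB\nAACGT\n")
    merged = merge_exact(list(read_fasta(lib_a)) + list(read_fasta(lib_b)))
    for name, seq in merged:
        print(f">{name}\n{seq}")
```

In the toy input, `TE1_genomeB` is the reverse complement of `TE1_genomeA`, so only the first copy is kept. The merged output would then still need proper clustering and curation before being used as a library.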

Finally, regarding ONT data: although they are noisy, they seem to work well in our hands. Because the main targets of GraffiTE are rather large insertions/deletions (rather than SNPs and indels), we found that the most critical parameter is read length rather than accuracy (we show this with simulations in humans). For example, ONT reads are often longer than PacBio HiFi reads, because refining the raw PacBio reads produces shorter consensus reads. In addition, GraffiTE asks minimap2 to adjust its algorithm to your read technology in order to optimize the results. So overall, you should expect only a minimal impact on the results.
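Since read length matters more than accuracy here, it can be worth checking the length distribution of your ONT reads before running the pipeline. Below is a minimal Python sketch that computes the read-length N50 from a FASTQ file. It assumes plain four-line FASTQ records (no wrapped sequences), and the helper names are hypothetical and not part of GraffiTE.

```python
import io


def read_lengths_from_fastq(handle):
    """Yield the length of each read in a four-line-per-record FASTQ."""
    for i, line in enumerate(handle):
        # Line 2 of every 4-line record holds the sequence.
        if i % 4 == 1:
            yield len(line.strip())


def n50(lengths):
    """Return the N50: the length L such that reads of length >= L
    contain at least half of all sequenced bases. Returns 0 if empty."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    return 0


if __name__ == "__main__":
    # Toy input; in practice open your own FASTQ file here instead.
    fastq = io.StringIO("@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\nIIII\n")
    lengths = list(read_lengths_from_fastq(fastq))
    print("reads:", len(lengths), "N50:", n50(lengths))
```

Comparing this value across samples gives a quick way to spot libraries with unusually short reads, which is where I would expect detection to suffer most.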

Let us know what you think, and if you find a strategy that works for you, don't hesitate to share it here to help future users!

Cheers,

Clément

abcyulongwang commented 6 months ago

Thank you for your reply. I will follow your suggestions and recommend this software to more people. Sincerely, Yulong

cgroza / GraffiTE

TE library construction and Input file quality control #28