cgroza / GraffiTE

GraffiTE is a pipeline that finds polymorphic transposable elements in genome assemblies and/or long reads, and genotypes the discovered polymorphisms in read sets using genome-graphs.

max divergence options #10

Closed acontrerasg closed 1 year ago

acontrerasg commented 1 year ago

Hello! Thanks for developing this tool!

I was wondering what the reasoning is behind the maximum of 5% divergence in the first step, the pseudo-alignment using minimap2, especially as minimap2 allows higher divergence thresholds with asm5/asm10/asm20:

https://manpages.ubuntu.com/manpages/kinetic/en/man1/minimap2.1.html

Given that the TE family rule is the infamous 80/80/80, I could see asm20 working better for divergent populations? Although same TE family != same allele.

Thanks!

clemgoub commented 1 year ago

Hello! Thanks for reaching out and your kind words!

The rationale is that we initially designed the tool to focus on segregating TE insertions originating from active TE families. We developed the pipeline using human and Drosophila, where my assumption is that most genome-to-genome comparisons between individuals should be below 5% divergence. However, I may be wrong, and this may not hold in other models. We can easily add this option, though we haven't yet tested these parameters. I will mark it as a to-do, and you can expect to see it implemented shortly!

Regarding the influence on the mighty "80/80/80" rule, I don't think this would change things: relaxing the divergence for genome-to-genome alignments will produce more contiguous alignments, but a polymorphic TE insertion will always create either an insertion or a deletion relative to the reference genome. We assign the TE family of the inserted or deleted sequence from RepeatMasker, which has no divergence limit (except the blastn limit), not from the minimap2 alignment. However, relaxing the minimap2 divergence parameter may better align the most divergent regions of the genomes and may improve SV detection in these areas. Then, if we find an SV, its sequence will go into RepeatMasker to be annotated, regardless of the divergence between the TE and its consensus.
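
For readers unfamiliar with the "80/80/80" rule, here is a minimal sketch, assuming its commonly cited formulation: at least 80% identity, over at least 80% of the sequence, with at least 80 bp aligned. The function name, its arguments, and the choice of the element length as the coverage denominator are hypothetical illustrations; this is not code from GraffiTE or RepeatMasker.

```python
def same_family_80_80_80(identity_pct: float, aligned_len: int, element_len: int) -> bool:
    """Return True if a hit passes the 80/80/80 rule of thumb.

    identity_pct: percent identity of the aligned segment (0-100)
    aligned_len:  length of the aligned segment in bp
    element_len:  length in bp of the sequence being classified
                  (assumed here to be the coverage denominator)
    """
    if element_len <= 0:
        raise ValueError("element_len must be positive")
    coverage_pct = 100.0 * aligned_len / element_len
    return identity_pct >= 80.0 and coverage_pct >= 80.0 and aligned_len >= 80


# A 1 kb insertion aligned over 900 bp at 85% identity passes all three criteria.
print(same_family_80_80_80(85.0, 900, 1000))
# A 60 bp hit fails the minimum-length criterion even at high identity and coverage.
print(same_family_80_80_80(92.0, 60, 70))
```

Note that none of the three criteria depends on the genome-to-genome alignment preset, which is consistent with the point above.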

Let us know what you think. We are always interested to hear about users' models and their particular biology to refine our tools.

Best,

Clément

acontrerasg commented 1 year ago

Thanks for the quick reply!

Yes, I am curious to test how the divergence parameter will affect the initial SV calling, which feeds the post-RepeatMasker selection. Thanks for including it as a future option!

I guess for now I can manually change the main.nf minimap parameter and test how it affects the SVs called. I see now why this divergence setting will not affect the RM filtering step, thanks for the clarification of how the tool works.

I am currently working with a geographically diverse set of plant genomes, so experimenting a bit with the parameters seems appropriate.

Best, Adrián

cgroza commented 1 year ago

We can parameterize this in the pipeline.

cgroza commented 1 year ago

This is now added to the sniffles branch. It will be merged into main eventually.

clemgoub commented 1 year ago

This feature has now been merged to main! Use --asm_divergence <asm5/asm10/asm20>

-- thanks @cgroza!
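
For anyone who wants to compare the three presets outside the pipeline, the sketch below builds one minimap2 command line per preset. It assumes only standard minimap2 options (`-x` for the preset, `-a` for SAM output, `-t` for threads). The helper name and the file names `reference.fa` and `assembly.fa` are placeholders, and the actual minimap2 call inside GraffiTE's `main.nf` may use different or additional options.

```python
import shlex

# The three values accepted by --asm_divergence in this thread.
PRESETS = ("asm5", "asm10", "asm20")


def minimap2_command(preset, reference, assembly, threads=4):
    """Build an argument list for an assembly-to-reference minimap2 run.

    Hypothetical helper for illustration; it only assembles the command
    and does not run minimap2.
    """
    if preset not in PRESETS:
        raise ValueError(f"preset must be one of {PRESETS}, got {preset!r}")
    return ["minimap2", "-a", "-x", preset, "-t", str(threads), reference, assembly]


# Print one command per preset so the resulting alignments can be compared.
for preset in PRESETS:
    print(shlex.join(minimap2_command(preset, "reference.fa", "assembly.fa")))
```

Each printed command can be run in a shell with its output redirected to a separate SAM file, so that the SVs called under each preset can be compared.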