Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

issue with alignAndCallConsensus.pl script #221

Open s-travers opened 11 months ago

s-travers commented 11 months ago

I am following the TE curation guidelines published in the 2021 Storer et al. Current Protocols paper, and I have run into an issue with the 'alignAndCallConsensus.pl' script on a test dataset (copia elements from Drosophila melanogaster). When I run the script interactively, extending the consensus sequence in both directions ('x' option) works fine. However, once I hit one of the TE edges and want to keep extending only the 5' or only the 3' edge (using the '5' or '3' options), the script seems to ignore these options and always extends both edges. I get the same result if I start the extension process with '5' or '3' instead of 'x' (i.e., it still extends in both directions).

Dr. Storer suggested the issue is related to the Hpad length (200). She was able to reproduce it with Hpad lengths from 200 down to 100 nucleotides, whereas an Hpad of 99 nt or less behaved as expected in terms of extension.

Reproduction steps

1) I am running this from a Google Colab notebook, using the Anaconda installation of RepeatModeler.

2) I run 5 iterations in interactive mode, adding 200 bp H-pads, with this command: "alignAndCallConsensus.pl -c copia_con.fa -e copia_elements.fa -int -ma 14 -hp 200"

3) In each of these iterations the script appears to run as it should, adding 200 bp flanks on both ends. I accept the changes using 'x', since I have not run into any ambiguous sequence yet.

4) After iteration 5 the consensus appears to hit some ambiguous sequence on the 3' edge (screenshot attached: 'iteration5.png'), so I try to extend only the 5' edge by entering the '5' option. However, as the screenshot 'iteration6.png' shows, the script still extends both edges and the consensus grows by 400 bp. (Note: I probably tried to stop extending the 3' edge prematurely here, since the ambiguous sequence from the previous iteration gets resolved, but this is just to illustrate the issue.)

5) I enter '5' again for the next iteration, to see whether it responds correctly this time. I get the same result: it extends both edges and adds another 400 bases in total to the consensus ('iteration7.png').
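The length check implied by steps 4 and 5 can be written down as a small sketch. This is not part of RepeatModeler: the function names and the example consensus lengths are hypothetical, and it assumes that each extended edge gains exactly the H-pad length per iteration, as observed in the steps above (200 bp per edge, 400 bp when both edges are extended).

```python
def expected_growth(option: str, hpad: int) -> int:
    """Bases the consensus should gain in one iteration.

    Assumption: 'x' extends both edges, '5' and '3' extend one edge,
    and each extended edge gains exactly `hpad` bases.
    """
    edges_extended = {"x": 2, "5": 1, "3": 1}
    if option not in edges_extended:
        raise ValueError(f"unknown extension option: {option!r}")
    return edges_extended[option] * hpad


def fasta_length(text: str) -> int:
    """Count sequence characters in FASTA text, skipping header lines."""
    return sum(
        len(line.strip())
        for line in text.splitlines()
        if not line.startswith(">")
    )


def check_iteration(option: str, hpad: int, len_before: int, len_after: int):
    """Return (matches_expectation, observed_growth, expected_growth)."""
    observed = len_after - len_before
    expected = expected_growth(option, hpad)
    return observed == expected, observed, expected


# Hypothetical lengths: the '5' option was entered with an H-pad of 200,
# but the consensus grew by 400 bp, as in iterations 6 and 7.
print(check_iteration("5", 200, 5000, 5400))  # (False, 400, 200)
```

Comparing the consensus length before and after each iteration this way gives a quick, screenshot-free indication of whether the '5' or '3' option was honored.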

Let me know if you need any other info. Thanks!

[Attached screenshots: iteration5.png, iteration6.png, iteration7.png]