Open morganm804 opened 2 years ago
Sorry for the late reply. You were doing the right thing, although I wonder how many instances you were attempting to align in this example. LINE families can be quite long, and if you have more than 1000 copies, the alignment could take some time and might not be all that productive. A typical run should finish in a few minutes at the most. I am sure you have moved on since this request; however, please let me know if you run into this problem again.
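To check the copy count raised above and, if needed, cap it before aligning, something like the following minimal Python sketch could be used on the FASTA file of family instances (the file given to `-e` in the command quoted in this issue). The helper names `read_fasta`, `subsample`, and `write_fasta` are hypothetical, and random subsampling is only an assumption about one reasonable way to reduce the input, not a documented recommendation of the tool. The default cap of 1000 simply mirrors the rough figure mentioned in the reply.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subsample(records, max_copies=1000, seed=0):
    """Return at most max_copies records, picked at random but reproducibly.

    The cap of 1000 is an assumption taken from the rough figure in the reply.
    """
    records = list(records)
    if len(records) <= max_copies:
        return records
    return random.Random(seed).sample(records, max_copies)


def write_fasta(records, handle, width=60):
    """Write (header, sequence) pairs to an open text handle, wrapping lines."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
```

In practice, one would open the instances file, pass the handle to `read_fasta`, look at the number of records, and write the output of `subsample` to a new file to use in place of the original.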
I am trying to curate RepeatModeler output using the guidelines in this paper (https://currentprotocols.onlinelibrary.wiley.com/doi/full/10.1002/cpz1.154). In particular, I am focused on curating LINE elements in a newly assembled corn snake genome to ensure that the consensus is fully extended to the bounds of the TE (we are looking at recombination events in snakes, and snakes have had pretty recent activity in their LINE elements, so it is important that we identify LINE regions in our genome as accurately as possible).
So far, I have run RepeatModeler on the corn snake genome and then run RepeatMasker with those results. I am now trying to curate the LINE element that has the most hits in the RepeatMasker run. I have a file with the consensus sequence of this family generated by RepeatModeler (head of file):
>ltr-1_family-4#LINE/CR1 [ Type=LTR, Final Multiple Alignment Size = 5 ]
ATGCTTTCAGTCTGCTGAGCTACCAGGCCTGTTGCCACAAAGAAGAGCGG
GTCAACTTATTTTCCAAACCACCAGAAGGGCAGGCCATGAAACAATGGAT
GGATGGAAACTAATTGAGGAGAGAAGCAACCTGGAATTAAGGAGAAACTT
CCTAACAGTGAGGACAATTTACCAGTGGAACGGCTTGCCATGAGNAGNTG
TGGGCGCTCCATCACTGGAGGCTTTTAAGAAGAGACTGGACAGCCNCCCG
TCTGAAACGGTACAGGNTCTCCTGCTTGAGCGGGGGGCTGGACTAGAAGA
CCTCCAAGGTCCCTTCCGNCTCGTCCATTCTGTANCACACACGCACCCCC
ACAGATGGCCCGGAGTTAATAAGCCACACTACAAAACTCTTTGAAGATAA
AGCTAGCAGCAACCCAGTGGCTAGCTGCCAATTCAGACTTTACTCACACA
and a file with all instances of this family from the RepeatMasker output (head of file):
>Super_scaffold_1:24374-24588
ctgcatttggactaatccttgtattgcggaaactttgcctgctttatcggaatgcttgcagtctaatctttgttttgtgtgagtaaagtctgaattggcagctagccactgggttgctgccagctttatcttcaaagattttgtcgtgtggcttattaactctgggccgtctgtgggggtgcgtgtgtgggacggggacaaaacaggtccttggg
>Super_scaffold_1:27356-27776
catgatggcgaacctatggcatgcgtgccacaggtggcacgcggagccatatcagtaggcacgcaagctcagctctggcacacatgcgcgcaccagccagctgattttcaggcctttcaggcccactggaagtcggcaaacaggctatttccggccttcggagagcctctagggagctggagaaggtcattttcgccctccccaggctcctagaaaggctctggagcctggggagggcgaaaaacgggcctaccggggccaccatgccatcgcgtgccaaaagtggggggagtgcagggggggcggtcacgcacacatgcacggggtgcattgaattatgggtgtgggcacacacccaagcgaccccgctgcgctcctcccgcttttggcacgtgatggcaaaaaggttagccatcactgt
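Because both the number of copies and their lengths affect how long an alignment takes, a quick summary of an instances file like the one excerpted above may be informative before running the aligner. The following is a minimal Python sketch under stated assumptions: the function names `fasta_lengths` and `summarize` are hypothetical, headers are assumed to be unique, and lengths are computed from the sequence lines themselves rather than from the coordinates in the headers.

```python
from statistics import median


def fasta_lengths(lines):
    """Map each FASTA header (without '>') to its total sequence length.

    Assumes headers are unique; a repeated header would overwrite the
    earlier entry.
    """
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            current = line[1:]
            lengths[current] = 0
        elif line and current is not None:
            lengths[current] += len(line)
    return lengths


def summarize(lengths):
    """Return the copy count and the min, median, and max copy length."""
    values = sorted(lengths.values())
    if not values:
        return {"copies": 0}
    return {
        "copies": len(values),
        "min": values[0],
        "median": median(values),
        "max": values[-1],
    }
```

Passing an open file handle for the instances file to `fasta_lengths` and the result to `summarize` gives the copy count alongside the length range, which can be compared with the figures mentioned in the reply.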
I ran
alignAndCallConsensus.pl -c family4_con.fa -e family4_elements.fa -int
in the TETools Singularity container, and the program is taking over an hour to generate a suggested refinement for the first iteration. When I ran this program with the example files provided in the paper, everything worked perfectly.
So, to my questions:
1) Am I understanding the necessity of curation correctly here?
2) Why might alignAndCallConsensus be taking so long on my LINE family?
3) Are there any special characteristics of LINE families that I might be missing that are relevant here?
I am quite new to TE biology and identification, so I hope I am not missing anything obvious here!
Thank you in advance!