Which tool was used to identify repeats and tandem repeats in the genome?

MariaNattestad / Assemblytics

Assemblytics is a bioinformatics tool to detect and analyze structural variants from a genome assembly by comparing it to a reference genome.

http://assemblytics.com

MIT License


Which tool was used to identify repeats and tandem repeats in the genome? #28

Closed AyushSaxena closed 4 years ago

AyushSaxena commented 4 years ago

Thank you for this great tool. I am trying to interpret the results we have for the 'unique' part of the genome versus the repeats. I couldn't find any information on how repeats in the genome are called in the manuscript, the supplementary material, or the issues here (unless I missed it). Could you please clarify that?

Ayush

MariaNattestad commented 4 years ago

Hi Ayush

We didn't use another tool to call repeats in Assemblytics. The signal is already in the alignments themselves. When a sequence in the query matches more than one sequence in the reference, those 2+ sequences in the reference must be very similar to each other, meaning they are not unique and are therefore labeled as repeats. The supplementary material has examples of these with diagrams. I hope that clears it up!
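The idea of reading repeats directly off the alignments can be illustrated with a minimal sketch. This is not Assemblytics' actual code: the function name `repeated_query_regions` and the simplified input of `(query_name, query_start, query_end)` tuples with half-open coordinates are assumptions made for illustration. It reports the stretches of each query sequence that are covered by two or more alignments, which is the signal described above.

```python
from collections import defaultdict


def repeated_query_regions(alignments):
    """Return, per query, the intervals covered by two or more alignments.

    `alignments` is an iterable of (query_name, query_start, query_end)
    tuples using half-open coordinates [start, end). This input format is
    a simplification chosen for this sketch, not a real file format.
    """
    events = defaultdict(list)
    for name, start, end in alignments:
        events[name].append((start, 1))   # an alignment begins here
        events[name].append((end, -1))    # an alignment ends here

    repeats = {}
    for name, evs in events.items():
        # At equal positions, -1 sorts before +1, so intervals that merely
        # touch are not counted as overlapping.
        evs.sort()
        depth = 0
        region_start = None
        regions = []
        for pos, delta in evs:
            previous = depth
            depth += delta
            if previous < 2 <= depth:
                region_start = pos        # coverage just reached 2
            elif depth < 2 <= previous:
                if pos > region_start:
                    regions.append((region_start, pos))
                region_start = None       # coverage dropped below 2
        repeats[name] = regions
    return repeats


example = [
    ("contig1", 0, 1000),
    ("contig1", 800, 1500),
    ("contig1", 2000, 2500),
]
print(repeated_query_regions(example))
```

In this example, only bases 800 to 1000 of `contig1` are covered by more than one alignment, so only that stretch would be treated as non-unique.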

AyushSaxena commented 4 years ago

Hi Maria,

Thank you for your quick response. This does clarify what I was asking. I have a related question (I also tried looking for this specific information in MUMmer's manual): what is the smallest sequence that can be termed "not unique" in a contig? In my understanding, nucmer reports all changes, no matter the size.

I have some interesting 'unique vs. repeat' findings in wild worms, and I am just trying to make sure that what I explain to the audience is what the tool actually does, not just my interpretation of it!

Thank you,
Ayush

MariaNattestad commented 4 years ago

I think everything is a matter of which parameters you set in these tools.

When you run MUMmer, you might be using: nucmer -maxmatch -l 100 -c 500 ... Here -l is "Minimum length of an maximal exact match" and -c is "Minimum cluster length", so these are how you tell nucmer how much matching sequence you consider enough to trust an alignment. In Assemblytics, you can set the "unique sequence length required".
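To make the effect of a "unique sequence length required" threshold concrete, here is a hedged sketch of one way such a filter could work. The helper names (`unique_length`, `filter_by_unique_length`), the tuple-based input, and the threshold value in the example are all assumptions for illustration; the real Assemblytics implementation may differ. An alignment is kept only if enough of its query span is not shared with any other alignment on the same query.

```python
def unique_length(target, others):
    """Count bases of `target` (start, end) not covered by any interval in `others`.

    All intervals are half-open [start, end) on the same query sequence.
    """
    start, end = target
    # Clip overlapping intervals to the target span and sort them by start.
    clipped = sorted(
        (max(s, start), min(e, end))
        for s, e in others
        if s < end and e > start
    )
    covered = 0
    cursor = start  # everything left of the cursor is already accounted for
    for s, e in clipped:
        if e > cursor:
            covered += e - max(s, cursor)
            cursor = e
    return (end - start) - covered


def filter_by_unique_length(alignments, unique_length_required):
    """Keep alignments whose unique query span meets the threshold.

    `alignments` is a list of (query_name, query_start, query_end) tuples.
    `unique_length_required` is a hypothetical stand-in for the
    Assemblytics parameter of the same idea.
    """
    kept = []
    for i, (name, start, end) in enumerate(alignments):
        others = [
            (s, e)
            for j, (n, s, e) in enumerate(alignments)
            if j != i and n == name
        ]
        if unique_length((start, end), others) >= unique_length_required:
            kept.append((name, start, end))
    return kept


example = [
    ("contig1", 0, 1000),    # 800 unique bases
    ("contig1", 800, 1500),  # 500 unique bases
    ("contig1", 900, 1000),  # 0 unique bases
]
print(filter_by_unique_length(example, unique_length_required=600))
```

Raising or lowering the threshold changes which alignments survive, which is why the smallest sequence that counts as "not unique" depends on the parameters chosen.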

While the defaults are generally fine, setting these parameters is a matter of experimentation and judgment, based on what you know about your genome's size and repeat content. In other words, the smallest "not unique" sequence is up to your own judgment. If you want a more direct way to get at smaller repeats, you can also use tools that explicitly analyze the repeat content of genomes (like RepeatMasker) or count k-mers. Best of luck with your research!
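As a rough illustration of the k-mer counting approach, the sketch below lists every k-mer that occurs more than once in a sequence. The function name `repeated_kmers` is hypothetical, and the sketch makes simplifying assumptions: it works on a single in-memory string and ignores reverse complements, so for real genomes a dedicated k-mer counter would be the practical choice.

```python
from collections import Counter


def repeated_kmers(sequence, k):
    """Return a dict of k-mers that occur more than once, with their counts.

    Simplification for this sketch: only the forward strand is counted.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n > 1}


print(repeated_kmers("ACGTACGTTT", 4))
```

The choice of `k` plays the same role as the alignment parameters above: a smaller `k` flags shorter repeated stretches, at the cost of more chance matches.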