Closed adeoliveira86 closed 9 months ago
Hi,
For the easy answer: the tree is only used to order the columns in the output, which is a user convenience thing I remember adding so it was easier to look at the results, but won't affect the results themselves.
Regarding an output, unfortunately, this is old enough that I do not have outputs on hand. I would really recommend creating test files and running the code with them (will also allow you to see if everything runs right). For example I would use 100 lines from the repeat masker .out file you have, and provide 2 genomes (the same one as the RM .out file and another one), or run it with -filter
set to one repeat only, to limit the number of repeats that are blasted. Nothing really replaces looking at the data on your own for a small sample size, with data you know, before running the whole thing!
The code will take copy+flanking sequences when the flanking are not a TE in the RM.out file (this is the first caveat: this requires well annotated genomes as input), then blast that. The output will have columns for each provided genomes, with counts of extracted copies, then the number of total hits, and then hits with the flanking sequences present (when the flanking sequences are present, and you know the input was not nested in some repeats, then this is in favor of this copy being orthologous between the genomes). Note that the results are rarely clear cut; you'd expect false negatives and false positives.
The code is blast based. It uses blast defaults, and the cutoff evalue is set by the user with -e
. Lowering it will speed things up but it depends what divergence you expect between your most diverged genomes.
Hello @4ureliek,
thanks for the swift answer. I will follow your instructions and try to narrow down the analyses to a couple of TE families which interest me the most. I am afraid computing all elements will be extremely time-consuming.
A short additional question. Is there any available scripts designed to plot the results originated by parseRM (including the repeat landscape). I have written a few Rscripts to parse the divsum files generated by calcDivergenceFromAlignment.pl and I guess I can adapt those for this purpose.
Cheers, André
Yes you are correct, it will likely take a while even with parallelization, since blast is just not that fast, and the code not optimized (e.g. if blast jobs are too large, splitting query and DBs into chunks speeds things up massively).
I have not written code to plot automatically the repeat landscapes as far as I can remember - I would need to go dig into my archives, but for these landscapes I made the paper figures with Illustrator, not R (https://nyaspubs.onlinelibrary.wiley.com/doi/abs/10.1111/nyas.13295)
I will take the "splitting" into consideration. Let's see what will come from my analyses, but at any case, thanks again for your help and scripts.
Oh, I meant inside the code - splitting the inputs will be annoying to do, unless it's with using the filter
option. Let me know how it goes!
Hey @adeoliveira86 - I found an example of output; it was ran with cat, dog and human.
Columns:
#"Total_hits" means total of hits with evalue less than the one set through -e
#"ortho_hits" means POTENTIALLY orthologous elements = hits that contain at least 20% of the extracted flanking sequence
#Acess to individual copies through files in ./orthology.Cat.Dog.Cow.Human/DATA/*.extract/*.out.ortho.tab
For example:
[Query_species] | Target_species#(with tree info): | |||||||
---|---|---|---|---|---|---|---|---|
(hg38 | , | ((canFam3 | , | felCat5) | , | |||
hg38 | hg38 | canFam3 | canFam3 | felCat5 | felCat5 | |||
Rname | [name] | [Nb_extracted_seq] | Total_hits | Ortho_hits | Total_hits | Ortho_hits | Total_hits | Ortho_hits |
Tigger1#DNA/TcMar-Tigger | canFam3 | 2607 | 3173 | 0 | 3348 | 2173 | 1649 | 35 |
In this case you can see that most of the dog copies of Tigger1#DNA/TcMar-Tigger are recovered as potentially orthologous (it is never 100% because of how blast will work); human has none, and cat has so few they are probably false positives.
I used this to check whether TEs looked like dog/cat ones or more ancient than that. You'd have to decide on your cutoffs depending on the 'recovery rate' and how some TEs you know are shared behave (depends on quality of assemblies, divergence between species, etc).
Hello @4ureliek,
thanks for sharing the output. This looks like something that could be extremely useful for the questions I am asking. Do you mind if I get in touch with you via email to discuss a little bit more what I am envisaging and have your feedback?
Sure thing - you can my gmail.com
email, with prefix 4urelie.k
Dear @4ureliek,
Thank you for all the available and resourceful scripts. I have been using parseRM and now I am playing around with TEorthology and it is making my life really easier. I would like some piece of advice of TE orthology before I run a massive analysis with it. My intended goal is to identify how a bunch of TE elements already identified across a invertebrate lineage are evolutionary related to each other. My ultimate goal is to identify the core TE elements (TEs found in all analyses organisms), lineage-specific ones, and also some that are shared between between two or more animals. In short words, I would like to generate a pan-repeatome for my clade of interest. I think based on the description of your tool that it would be idea for this kinda of analysis. Then I would use this to reannotate the genomes.
I would like to know if I it is possible to see the output of the software to check if this would fit my goals before investing a lot of time into running the software. Since this is virtually a all-against-all comparison and the .out RM files are quite massive, I am assuming this will take a fair amount of time. Additionally, I have a question about the parameter tree used by the software. Is this tree used as a backbone to infer presence/absence in the species I am analysing? Running with and without it changes much the output of running time?
Best regards and thanks again for the scripts, André